A majority of the tests in /pkg/cri validate multiple things per test
(generally spec or options validations). This flow lends itself well to
*testing.T's Run method, which runs each case as a subtest so that
`go test` output can actually show which subtest failed or passed.
Some of the tests in the pkg/cri packages already did this, but a bunch
simply logged which sub-testcase was currently running without invoking
t.Run.
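The pattern in question looks roughly like this (a minimal sketch; the table
entries and the validateSpec helper are invented for illustration, not taken
from pkg/cri):
```go
// Illustrative only: a table-driven test that uses t.Run so each case shows
// up individually in `go test` output.
package cri_test

import (
	"errors"
	"testing"
)

func validateSpec(in string) error {
	if in == "" {
		return errors.New("spec must not be empty")
	}
	return nil
}

func TestSpecValidation(t *testing.T) {
	testCases := []struct {
		name    string
		input   string
		wantErr bool
	}{
		{name: "ValidSpec", input: "ok", wantErr: false},
		{name: "MissingField", input: "", wantErr: true},
	}
	for _, tc := range testCases {
		tc := tc
		t.Run(tc.name, func(t *testing.T) {
			err := validateSpec(tc.input)
			if (err != nil) != tc.wantErr {
				t.Fatalf("validateSpec(%q) error = %v, wantErr %v", tc.input, err, tc.wantErr)
			}
		})
	}
}
```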
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
HostProcess containers require every container in the pod to be a
host process container and to have the corresponding field set. The Kubelet
usually enforces this, so we'd error out before even getting here, but we
recently found a bug in that logic, so better to be safe than sorry.
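A minimal sketch of the added safety check, with simplified stand-in types
rather than the real CRI/hcsshim structures:
```go
// Sketch: all containers in a Windows pod must agree with the sandbox on
// whether they are HostProcess containers. Types are simplified stand-ins.
package hostprocess

import "fmt"

type windowsContainer struct {
	Name        string
	HostProcess bool
}

func validateHostProcess(sandboxHostProcess bool, containers []windowsContainer) error {
	for _, c := range containers {
		if c.HostProcess != sandboxHostProcess {
			return fmt.Errorf("container %q: HostProcess=%t does not match the pod's HostProcess=%t",
				c.Name, c.HostProcess, sandboxHostProcess)
		}
	}
	return nil
}
```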
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
Schema 1 has been substantially deprecated since circa 2017 in favor of Schema 2, introduced in Docker 1.10 (Feb 2016),
and its successor, OCI Image Spec v1, but we have not officially deprecated Schema 1.
One of the reasons was that Quay did not support Schema 2, but Quay has reportedly
supported Schema 2 since Feb 2020 (moby/buildkit issue 409).
This PR deprecates pulling Schema 1 images, but the feature will not be removed before containerd 2.0.
Pushing Schema 1 images was never implemented in containerd (or its consumers such as BuildKit).
Docker/Moby already disabled pushing Schema 1 images in Docker 20.10 (moby/moby PR 41295),
but Docker/Moby has not yet disabled pulling Schema 1 because containerd had not yet deprecated it.
(See the comments in moby/moby PR 42300.)
Docker/Moby is expected to disable pulling Schema 1 images in the future, after this deprecation.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
This patch adds support for a container annotation and two separate
pod annotations for controlling the blockio class of containers.
The container annotation can be used by a CRI client:
"io.kubernetes.cri.blockio-class"
Pod annotations specify the blockio class at the K8s pod spec level:
"blockio.resources.beta.kubernetes.io/pod"
(pod-wide default for all containers within)
"blockio.resources.beta.kubernetes.io/container.<container_name>"
(container-specific overrides)
Correspondingly, this patch adds support for --blockio-class and
--blockio-config-file to ctr, too.
This implementation follows the resource class annotation pattern
introduced for RDT and merged in commit 893701220.
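The lookup order these annotations imply is roughly the following (a sketch;
the helper name is invented here, and the real implementation delegates the
parsing to goresctrl):
```go
// Sketch of blockio class resolution: the CRI container annotation wins,
// then the container-specific pod annotation, then the pod-wide default.
package blockio

func blockioClass(containerName string, containerAnnotations, podAnnotations map[string]string) string {
	if class, ok := containerAnnotations["io.kubernetes.cri.blockio-class"]; ok {
		return class
	}
	if class, ok := podAnnotations["blockio.resources.beta.kubernetes.io/container."+containerName]; ok {
		return class
	}
	return podAnnotations["blockio.resources.beta.kubernetes.io/pod"]
}
```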
Signed-off-by: Antti Kervinen <antti.kervinen@intel.com>
Kubelet sends the PullImage request without a timeout, because the image size
is unknown and a timeout is hard to define. The pull might drop to 0 B/s
if containerd can't receive any packets on that connection.
In this case, containerd should cancel the PullImage request.
Although containerd provides an ingest manager to track the progress of a pull
request (for example, `ctr image pull` shows a console progress bar), it needs
more CPU resources to open and read the ingested files to get the status.
In order to support the progress timeout feature with lower overhead, this
patch uses an http.RoundTripper wrapper to track active progress. The
wrapper increases the active-request count and returns a
countingReadCloser wrapper for http.Response.Body. Each byte read is
counted, and the active-request count is decreased when the
countingReadCloser wrapper is closed. The progress tracker can then
check the active-request count and bytes read at intervals. If
there is no progress, the progress tracker should cancel the
request.
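A minimal sketch of that wrapper (type and field names here are illustrative,
not the exact ones in the patch):
```go
// Sketch: wrap an http.RoundTripper so every response body is counted.
// The progress tracker polls activeReqs/totalBytes; if neither changes
// between polls while activeReqs > 0, it can cancel the pull context.
package tracking

import (
	"io"
	"net/http"
	"sync/atomic"
)

type countingTransport struct {
	inner      http.RoundTripper
	activeReqs int64
	totalBytes int64
}

func (t *countingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := t.inner.RoundTrip(req)
	if err != nil {
		return nil, err
	}
	atomic.AddInt64(&t.activeReqs, 1)
	resp.Body = &countingReadCloser{rc: resp.Body, t: t}
	return resp, nil
}

type countingReadCloser struct {
	rc io.ReadCloser
	t  *countingTransport
}

func (c *countingReadCloser) Read(p []byte) (int, error) {
	n, err := c.rc.Read(p)
	atomic.AddInt64(&c.t.totalBytes, int64(n))
	return n, err
}

func (c *countingReadCloser) Close() error {
	atomic.AddInt64(&c.t.activeReqs, -1)
	return c.rc.Close()
}
```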
NOTE: For each blob, containerd makes sure that the content writer is
opened before sending the HTTP request to the registry. Therefore, the
progress reporter can rely on the active-request count.
Fixes: #4984
Signed-off-by: Wei Fu <fuweid89@gmail.com>
This commit migrates containerd/protobuf from github.com/gogo/protobuf
to google.golang.org/protobuf and adjusts types. Proto-generated structs
cannot be passed as values.
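To illustrate the point with a generic example (not containerd code): messages
generated by google.golang.org/protobuf carry internal state that must not be
copied, so `go vet` flags pass-by-value and code has to hand around pointers
instead:
```go
// Generic illustration: pass generated messages by pointer. A signature like
// func(d durationpb.Duration) would copy the message's internal state and is
// reported by `go vet` (copylocks).
package main

import (
	"fmt"
	"time"

	"google.golang.org/protobuf/types/known/durationpb"
)

func describe(d *durationpb.Duration) string {
	return fmt.Sprintf("duration: %v", d.AsDuration())
}

func main() {
	d := durationpb.New(1500 * time.Millisecond)
	fmt.Println(describe(d))
}
```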
Fixes #6564.
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
Parallelizing them decreases loading duration.
Time to complete recover():
* Without competing IOs + without opt: 21s
* Without competing IOs + with opt: 14s
* Competing IOs + without opt: 3m44s
* Competing IOs + with opt: 33s
Signed-off-by: Eric Lin <linxiulei@gmail.com>
Extract the names of requested CDI devices and update the OCI
Spec according to the corresponding CDI device specifications.
CDI devices are requested using container annotations in the
cdi.k8s.io namespace. Once CRI gains dedicated fields for CDI
injection, the snippet for extracting CDI names will need an
update.
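The extraction itself boils down to something like this (a simplified sketch;
the function name and error handling are not the actual implementation):
```go
// Sketch: collect fully-qualified CDI device names from container
// annotations in the cdi.k8s.io namespace. A value may list several
// devices separated by commas. Simplified for illustration.
package cdidevices

import "strings"

const cdiAnnotationPrefix = "cdi.k8s.io/"

func devicesFromAnnotations(annotations map[string]string) []string {
	var devices []string
	for key, value := range annotations {
		if !strings.HasPrefix(key, cdiAnnotationPrefix) {
			continue
		}
		for _, dev := range strings.Split(value, ",") {
			if dev = strings.TrimSpace(dev); dev != "" {
				devices = append(devices, dev)
			}
		}
	}
	return devices
}
```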
Signed-off-by: Ed Bartosh <eduard.bartosh@intel.com>
Background:
With the current design, the content backend uses a key-lock for long-lived
write transactions. If a content reference has been marked for a write
transaction, other requests on the same reference fail fast with an
unavailable error. Since the metadata plugin is based on boltdb, which
only supports a single writer, the content backend can't block or hold
the request for too long. It requires the client to handle retries by itself,
like the OpenWriter backoff retry helper. But the maximum retry interval
can be up to 2 seconds. If there are several concurrent requests for the
same image, the waiters may wake up at the same time while only one of
them can continue. A lot of waiters go back to sleep, so it takes a long
time to finish all the pulling jobs, and it gets worse if the image
has many more layers, as mentioned in issue #4937.
After fetching, the containerd.Pull API allows several handlers to commit
the same ChainID snapshot, but only one can succeed. Since unpacking a
tar.gz is a time-consuming job, unpacking the same ChainID snapshot in
parallel hurts performance.
For instance, Request 2 doesn't need to prepare and commit; it
should just wait for Request 1 to finish, as mentioned in pull
request #6318.
```text
Request 1                Request 2
Prepare
   |
   |
   |
   |                     Prepare
Commit                      |
                            |
                            |
                            |
                         Commit (failed on exist)
```
Both the content backoff retry and the unnecessary unpack hurt performance.
Solution:
Introduce duplicate suppression in the fetch and unpack contexts. The
duplicate suppression uses a key-mutex and single-waiter-notify to support
singleflight. Callers can use the duplicate suppression in different
PullImage handlers so that we can avoid the unnecessary unpack and the
spin-lock in OpenWriter.
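Conceptually this is the singleflight pattern. The sketch below shows the idea
using golang.org/x/sync/singleflight; the actual patch implements its own
key-mutex and single-waiter-notify instead of importing that package:
```go
// Sketch: deduplicate concurrent unpacks of the same ChainID. All callers
// asking for the same key share one in-flight unpack and its result.
package unpackdedup

import (
	"context"

	"golang.org/x/sync/singleflight"
)

var unpackGroup singleflight.Group

// unpackOnce runs unpack at most once per chainID at a time; concurrent
// callers with the same chainID wait for the first call's result.
func unpackOnce(ctx context.Context, chainID string, unpack func(context.Context) error) error {
	_, err, _ := unpackGroup.Do(chainID, func() (interface{}, error) {
		return nil, unpack(ctx)
	})
	return err
}
```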
Test Result:
Before enhancement:
```bash
➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20
crictl pull localhost:5000/redis:latest (x20) takes ...
real 1m6.172s
user 0m0.268s
sys 0m0.193s
docker pull localhost:5000/redis:latest (x20) takes ...
real 0m1.324s
user 0m0.441s
sys 0m0.316s
➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20
crictl pull localhost:5000/golang:latest (x20) takes ...
real 1m47.657s
user 0m0.284s
sys 0m0.224s
docker pull localhost:5000/golang:latest (x20) takes ...
real 0m6.381s
user 0m0.488s
sys 0m0.358s
```
With this enhancement:
```bash
➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20
crictl pull localhost:5000/redis:latest (x20) takes ...
real 0m1.140s
user 0m0.243s
sys 0m0.178s
docker pull localhost:5000/redis:latest (x20) takes ...
real 0m1.239s
user 0m0.463s
sys 0m0.275s
➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20
crictl pull localhost:5000/golang:latest (x20) takes ...
real 0m5.546s
user 0m0.217s
sys 0m0.219s
docker pull localhost:5000/golang:latest (x20) takes ...
real 0m6.090s
user 0m0.501s
sys 0m0.331s
```
Test Script:
localhost:5000/{redis|golang}:latest is equal to
docker.io/library/{redis|golang}:latest. The image is held in a local registry
started by `docker run -d -p 5000:5000 --name registry registry:2`.
```bash
image_name="${1}"
pull_times="${2:-10}"
cleanup() {
    ctr image rmi "${image_name}"
    ctr -n k8s.io image rmi "${image_name}"
    crictl rmi "${image_name}"
    docker rmi "${image_name}"
    sleep 2
}
crictl_testing() {
    for idx in $(seq 1 ${pull_times}); do
        crictl pull "${image_name}" > /dev/null 2>&1 &
    done
    wait
}
docker_testing() {
    for idx in $(seq 1 ${pull_times}); do
        docker pull "${image_name}" > /dev/null 2>&1 &
    done
    wait
}
cleanup > /dev/null 2>&1
echo 3 > /proc/sys/vm/drop_caches
sleep 3
echo "crictl pull $image_name (x${pull_times}) takes ..."
time crictl_testing
echo
echo 3 > /proc/sys/vm/drop_caches
sleep 3
echo "docker pull $image_name (x${pull_times}) takes ..."
time docker_testing
```
Fixes: #4937
Closes: #4985
Closes: #6318
Signed-off-by: Wei Fu <fuweid89@gmail.com>
This commit upgrades github.com/containerd/typeurl to use typeurl.Any.
The interface hides gogo/protobuf/types.Any from containerd's Go client.
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
The linter on platforms that have a hardcoded response complains about
"if xyz == nil" checks; ignore those.
Signed-off-by: Phil Estes <estesp@amazon.com>
There are two mappings of hostpath to IDType and ID in the wild:
- dockershim and dockerd-cri (implicitly via docker) use class/ID
-- The only supported IDType in Docker is 'class'.
-- https://github.com/aarnaud/k8s-directx-device-plugin generates this form
- https://github.com/jterry75/cri (windows_port branch) uses IDType://ID
-- hcsshim's CRI test suite generates this form
`://` is much more easily distinguishable, so I've gone with that one as
the generic separator, with `class/` as a special case.
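The resulting parsing looks roughly like this (a sketch; the function name and
exact error wording are illustrative):
```go
// Sketch: split a Windows device hostpath into IDType and ID.
// "IDType://ID" is the generic form; "class/GUID" is accepted as a
// special case for compatibility with dockershim/dockerd-cri.
package devices

import (
	"fmt"
	"strings"
)

func parseHostDevice(hostPath string) (idType, id string, err error) {
	if strings.HasPrefix(hostPath, "class/") {
		return "class", strings.TrimPrefix(hostPath, "class/"), nil
	}
	parts := strings.SplitN(hostPath, "://", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", fmt.Errorf("unrecognized device %q: expected IDType://ID or class/ID", hostPath)
	}
	return parts[0], parts[1], nil
}
```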
Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
These unit tests don't check hugetlb. However, with
TolerateMissingHugetlbController set to false, these tests can't
be run on systems without hugetlb (e.g. Debian buildd).
Signed-off-by: Shengjing Zhu <zhsj@debian.org>
We were not properly ignoring errors from
goresctrl's rdt.ContainerClassFromAnnotations(), which caused the config
option to be ineffective in practice.
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
The Linux kernel never sets the Inheritable capability flag to
anything other than empty. Non-empty values are always exclusively
set by userspace code.
[The kernel stopped defaulting this set of capability values to the
full set in 2000 after a privilege escalation with Capabilities
affecting Sendmail and others.]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Enabling this option effectively makes the RDT class of a container a
soft requirement. If RDT support has not been enabled, the RDT class
setting will not have any effect.
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
Use goresctrl for parsing container and pod annotations related to RDT.
In practice, from the users' point of view, this patch adds support for
a container annotation and two separate pod annotations for controlling
the RDT class of containers.
The container annotation can be used by a CRI client:
"io.kubernetes.cri.rdt-class"
Pod annotations specify the RDT class at the K8s pod spec level:
"rdt.resources.beta.kubernetes.io/pod"
(pod-wide default for all containers within)
"rdt.resources.beta.kubernetes.io/container.<container_name>"
(container-specific overrides)
Annotations are intended as an intermediate step before the CRI API
supports RDT.
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
The ability to handle KVM based runtimes with SELinux has been added as
part of d715d00906.
However, that commit introduced some logic to check whether the
"container_kvm_t" label would or would not be present on the system, and while
the intentions were good, there are two major issues with the approach:
1. Inspecting "/etc/selinux/targeted/contexts/customizable_types" is not
the way to go, as it doesn't list "container_kvm_t" at all.
2. There's no need to check for the label: if the label is invalid, an
"Invalid Label" error will be returned, and that's it.
With those two in mind, let's simplify the logic behind setting the
"container_kvm_t" label, removing all the unnecessary code.
Here's the output of a running VMM process, considering:
* The state before this patch:
```
$ containerd --version
containerd github.com/containerd/containerd v1.6.0-beta.3-88-g7fa44fc98 7fa44fc98f
$ kubectl apply -f ~/simple-pod.yaml
pod/nginx created
$ ps -auxZ | grep cloud-hypervisor
system_u:system_r:container_runtime_t:s0 root 609717 4.0 0.5 2987512 83588 ? Sl 08:32 0:00 /usr/bin/cloud-hypervisor --api-socket /run/vc/vm/be9d5cbabf440510d58d89fc8a8e77c27e96ddc99709ecaf5ab94c6b6b0d4c89/clh-api.sock
```
* The state after this patch:
```
$ containerd --version
containerd github.com/containerd/containerd v1.6.0-beta.3-89-ga5f2113c9 a5f2113c9fc15b19b2c364caaedb99c22de4eb32
$ kubectl apply -f ~/simple-pod.yaml
pod/nginx created
$ ps -auxZ | grep cloud-hypervisor
system_u:system_r:container_kvm_t:s0:c638,c999 root 614842 14.0 0.5 2987512 83228 ? Sl 08:40 0:00 /usr/bin/cloud-hypervisor --api-socket /run/vc/vm/f8ff838afdbe0a546f6995fe9b08e0956d0d0cdfe749705d7ce4618695baa68c/clh-api.sock
```
Note, the tests were performed using the following configuration snippet:
```
[plugins]
[plugins.cri]
enable_selinux = true
[plugins.cri.containerd]
[plugins.cri.containerd.runtimes]
[plugins.cri.containerd.runtimes.kata]
runtime_type = "io.containerd.kata.v2"
privileged_without_host_devices = true
```
And using the following pod yaml:
```
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
runtimeClassName: kata
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
```
Fixes: #6371
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
The CRI API has been updated to include an optional `resources` field in the
LinuxPodSandboxConfig, as part of the RunPodSandbox request.
Having sandbox-level resource details at sandbox creation time will have
large benefits for sandboxed runtimes. In the case of Kata Containers,
for example, this'll allow for better support of SW/HW architectures
which don't allow for CPU/memory hotplug, and it'll allow for better
queue sizing for virtio devices associated with the sandbox (in the VM
case).
If this sandbox resource information is provided as part of the run
sandbox request, let's introduce a pattern where we update the
pause container's runtime spec to include this information in the
annotations field.
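A rough sketch of that pattern (the annotation key and the JSON encoding are
assumptions made for illustration, not necessarily what the patch uses):
```go
// Sketch: serialize the sandbox-level resources into an annotation on the
// pause container's OCI spec so a sandboxed runtime (e.g. Kata) can size
// the VM up front. The annotation key here is a placeholder.
package sandboxres

import (
	"encoding/json"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

type sandboxResources struct {
	CPUPeriod   int64 `json:"cpu_period"`
	CPUQuota    int64 `json:"cpu_quota"`
	MemoryBytes int64 `json:"memory_bytes"`
}

func annotateSandboxResources(spec *specs.Spec, res sandboxResources) error {
	data, err := json.Marshal(res)
	if err != nil {
		return err
	}
	if spec.Annotations == nil {
		spec.Annotations = map[string]string{}
	}
	// Placeholder key chosen for this sketch.
	spec.Annotations["io.kubernetes.cri.sandbox-resources"] = string(data)
	return nil
}
```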
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
When containerd uses this config:
```
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."localhost:5000"]
endpoint = ["http://localhost:5000"]
```
Because the `newTransport` function does not initialize the `TLSClientConfig`
field, any later use of `TLSClientConfig` causes a nil pointer dereference.
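A generic illustration of the failure mode and the fix (`newTransport` here is
a simplified stand-in, not the actual containerd function):
```go
// Sketch: if an http.Transport is created without TLSClientConfig, code that
// later dereferences transport.TLSClientConfig (for example to toggle
// InsecureSkipVerify) panics with a nil pointer dereference. Initializing
// the field up front avoids that.
package registrytransport

import (
	"crypto/tls"
	"net/http"
)

func newTransport() *http.Transport {
	return &http.Transport{
		// Without this line, transport.TLSClientConfig.InsecureSkipVerify = true
		// elsewhere in the code would panic.
		TLSClientConfig: &tls.Config{},
	}
}
```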
Signed-off-by: wanglei <wllenyj@linux.alibaba.com>