Remove nolint-comments that weren't hit by linters, and remove the "structcheck"
and "varcheck" linters, as they have been deprecated:
WARN [runner] The linter 'structcheck' is deprecated (since v1.49.0) due to: The owner seems to have abandoned the linter. Replaced by unused.
WARN [runner] The linter 'varcheck' is deprecated (since v1.49.0) due to: The owner seems to have abandoned the linter. Replaced by unused.
WARN [linters context] structcheck is disabled because of generics. You can track the evolution of the generics support by following the https://github.com/golangci/golangci-lint/issues/2649.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
- fix "nolint" comments to be in the correct format (`//nolint:<linters>[,<linter>`
no leading space, required colon (`:`) and linters.
- remove "nolint" comments for errcheck, which is disabled in our config.
- remove "nolint" comments that were no longer needed (nolintlint).
- where known, add a comment describing why a "nolint" was applied.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This `//nolint` was added in f5c7ac9272
to suppress warnings about the `NameToCertificate` function being deprecated:
// Deprecated: NameToCertificate only allows associating a single certificate
// with a given name. Leave that field nil to let the library select the first
// compatible chain from Certificates.
Looking at that, it was deprecated in Go 1.14 through
eb93c684d4
(https://go-review.googlesource.com/c/go/+/205059), which describes:
crypto/tls: select only compatible chains from Certificates
Now that we have a full implementation of the logic to check certificate
compatibility, we can let applications just list multiple chains in
Certificates (for example, an RSA and an ECDSA one) and choose the most
appropriate automatically.
NameToCertificate only maps each name to one chain, so simply deprecate
it, and while at it simplify its implementation by not stripping
trailing dots from the SNI (which is specified not to have any, see RFC
6066, Section 3) and by not supporting multi-level wildcards, which are
not a thing in the WebPKI (and in crypto/x509).
We should at least have a comment describing why we are ignoring this, but preferably
review whether we should still use it.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Currently CDI registry is reconfigured on every
WithCDI call, which is a relatively heavy operation.
This happens because cdi.GetRegistry(cdi.WithSpecDirs(cdiSpecDirs...))
unconditionally reconfigures the registry (clears fs notify watch,
sets up new watch, rescans directories).
Moving configuration to the criService.initPlatform should result
in performing registry configuration only once on the service start.
Signed-off-by: Ed Bartosh <eduard.bartosh@intel.com>
As WithCDI is CRI-only API it makes sense to move it
out of oci module.
This move can also fix possible issues with this API when
CRI plugin is disabled.
Signed-off-by: Ed Bartosh <eduard.bartosh@intel.com>
This makes diff archives to be reproducible.
The value is expected to be passed from CLI applications via the $SOUCE_DATE_EPOCH env var.
See https://reproducible-builds.org/docs/source-date-epoch/
for the $SOURCE_DATE_EPOCH specification.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Rework sandbox monitoring, we should rely on Controller.Wait instead of
CRIService.StartSandboxExitMonitor
Signed-off-by: WangLei <wllenyj@linux.alibaba.com>
pkg/cri/sbserver/cri_fuzzer.go and pkg/cri/server/cri_fuzzer.go were
mostly the same.
This commit merges them together and move the unified fuzzer to
contrib/fuzz again to sort out dependencies. pkg/cri/ shouldn't consume
cmd/.
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
As part of the effort of getting hypervisor isolated windows container
support working for the CRI entrypoint here, add the runhcs-wcow-hypervisor
handler for the default config. This sets the correct SandboxIsolation
value that the Windows shim uses to differentiate process vs. hypervisor
isolation. This change additionally sets the wcow-process runtime to
passthrough io.microsoft.container* annotations and the hypervisor runtime
to accept io.microsoft.virtualmachine* annotations.
Note that for K8s users this runtime handler will need to be configured by
creating the corresponding RuntimeClass resources on the cluster as it's
not the default runtime.
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
It is follow-up of #7254. This commit will increase ReadHeaderTimeout
from 3s to 30m, which prevent from unexpected timeout when the node is
running with high-load. 30 Minutes is longer enough to get close to
before what #7254 changes.
And ideally, we should allow user to configure the streaming server if
the users want this feature.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
`ioutil` has been deprecated by golang. All the code in `ioutil` just
forwards functionality to code in either the `io` or `os` packages.
See https://github.com/golang/go/pull/51961 for more info.
Signed-off-by: Jeff Widman <jeff@jeffwidman.com>
Failpoint is used to control the fail during API call when testing, especially
the API is complicated like CRI-RunPodSandbox. It can help us to test
the unexpected behavior without mock. The control design is based on freebsd
fail(9), but simpler.
REF: https://www.freebsd.org/cgi/man.cgi?query=fail&sektion=9&apropos=0&manpath=FreeBSD%2B10.0-RELEASE
Signed-off-by: Wei Fu <fuweid89@gmail.com>
All of the CRI store related packages all use the standard errdefs
errors now for if a key doesn't or already exists (ErrAlreadyExists,
ErrNotFound), but the comments for the methods still referenced
some unused package specific error definitions. This change just
updates the comments to reflect what errors are actually returned
and adds comments for some previously undocumented exported functions.
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
All containers except the pause container, mount `/dev/shm" with flags
`nosuid,nodev,noexec`. So change mount options for pause container to
keep consistence.
This also helps to solve issues of failing to mount `/dev/shm` when
pod/container level user namespace is enabled.
Fixes: #6911
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Signed-off-by: Lei Wang <wllenyj@linux.alibaba.com>
This's an optimization to get rid of redundant `/dev/shm" mounts for pause container.
In `oci.defaultMounts`, there is a default `/dev/shm` mount which is redundant for
pause container.
Fixes: #6911
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Signed-off-by: Lei Wang <wllenyj@linux.alibaba.com>
The TestPodAnnotationPassthroughContainerSpec test and the
TestContainerAnnotationPassthroughContainerSpec test both depend on a
platform-specific implementation of criService.containerSpec, which is
unimplemented on FreeBSD.
The TestSandboxContainerSpec depends on a platform-specific
implementation oc criService.sandboxContainerSpec, which is
unimplemented on FreeBSD.
Signed-off-by: Samuel Karp <me@samuelkarp.com>
This change does a couple things to remove some cruft/unused functionality
in the Windows snapshotter, as well as add a way to specify the rootfs
size in bytes for a Windows container via a new field added in the CRI api in
k8s 1.24. Setting the rootfs/scratch volume size was assumed to be working
prior to this but turns out not to be the case.
Previously I'd added a change to pass any annotations in the containerd
snapshot form (containerd.io/snapshot/*) as labels for the containers
rootfs snapshot. This was added as a means for a client to be able to provide
containerd.io/snapshot/io.microsoft.container.storage.rootfs.size-gb as an
annotation and have that be translated to a label and ultimately set the
size for the scratch volume in Windows. However, this actually only worked if
interfacing with the CRI api directly (crictl) as Kubernetes itself will
fail to validate annotations that if split by "/" end up with > 2 parts,
which the snapshot labels will (containerd.io / snapshot / foobarbaz).
With this in mind, passing the annotations and filtering to
containerd.io/snapshot/* is moot, so I've removed this code in favor of
a new `snapshotterOpts()` function that will return platform specific
snapshotter options if ones exist. Now on Windows we can just check if
RootfsSizeInBytes is set on the WindowsContainerResources struct and
then return a snapshotter option that sets the right label.
So all in all this change:
- Gets rid of code to pass CRI annotations as labels down to snapshotters.
- Gets rid of the functionality to create a 1GB sized scratch disk if
the client provided a size < 20GB. This code is not used currently and
has a few logical shortcomings as it won't be able to create the disk
if a container is already running and using the same base layer. WCIFS
(driver that handles the unioning of windows container layers together)
holds open handles to some files that we need to delete to create the
1GB scratch disk is the underlying problem.
- Deprecates the containerd.io/snapshot/io.microsoft.container.storage.rootfs.size-gb
label in favor of a new containerd.io/snapshot/windows/rootfs.sizebytes label.
The previous label/annotation wasn't being used by us, and from a cursory
github search wasn't being used by anyone else either. Now that there is a CRI
field to specify the size, this should just be a field that users can set
on their pod specs and don't need to concern themselves with what it eventually
gets translated to, but non-CRI clients can still use the new label/deprecated
label as usual.
- Add test to cri integration suite to validate expanding the rootfs size.
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
A majority of the tests in /pkg/cri are testing/validating multiple
things per test (generally spec or options validations). This flow
lends itself well to using *testing.T's Run method to run each thing
as a subtest so `go test` output can actually display which subtest
failed/passed.
Some of the tests in the packages in pkg/cri already did this, but
a bunch simply logged what sub-testcase was currently running without
invoking t.Run.
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
HostProcess containers require every container in the pod to be a
host process container and have the corresponding field set. The Kubelet
usually enforces this so we'd error before even getting here but we recently
found a bug in this logic so better to be safe than sorry.
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
Schema 1 has been substantially deprecated since circa. 2017 in favor of Schema 2 introduced in Docker 1.10 (Feb 2016)
and its successor OCI Image Spec v1, but we have not officially deprecated Schema 1.
One of the reasons was that Quay did not support Schema 2 so far, but it is reported that Quay has been
supporting Schema 2 since Feb 2020 (moby/buildkit issue 409).
This PR deprecates pulling Schema 1 images but the feature will not be removed before containerd 2.0.
Pushing Schema 1 images was never implemented in containerd (and its consumers such as BuildKit).
Docker/Moby already disabled pushing Schema 1 images in Docker 20.10 (moby/moby PR 41295),
but Docker/Moby has not yet disabled pulling Schema 1 as containerd has not yet deprecated Schema 1.
(See the comments in moby/moby PR 42300.)
Docker/Moby is expected to disable pulling Schema 1 images in future after this deprecation.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
This patch adds support for a container annotation and two separate
pod annotations for controlling the blockio class of containers.
The container annotation can be used by a CRI client:
"io.kubernetes.cri.blockio-class"
Pod annotations specify the blockio class in the K8s pod spec level:
"blockio.resources.beta.kubernetes.io/pod"
(pod-wide default for all containers within)
"blockio.resources.beta.kubernetes.io/container.<container_name>"
(container-specific overrides)
Correspondingly, this patch adds support for --blockio-class and
--blockio-config-file to ctr, too.
This implementation follows the resource class annotation pattern
introduced in RDT and merged in commit 893701220.
Signed-off-by: Antti Kervinen <antti.kervinen@intel.com>
While gogo isn't actually used, it is still referenced from .proto files
and its corresponding Go package is imported from the auto-generated
files.
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
Kubelet sends the PullImage request without timeout, because the image size
is unknown and timeout is hard to defined. The pulling request might run
into 0B/s speed, if containerd can't receive any packet in that connection.
For this case, the containerd should cancel the PullImage request.
Although containerd provides ingester manager to track the progress of pulling
request, for example `ctr image pull` shows the console progress bar, it needs
more CPU resources to open/read the ingested files to get status.
In order to support progress timeout feature with lower overhead, this
patch uses http.RoundTripper wrapper to track active progress. That
wrapper will increase active-request number and return the
countingReadCloser wrapper for http.Response.Body. Each bytes-read
can be count and the active-request number will be descreased when the
countingReadCloser wrapper has been closed. For the progress tracker,
it can check the active-request number and bytes-read at intervals. If
there is no any progress, the progress tracker should cancel the
request.
NOTE: For each blob data, the containerd will make sure that the content
writer is opened before sending http request to the registry. Therefore, the
progress reporter can rely on the active-request number.
fixed: #4984
Signed-off-by: Wei Fu <fuweid89@gmail.com>
This commit migrates containerd/protobuf from github.com/gogo/protobuf
to google.golang.org/protobuf and adjust types. Proto-generated structs
cannot be passed as values.
Fixes#6564.
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
This commit removes the following gogoproto extensions;
- gogoproto.nullable
- gogoproto.customename
- gogoproto.unmarshaller_all
- gogoproto.stringer_all
- gogoproto.sizer_all
- gogoproto.marshaler_all
- gogoproto.goproto_unregonized_all
- gogoproto.goproto_stringer_all
- gogoproto.goproto_getters_all
None of them are supported by Google's toolchain (see #6564).
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
Create lease plugin type to separate lease manager from services plugin.
This allows other service plugins to depend on the lease manager.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Parallelizing them decreases loading duration.
Time to complete recover():
* Without competing IOs + without opt: 21s
* Without competing IOs + with opt: 14s
* Competing IOs + without opt: 3m44s
* Competing IOs + with opt: 33s
Signed-off-by: Eric Lin <linxiulei@gmail.com>
Extract the names of requested CDI devices and update the OCI
Spec according to the corresponding CDI device specifications.
CDI devices are requested using container annotations in the
cdi.k8s.io namespace. Once CRI gains dedicated fields for CDI
injection the snippet for extracting CDI names will need an
update.
Signed-off-by: Ed Bartosh <eduard.bartosh@intel.com>
Background:
With current design, the content backend uses key-lock for long-lived
write transaction. If the content reference has been marked for write
transaction, the other requestes on the same reference will fail fast with
unavailable error. Since the metadata plugin is based on boltbd which
only supports single-writer, the content backend can't block or handle
the request too long. It requires the client to handle retry by itself,
like OpenWriter - backoff retry helper. But the maximum retry interval
can be up to 2 seconds. If there are several concurrent requestes fo the
same image, the waiters maybe wakeup at the same time and there is only
one waiter can continue. A lot of waiters will get into sleep and we will
take long time to finish all the pulling jobs and be worse if the image
has many more layers, which mentioned in issue #4937.
After fetching, containerd.Pull API allows several hanlers to commit
same ChainID snapshotter but only one can be done successfully. Since
unpack tar.gz is time-consuming job, it can impact the performance on
unpacking for same ChainID snapshotter in parallel.
For instance, the Request 2 doesn't need to prepare and commit, it
should just wait for Request 1 finish, which mentioned in pull
request #6318.
```text
Request 1 Request 2
Prepare
|
|
|
| Prepare
Commit |
|
|
|
Commit(failed on exist)
```
Both content backoff retry and unnecessary unpack impacts the performance.
Solution:
Introduced the duplicate suppression in fetch and unpack context. The
deplicate suppression uses key-mutex and single-waiter-notify to support
singleflight. The caller can use the duplicate suppression in different
PullImage handlers so that we can avoid unnecessary unpack and spin-lock
in OpenWriter.
Test Result:
Before enhancement:
```bash
➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20
crictl pull localhost:5000/redis:latest (x20) takes ...
real 1m6.172s
user 0m0.268s
sys 0m0.193s
docker pull localhost:5000/redis:latest (x20) takes ...
real 0m1.324s
user 0m0.441s
sys 0m0.316s
➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20
crictl pull localhost:5000/golang:latest (x20) takes ...
real 1m47.657s
user 0m0.284s
sys 0m0.224s
docker pull localhost:5000/golang:latest (x20) takes ...
real 0m6.381s
user 0m0.488s
sys 0m0.358s
```
With this enhancement:
```bash
➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20
crictl pull localhost:5000/redis:latest (x20) takes ...
real 0m1.140s
user 0m0.243s
sys 0m0.178s
docker pull localhost:5000/redis:latest (x20) takes ...
real 0m1.239s
user 0m0.463s
sys 0m0.275s
➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20
crictl pull localhost:5000/golang:latest (x20) takes ...
real 0m5.546s
user 0m0.217s
sys 0m0.219s
docker pull localhost:5000/golang:latest (x20) takes ...
real 0m6.090s
user 0m0.501s
sys 0m0.331s
```
Test Script:
localhost:5000/{redis|golang}:latest is equal to
docker.io/library/{redis|golang}:latest. The image is hold in local registry
service by `docker run -d -p 5000:5000 --name registry registry:2`.
```bash
image_name="${1}"
pull_times="${2:-10}"
cleanup() {
ctr image rmi "${image_name}"
ctr -n k8s.io image rmi "${image_name}"
crictl rmi "${image_name}"
docker rmi "${image_name}"
sleep 2
}
crictl_testing() {
for idx in $(seq 1 ${pull_times}); do
crictl pull "${image_name}" > /dev/null 2>&1 &
done
wait
}
docker_testing() {
for idx in $(seq 1 ${pull_times}); do
docker pull "${image_name}" > /dev/null 2>&1 &
done
wait
}
cleanup > /dev/null 2>&1
echo 3 > /proc/sys/vm/drop_caches
sleep 3
echo "crictl pull $image_name (x${pull_times}) takes ..."
time crictl_testing
echo
echo 3 > /proc/sys/vm/drop_caches
sleep 3
echo "docker pull $image_name (x${pull_times}) takes ..."
time docker_testing
```
Fixes: #4937Close: #4985Close: #6318
Signed-off-by: Wei Fu <fuweid89@gmail.com>
This commit upgrades github.com/containerd/typeurl to use typeurl.Any.
The interface hides gogo/protobuf/types.Any from containerd's Go client.
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
runc option --criu is now ignored (with a warning), and the option will be
removed entirely in a future release. Users who need a non- standard criu
binary should rely on the standard way of looking up binaries in $PATH.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Currently when handling 'container_path' elements in container mounts we simply call
filepath.Clean on those paths. However, filepath.Clean adds an extra '.' if the path is a
simple drive letter ('E:' or 'Z:' etc.). These type of paths cause failures (with incorrect
parameter error) when creating containers via hcsshim. This commit checks for such paths
and doesn't call filepath.Clean on them.
It also adds a new check to error out if the destination path is a C drive and moves the
dst path checks out of the named pipe condition.
Signed-off-by: Amit Barve <ambarve@microsoft.com>
The linter on platforms that have a hardcoded response complains about
"if xyz == nil" checks; ignore those.
Signed-off-by: Phil Estes <estesp@amazon.com>
The directory created by `T.TempDir` is automatically removed when the
test and all its subtests complete.
Reference: https://pkg.go.dev/testing#T.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
There's two mappings of hostpath to IDType and ID in the wild:
- dockershim and dockerd-cri (implicitly via docker) use class/ID
-- The only supported IDType in Docker is 'class'.
-- https://github.com/aarnaud/k8s-directx-device-plugin generates this form
- https://github.com/jterry75/cri (windows_port branch) uses IDType://ID
-- hcsshim's CRI test suite generates this form
`://` is much more easily distinguishable, so I've gone with that one as
the generic separator, with `class/` as a special-case.
Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
These unit tests don't check hugetlb. However by setting
TolerateMissingHugetlbController to false, these tests can't
be run on system without hugetlb (e.g. Debian buildd).
Signed-off-by: Shengjing Zhu <zhsj@debian.org>
full diff: https://github.com/emicklei/go-restful/compare/v2.9.5...v3.7.3
- Switch to using go modules
- Add check for wildcard to fix CORS filter
- Add check on writer to prevent compression of response twice
- Add OPTIONS shortcut WebService receiver
- Add Route metadata to request attributes or allow adding attributes to routes
- Add wroteHeader set
- Enable content encoding on Handle and ServeHTTP
- Feat: support google custom verb
- Feature: override list of method allowed without content-type
- Fix Allow header not set on '405: Method Not Allowed' responses
- Fix Go 1.15: conversion from int to string yields a string of one rune
- Fix WriteError return value
- Fix: use request/response resulting from filter chain
- handle path params with prefixes and suffixes
- HTTP response body was broken, if struct to be converted to JSON has boolean value
- List available representations in 406 body
- Support describing response headers
- Unwrap function in filter chain + remove unused dispatchWithFilters
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
We were not properly ignoring errors from
gorestrl.rdt.ContainerClassFromAnnotations() causing the config option
to be ineffective, in practice.
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
With the cgroupv2 configuration employed by Kubernetes, the pod cgroup (slice)
and container cgroup (scope) will both have the same memory limit applied. In
that situation, the kernel will consider an OOM event to be triggered by the
parent cgroup (slice), and increment 'oom' there. The child cgroup (scope) only
sees an oom_kill increment. Since we monitor child cgroups for oom events,
check the OOMKill field so that we don't miss events.
This is not visible when running containers through docker or ctr, because they
set the limits differently (only container level). An alternative would be to
not configure limits at the pod level - that way the container limit will be
hit and the OOM will be correctly generated. An interesting consequence is that
when spawning a pod with multiple containers, the oom events also work
correctly, because:
a) if one of the containers has no limit, the pod has no limit so OOM events in
another container report correctly.
b) if all of the containers have limits then the pod limit will be a sum of
container events, so a container will be able to hit its limit first.
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
When the cgroup is removed, EventChan is closed (this was pulled in by
8d69c041c5). This results in a nil error
being received. Don't log an error in that case but instead return.
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
The Linux kernel never sets the Inheritable capability flag to
anything other than empty. Non-empty values are always exclusively
set by userspace code.
[The kernel stopped defaulting this set of capability values to the
full set in 2000 after a privilege escalation with Capabilities
affecting Sendmail and others.]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Enabling this option effectively causes RDT class of a container to be a
soft requirement. If RDT support has not been enabled the RDT class
setting will not have any effect.
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
Use goresctrl for parsing container and pod annotations related to RDT.
In practice, from the users' point of view, this patchs adds support for
a container annotation and two separate pod annotations for controlling
the RDT class of containers.
Container annotation can be used by a CRI client:
"io.kubernetes.cri.rdt-class"
Pod annotations for specifying the RDT class in the K8s pod spec level:
"rdt.resources.beta.kubernetes.io/pod"
(pod-wide default for all containers within)
"rdt.resources.beta.kubernetes.io/container.<container_name>"
(container-specific overrides)
Annotations are intended as an intermediate step before the CRI API
supports RDT.
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
The ability to handle KVM based runtimes with SELinux has been added as
part of d715d00906.
However, that commit introduced some logic to check whether the
"container_kvm_t" label would or not be present in the system, and while
the intentions were good, there's two major issues with the approach:
1. Inspecting "/etc/selinux/targeted/contexts/customizable_types" is not
the way to go, as it doesn't list the "container_kvm_t" at all.
2. There's no need to check for the label, as if the label is invalid an
"Invalid Label" error will be returned and that's it.
With those two in mind, let's simplify the logic behind setting the
"container_kvm_t" label, removing all the unnecessary code.
Here's an output of VMM process running, considering:
* The state before this patch:
```
$ containerd --version
containerd github.com/containerd/containerd v1.6.0-beta.3-88-g7fa44fc98 7fa44fc98f
$ kubectl apply -f ~/simple-pod.yaml
pod/nginx created
$ ps -auxZ | grep cloud-hypervisor
system_u:system_r:container_runtime_t:s0 root 609717 4.0 0.5 2987512 83588 ? Sl 08:32 0:00 /usr/bin/cloud-hypervisor --api-socket /run/vc/vm/be9d5cbabf440510d58d89fc8a8e77c27e96ddc99709ecaf5ab94c6b6b0d4c89/clh-api.sock
```
* The state after this patch:
```
$ containerd --version
containerd github.com/containerd/containerd v1.6.0-beta.3-89-ga5f2113c9 a5f2113c9fc15b19b2c364caaedb99c22de4eb32
$ kubectl apply -f ~/simple-pod.yaml
pod/nginx created
$ ps -auxZ | grep cloud-hypervisor
system_u:system_r:container_kvm_t:s0:c638,c999 root 614842 14.0 0.5 2987512 83228 ? Sl 08:40 0:00 /usr/bin/cloud-hypervisor --api-socket /run/vc/vm/f8ff838afdbe0a546f6995fe9b08e0956d0d0cdfe749705d7ce4618695baa68c/clh-api.sock
```
Note, the tests were performed using the following configuration snippet:
```
[plugins]
[plugins.cri]
enable_selinux = true
[plugins.cri.containerd]
[plugins.cri.containerd.runtimes]
[plugins.cri.containerd.runtimes.kata]
runtime_type = "io.containerd.kata.v2"
privileged_without_host_devices = true
```
And using the following pod yaml:
```
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
runtimeClassName: kata
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
```
Fixes: #6371
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
CRI API has been updated to include a an optional `resources` field in the
LinuxPodSandboxConfig field, as part of the RunPodSandbox request.
Having sandbox level resource details at sandbox creation time will have
large benefits for sandboxed runtimes. In the case of Kata Containers,
for example, this'll allow for better support of SW/HW architectures
which don't allow for CPU/memory hotplug, and it'll allow for better
queue sizing for virtio devices associated with the sandbox (in the VM
case).
If this sandbox resource information is provided as part of the run
sandbox request, let's introduce a pattern where we will update the
pause container's runtiem spec to include this information in the
annotations field.
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
When containerd use this config:
```
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."localhost:5000"]
endpoint = ["http://localhost:5000"]
```
Due to the `newTransport` function does not initialize the `TLSClientConfig` field.
Then use `TLSClientConfig` to cause nil pointer dereference
Signed-off-by: wanglei <wllenyj@linux.alibaba.com>
These are simple metrics that allow users to view more fine grained metrics on
internal operations.
Signed-off-by: Michael Crosby <michael@thepasture.io>
See https://kep.k8s.io/2371
* Implement new CRI RPCs - `ListPodSandboxStats` and `PodSandboxStats`
* `ListPodSandboxStats` and `PodSandboxStats` which return stats about
pod sandbox. To obtain pod sandbox stats, underlying metrics are
read from the pod sandbox cgroup parent.
* Process info is obtained by calling into the underlying task
* Network stats are taken by looking up network metrics based on the
pod sandbox network namespace path
* Return more detailed stats for cpu and memory for existing container
stats. These metrics use the underlying task's metrics to obtain
stats.
Signed-off-by: David Porter <porterdavid@google.com>
This commit adds a flag that enable all devices whitelisting when
privileged_without_host_devices is already enabled.
Fixes#5679
Signed-off-by: Dat Nguyen <dnguyen7@atlassian.com>
This change ignore errors during container runtime due to large
image labels and instead outputs warning. This is necessary as certain
image building tools like buildpacks may have large labels in the images
which need not be passed to the container.
Signed-off-by: Sambhav Kothari <sambhavs.email@gmail.com>
This will allow running Windows Containers to have their resource
limits updated through containerd. The CPU resource limits support
has been added for Windows Server 20H2 and newer, on older versions
hcsshim will raise an Unimplemented error.
Signed-off-by: Claudiu Belu <cbelu@cloudbasesolutions.com>
In linux 5.14 and hopefully some backports, core scheduling allows processes to
be co scheduled within the same domain on SMT enabled systems.
The containerd impl sets the core sched domain when launching a shim. This
allows a clean way for each shim(container/pod) to be in its own domain and any
additional containers, (v2 pods) be be launched with the same domain as well as
any exec'd process added to the container.
kernel docs: https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/core-scheduling.html
Signed-off-by: Michael Crosby <michael@thepasture.io>
Currently, there are few issues that preventing containers
with image volumes to properly start on Windows.
- Unlike the Linux implementation, the Container volume mount paths
were not created if they didn't exist. Those paths are now created.
- while copying the image volume contents to the container volume,
the layers were not properly deactivated, which means that the
container can't start since those layers are still open. The layers
are now properly deactivated, allowing the container to start.
- even if the above issue didn't exist, the Windows implementation of
mount/Mount.Mount deactivates the layers, which wouldn't allow us
to copy files from them. The layers are now deactivated after we've
copied the necessary files from them.
- the target argument of the Windows implementation of mount/Mount.Mount
was unused, which means that folder was always empty. We're now
symlinking the Layer Mount Path into the target folder.
- hcsshim needs its Container Mount Paths to be properly formated, to be
prefixed by C:. This was an issue for Volumes defined with Linux-like
paths (e.g.: /test_dir). filepath.Abs solves this issue.
Signed-off-by: Claudiu Belu <cbelu@cloudbasesolutions.com>
The io/ioutil package has been deprecated as of Go 1.16, see
https://golang.org/doc/go1.16#ioutil. This commit replaces the existing
io/ioutil functions with their new definitions in io and os packages.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
This fixes the TODO of this function and also expands on how the primary pod ip
is selected. This change allows the operator to prefer ipv4, ipv6, or retain the
ordering provided by the return results of the CNI plugins.
This makes it much more flexible for ops to configure containerd and how IPs are
set on the pod.
Signed-off-by: Michael Crosby <michael@thepasture.io>
After containerd restarts, it will try to recover its sandboxes,
containers, and images. If it detects a task in the Created or
Stopped state, it will be removed. This will cause the containerd
process it hang on Windows on the t.io.Wait() call.
Calling t.io.Close() beforehand will solve this issue.
Additionally, the same issue occurs when trying to stopp a sandbox
after containerd restarts. This will solve that case as well.
Signed-off-by: Claudiu Belu <cbelu@cloudbasesolutions.com>
The CRI-plugin subscribes the image event on k8s.io namespace. By
default, the image event is created by CRI-API. However, the image can
be downloaded by containerd API on k8s.io with the customized labels.
The CRI-plugin should use patch update for `io.cri-containerd.image`
label in this case.
Fixes: #5900
Signed-off-by: Wei Fu <fuweid89@gmail.com>
The Pod Sandbox can enter in a NotReady state if the task associated
with it no longer exists (it died, or it was killed). In this state,
the Pod network namespace could still be open, which means we can't
remove the sandbox, even if --force was used.
Signed-off-by: Claudiu Belu <cbelu@cloudbasesolutions.com>
With the introduction of Windows Server 2022, some images have been updated
to support WS2022 in their manifest list. This commit updates the test images
accordingly.
Signed-off-by: Adelina Tuvenie <atuvenie@cloudbasesolutions.com>
CRI container runtimes mount devices (set via kubernetes device plugins)
to containers by taking the host user/group IDs (uid/gid) to the
corresponding container device.
This triggers a problem when trying to run those containers with
non-zero (root uid/gid = 0) uid/gid set via runAsUser/runAsGroup:
the container process has no permission to use the device even when
its gid is permissive to non-root users because the container user
does not belong to that group.
It is possible to workaround the problem by manually adding the device
gid(s) to supplementalGroups. However, this is also problematic because
the device gid(s) may have different values depending on the workers'
distro/version in the cluster.
This patch suggests to take RunAsUser/RunAsGroup set via SecurityContext
as the device UID/GID, respectively. The feature must be enabled by
setting device_ownership_from_security_context runtime config value to
true (valid on Linux only).
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Go 1.15.7 contained a security fix for CVE-2021-3115, which allowed arbitrary
code to be executed at build time when using cgo on Windows. This issue also
affects Unix users who have “.” listed explicitly in their PATH and are running
“go get” outside of a module or with module mode disabled.
This issue is not limited to the go command itself, and can also affect binaries
that use `os.Command`, `os.LookPath`, etc.
From the related blogpost (ttps://blog.golang.org/path-security):
> Are your own programs affected?
>
> If you use exec.LookPath or exec.Command in your own programs, you only need to
> be concerned if you (or your users) run your program in a directory with untrusted
> contents. If so, then a subprocess could be started using an executable from dot
> instead of from a system directory. (Again, using an executable from dot happens
> always on Windows and only with uncommon PATH settings on Unix.)
>
> If you are concerned, then we’ve published the more restricted variant of os/exec
> as golang.org/x/sys/execabs. You can use it in your program by simply replacing
This patch replaces all uses of `os/exec` with `golang.org/x/sys/execabs`. While
some uses of `os/exec` should not be problematic (e.g. part of tests), it is
probably good to be consistent, in case code gets moved around.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
There was recent changes to cri to bring in a Windows section containing a
security context object to the pod config. Before this there was no way to specify
a user for the pod sandbox container to run as. In addition, the security context
is a field for field mirror of the Windows container version of it, so add the
ability to specify a GMSA credential spec for the pod sandbox container as well.
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
Exclude the `security.selinux` xattr when copying content from layer
storage for image volumes. This allows for the already correct label
at the target location to be applied to the copied content, thus
enabling containers to write to volumes that they implicitly expect to be
able to write to.
- Fixescontainerd/containerd#5090
- See rancher/rke2#690
Signed-off-by: Jacob Blain Christen <jacob@rancher.com>
Refactor shim v2 to load and register plugins.
Update init shim interface to not require task service implementation on
returned service, but register as plugin if it is.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Remove build tags which are already implied by the name of the file.
Ensures build tags are used consistently
Signed-off-by: Derek McGowan <derek@mcg.dev>
In containerd 1.5.x, we introduced support for go modules by adding a
go.mod file in the root directory. This go.mod lists all the things
needed across the whole code base (with the exception of
integration/client which has its own go.mod). So when projects that
need to make calls to containerd API will pull in some code from
containerd/containerd, the `go mod` commands will add all the things
listed in the root go.mod to the projects go.mod file. This causes
some problems as the list of things needed to make a simple API call
is enormous. in effect, making a API call will pull everything that a
typical server needs as well as the root go.mod is all encompassing.
In general if we had smaller things folks could use, that will make it
easier by reducing the number of things that will end up in a consumers
go.mod file.
Now coming to a specific problem, the root containerd go.mod has various
k8s.io/* modules listed. Also kubernetes depends on containerd indirectly
via both moby/moby (working with docker maintainers seperately) and via
google/cadvisor. So when the kubernetes maintainers try to use latest
1.5.x containerd, they will see the kubernetes go.mod ending up depending
on the older version of kubernetes!
So if we can expose just the minimum things needed to make a client API
call then projects like cadvisor can adopt that instead of pulling in
the entire go.mod from containerd. Looking at the existing code in
cadvisor the minimum things needed would be the api/ directory from
containerd. Please see proof of concept here:
github.com/google/cadvisor/pull/2908
To enable that, in this PR, we add a go.mod file in api/ directory. we
split the Protobuild.yaml into two, one for just the things in api/
directory and the rest in the root directory. We adjust various targets
to build things correctly using `protobuild` and also ensure that we
end up with the same generated code as before as well. To ensure we
better take care of the various go.mod/go.sum files, we update the
existing `make vendor` and also add a new `make verify-vendor` that one
can run locally as well in the CI.
Ideally, we would have a `containerd/client` either as a standalone repo
or within `containerd/containerd` as a separate go module. but we will
start here to experiment with a standalone api go module first.
Also there are various follow ups we can do, for example @thaJeztah has
identified two tasks we could do after this PR lands:
github.com/containerd/containerd/pull/5716#discussion_r668821396
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
systemd uses SIGRTMIN+n signals, but containerd didn't support the signals
since Go's sys/unix doesn't support them.
This change introduces SIGRTMIN+n handling by utilizing moby/sys/signal.
Fixes#5402.
https://www.freedesktop.org/software/systemd/man/systemd.html#Signals
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
Before this change, for several of the services that `WithServices`
handles, only the grpc client is supported.
Now, for instance, one can use an `images.Store` directly instead of
only an `imagesapi.StoreSlient`.
Some of the methods have been renamed to satisfy the difference between
using a grpc `<Foo>Client` vs the main interface.
I did not see a good candidate for TaskService so have left that mostly
unchanged.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
CNI plugins that need to wait for network state to converge
may want to cancel waiting when a short lived pod is deleted.
However, there is a race between when kubelet asks the runtime
to create the sandbox for the pod, and when the plugin is able
request the pod object from the apiserver. It may be the case
that the plugin receives the new pod, rather than the pod
the sandbox request was initiated for.
Passing the pod UID to the plugin allows the plugin to check
whether the pod it gets from the apiserver is actually the
pod its sandbox request was started for.
Signed-off-by: Dan Williams <dcbw@redhat.com>
Similar to other deferred cleanup operations, teardownPodNetwork should
use a different context as the original context may have expired,
otherwise CNI wouldn't been invoked, leading to leak of network
resources, e.g. IP addresses.
Signed-off-by: Quan Tian <qtian@vmware.com>
This change splits the definition of pkg/cri/os.ResolveSymbolicLink by
platform (windows/!windows), and switches to an alternate implementation
for Windows. This aims to fix the issue described in containerd/containerd#5405.
The previous implementation which just called filepath.EvalSymlinks has
historically had issues on Windows. One of these issues we were able to
fix in Go, but EvalSymlinks's behavior is not well specified on
Windows, and there could easily be more issues in the future, so it
seems prudent to move to a separate implementation for Windows.
The new implementation uses the Windows GetFinalPathNameByHandle API,
which takes a handle to an open file or directory and some flags, and
returns the "real" name for the object. See comments in the code for
details on the implementation.
I have tested this change with a variety of mounts and everything seems
to work as expected. Functions that make incorrect assumptions on what a
Windows path can look like may have some trouble with the \\?\ path
syntax. For instance EvalSymlinks fails when given a \\?\UNC\ path. For
this reason, the resolvePath implementation modifies the returned path
to translate to the more common form (\\?\UNC\server\share ->
\\server\share).
Signed-off-by: Kevin Parsons <kevpar@microsoft.com>
This commit adds support for the PID namespace mode TARGET
when generating a container spec.
The container that is created will be sharing its PID namespace
with the target container that was specified by ID in the namespace
options.
Signed-off-by: Thomas Hartland <thomas.george.hartland@cern.ch>
It does not make sense to check if seccomp is supported by the kernel
more than once per runtime, so let's use sync.Once to speed it up.
A quick benchmark (old implementation, before this commit, after):
BenchmarkIsEnabledOld-4 37183 27971 ns/op
BenchmarkIsEnabled-4 1252161 947 ns/op
BenchmarkIsEnabledOnce-4 666274008 2.14 ns/op
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Current implementation of seccomp.IsEnabled (rooted in runc) is not
too good.
First, it parses the whole /proc/self/status, adding each key: value
pair into the map (lots of allocations and future work for garbage
collector), when using a single key from that map.
Second, the presence of "Seccomp" key in /proc/self/status merely means
that kernel option CONFIG_SECCOMP is set, but there is a need to _also_
check for CONFIG_SECCOMP_FILTER (the code for which exists but never
executed in case /proc/self/status has Seccomp key).
Replace all this with a single call to prctl; see the long comment in
the code for details.
While at it, improve the IsEnabled documentation.
NOTE historically, parsing /proc/self/status was added after a concern
was raised in https://github.com/opencontainers/runc/pull/471 that
prctl(PR_GET_SECCOMP, ...) can result in the calling process being
killed with SIGKILL. This is a valid concern, so the new code here
does not use PR_GET_SECCOMP at all.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Looks like we had our own copy of the "getDevices" code already, so use
that code (which also matches the code that's used to _generate_ the spec,
so a better match).
Moving the code to a separate file, I also noticed that the _unix and _linux
code was _exactly_ the same (baring some `//nolint:` comments), so also
removing the duplicated code.
With this patch applied, we removed the dependency on the libcontainer/devices
package (leaving only libcontainer/user).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Move `pkg/cri/opts.WithoutRunMount` function to `oci.WithoutRunMount`
so that it can be used without dependency on CRI.
Also add `oci.WithoutMounts(dests ...string)` for generality.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Currently image references end up being stored in a
random order due to the way maps are iterated through
in Go. This leads to inconsistent identifiers being
resolved when a single reference is needed to identify
an image and the ordering of the references is used for
the selection.
Sort references in a consistent and ranked manner,
from higher information formats to lower.
Note: A `name + tag` reference is considered higher
information than a `name + digest` reference since a
registry may be used to resolve the digest from a
`name + tag` reference.
Signed-off-by: Derek McGowan <derek@mcg.dev>
golang has enabled RFC 6555 Fast Fallback (aka HappyEyeballs)
by default in 1.12.
It means that if a host resolves to both IPv6 and IPv4,
it will try to connect to any of those addresses and use the
working connection.
However, the implementation uses go routines to start both connections in parallel,
and this has limitations when running inside a namespace, so we try to the connections
serially, trying IPv4 first for keeping the same behaviour.
xref https://github.com/golang/go/issues/44922
Signed-off-by: Antonio Ojea <aojea@redhat.com>
- process.Init#io could be nil
- Make sure CreateTaskRequest#Options is not empty before unmarshaling
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
This enables cases where devices exist in a subdirectory of /dev,
particularly where those device names are not portable across machines,
which makes it problematic to specify from a runtime such as cri.
Added this to `ctr` as well so I could test that the code at least
works.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This will be used instead of the cri registry config in the main config
toml.
---
Also pulls in changes from containerd/cri@d0b4eecbb3
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
pkg/cap has the full list of the caps (for UT, originally),
so we can drop dependency on github.com/syndtr/gocapability
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
This change is needed for running the latest containerd inside Docker
that is not aware of the recently added caps (BPF, PERFMON, CHECKPOINT_RESTORE).
Without this change, containerd inside Docker fails to run containers with
"apply caps: operation not permitted" error.
See kubernetes-sigs/kind 2058
NOTE: The caller process of this function is now assumed to be as
privileged as possible.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Newer golangci-lint needs explicit `//` separator. Otherwise it treats
the entire line (`staticcheck deprecated ... yet`) as a name.
https://golangci-lint.run/usage/false-positives/#nolint
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
Looks like this import was not needed for the test; simplified the test
by just using the device-path (a counter would work, but for debugging,
having the list of paths can be useful).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
For some tools having the actual image name in the annotations is helpful for
debugging and auditing the workload.
Signed-off-by: Michael Crosby <michael@thepasture.io>
bump version 1.3.2 for gogo/protobuf due to CVE-2021-3121 discovered
in gogo/protobuf version 1.3.1, CVE has been fixed in 1.3.2
Signed-off-by: Aditi Sharma <adi.sky17@gmail.com>
This was dumping untrusted output to the debug logs from user containers.
We should not dump this type of information to reduce log sizes and any
information leaks from user containers.
Signed-off-by: Michael Crosby <michael@thepasture.io>
The event monitor handles exit events one by one. If there is something
wrong about deleting task, it will slow down the terminating Pods. In
order to reduce the impact, the exit event watcher should handle exit
event separately. If it failed, the watcher should put it into backoff
queue and retry it.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
Use a PrefixFilter() to get only the mounts we're interested in,
which removes the need to manually filter mounts from the mountinfo
results.
Additional optimizations can be made, as:
> ... there's a little known fact that `umount(MNT_DETACH)` is actually
> recursive in Linux, IOW this function can be replaced with
> `unix.Umount(target, unix.MNT_DETACH)` (or `mount.UnmountAll(target, unix.MNT_DETACH)`
> (provided that target itself is a mount point).
e8fb2c392f (r535450446)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
containerd is responsible for creating the log but there is no code to ensure
that the log dir exists. While kubelet should have created this there can be
times where this is not the case and this can cause stuck tasks.
Signed-off-by: Michael Crosby <michael@thepasture.io>
The build tag was removed in go-selinux v1.8.0: opencontainers/selinux#132
Related: remove "apparmor" build tag: 0a9147f3aa
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
The "apparmor" build tag does not have any cgo dependency and can be removed safely.
Related: https://github.com/opencontainers/runc/issues/2704
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
When a base runtime spec is being used, admins can configure defaults for the
spec so that default ulimits or other security related settings get applied for
all containers launched.
Signed-off-by: Michael Crosby <michael@thepasture.io>
recent versions of libcontainer/apparmor simplified the AppArmor
check to only check if the host supports AppArmor, but no longer
checks if apparmor_parser is installed, or if we're running
docker-in-docker;
bfb4ea1b1b
> The `apparmor_parser` binary is not really required for a system to run
> AppArmor from a runc perspective. How to apply the profile is more in
> the responsibility of higher level runtimes like Podman and Docker,
> which may do the binary check on their own.
This patch copies the logic from libcontainer/apparmor, and
restores the additional checks.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This is a followup to #4699 that addresses an oversight that could cause
the CRI to relabel the host /dev/shm, which should be a no-op in most
cases. Additionally, fixes unit tests to make correct assertions for
/dev/shm relabeling.
Discovered while applying the changes for #4699 to containerd/cri 1.4:
https://github.com/containerd/cri/pull/1605
Signed-off-by: Jacob Blain Christen <jacob@rancher.com>
Address an issue originally seen in the k3s 1.3 and 1.4 forks of containerd/cri, https://github.com/rancher/k3s/issues/2240
Even with updated container-selinux policy, container-local /dev/shm
will get mounted with container_runtime_tmpfs_t because it is a tmpfs
created by the runtime and not the container (thus, container_runtime_t
transition rules apply). The relabel mitigates such, allowing envoy
proxy to work correctly (and other programs that wish to write to their
/dev/shm) under selinux.
Tested locally with:
- SELINUX=Enforcing vagrant up --provision-with=shell,selinux,test-integration
- SELINUX=Enforcing CRITEST_ARGS=--ginkgo.skip='HostIpc is true' vagrant up --provision-with=shell,selinux,test-cri
- SELINUX=Permissive CRITEST_ARGS=--ginkgo.focus='HostIpc is true' vagrant up --provision-with=shell,selinux,test-cri
Signed-off-by: Jacob Blain Christen <jacob@rancher.com>
Made a change yesterday that passed through snapshotter labels into the wrapper of
WithNewSnapshot, but it passed the entirety of the annotations into the snapshotter.
This change just filters the set that we care about down to snapshotter specific
labels.
Will probably be future changes to add some more labels for LCOW/WCOW and the corresponding
behavior for these new labels.
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
Previously there wwasn't a way to pass any labels to snapshotters as the wrapper
around WithNewSnapshot didn't have a parm to pass them in.
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
The tricks performed by ensureRemoveAll only make sense for Linux and
other Unices, so separate it out, and make ensureRemoveAll for Windows
just an alias of os.RemoveAll.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
In containerd, there is a size limit for label size (4096 chars).
Currently if an image has many layers (> (4096-39)/72 > 56),
`containerd.io/snapshot/cri.image-layers` will hit the limit of label size and
the unpack will fail.
This commit fixes this by limiting the size of the annotation.
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
The default values of masked and readonly paths are defined
in populateDefaultUnixSpec, and are used when a sandbox is
created. It is not, however, used for new containers. If
a container definition does not contain a security context
specifying masked/readonly paths, a container created from
it does not have masked and readonly paths.
This patch applies the default values to masked and
readonly paths of a new container, when any specific values
are not specified.
Fixes#1569
Signed-off-by: Yohei Ueda <yohei@jp.ibm.com>
Currently the shims only support starting the logging binary process if the
io.Creator Config does not specify Terminal: true. This means that the program
using containerd will only be able to specify FIFO io when Terminal: true,
rather than allowing the shim to fork the logging binary process. Hence,
containerd consumers face an inconsistent behavior regarding logging binary
management depending on the Terminal option.
Allowing the shim to fork the logging binary process will introduce consistency
between the running container and the logging process. Otherwise, the logging
process may die if its parent process dies whereas the container will keep
running, resulting in the loss of container logs.
Signed-off-by: Akshat Kumar <kshtku@amazon.com>
This allows development with container to be done for NRI without the need for
custom builds.
This is an experimental feature and is not enabled unless a user has a global
`/etc/nri/conf.json` config setup with plugins on the system. No NRI code will
be executed if this config file does not exist.
Signed-off-by: Michael Crosby <michael@thepasture.io>
This commit adds a config flag for allowing GC to clean layer contents up after
unpacking these contents completed, which leads to deduplication of layer
contents between the snapshotter and the contnet store.
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
This allows an admin to set the upper bounds on the category range for selinux
labels. This can be useful when handling allocation of PVs or other volume
types that need to be shared with selinux enabled on the hosts and volumes.
Signed-off-by: Michael Crosby <michael@thepasture.io>
Adds support to mount named pipes into Windows containers. This support
already exists in hcsshim, so this change just passes them through
correctly in cri. Named pipe mounts must start with "\\.\pipe\".
Signed-off-by: Kevin Parsons <kevpar@microsoft.com>
Reason: originally it was introduced to prevent the loading of the SCTP kernel module on the nodes. But iptables chain creation alone does not load the kernel module. The module would be loaded if an SCTP socket was created, but neither cri nor the portmap CNI plugin starts managing SCTP sockets if hostPort / portmappings are defined.
Signed-off-by: Laszlo Janosi <laszlo.janosi@ibm.com>
This adds a configuration knob for adding request headers to all
registry requests. It is not namespaced to a registry.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
How to test (from https://github.com/opencontainers/runc/pull/2352#issuecomment-620834524):
(host)$ sudo swapoff -a
(host)$ sudo ctr run -t --rm --memory-limit $((1024*1024*32)) docker.io/library/alpine:latest foo
(container)$ sh -c 'VAR=$(seq 1 100000000)'
An event `/tasks/oom {"container_id":"foo"}` will be displayed in `ctr events`.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
This moves most of the API calls off of the `labels` package onto the root
selinux package. This is the newer API for most selinux operations.
Signed-off-by: Michael Crosby <michael@thepasture.io>
We encountered two failing end-to-end tests after the adoption of
https://github.com/containerd/cri/pull/1470 in
https://github.com/cri-o/cri-o/pull/3749:
```
Summarizing 2 Failures:
[Fail] [sig-cli] Kubectl Port forwarding With a server listening on 0.0.0.0 that expects a client request [It] should support a client that connects,
sends DATA, and disconnects
test/e2e/kubectl/portforward.go:343
[Fail] [sig-cli] Kubectl Port forwarding With a server listening on localhost that expects a client request [It] should support a client that connects
, sends DATA, and disconnects
test/e2e/kubectl/portforward.go:343
```
Increasing the timeout to 1s fixes the issue.
Signed-off-by: Sascha Grunert <sgrunert@suse.com>
This swaps the RunningInUserNS() function that we're using
from libcontainer/system with the one in containerd/sys.
This removes the dependency on libcontainer/system, given
these were the only functions we're using from that package.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The `DualStack` option was deprecated in Go 1.12, and is now enabled by default
(through commit github.com/golang/go@efc185029bf770894defe63cec2c72a4c84b2ee9).
> The Dialer.DualStack field is now meaningless and documented as deprecated.
>
> To disable fallback, set FallbackDelay to a negative value.
The default `FallbackDelay` is 300ms; to make this more explicit, this patch
sets `FallbackDelay` to the default value.
Note that Docker Hub currently does not support IPv6 (DNS for registry-1.docker.io
has no AAAA records, so we should not hit the 300ms delay).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
use goroutines to copy the data from the stream to the TCP
connection, and viceversa, removing the socat dependency.
Quoting Lantao Liu, the logic is as follow:
When one side (either pod side or user side) of portforward
is closed, we should stop port forwarding.
When one side is closed, the io.Copy use that side as source will close,
but the io.Copy use that side as dest won't.
Signed-off-by: Antonio Ojea <antonio.ojea.garcia@gmail.com>
This changes adds `default_seccomp_profile` config switch to apply default seccomp profile when not provided by k8s.a
Signed-off-by: Maksym Pavlenko <pavlenko.maksym@gmail.com>
Should destroy the pod network if fails to setup or return invalid
net interface, especially multiple CNI configurations.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
Currently, CRI plugin passes each layer digest to remote snapshotters
sequentially, which leads to sequential snapshots preparation. But it costs
extra time especially for remote snapshotters which need to connect to the
remote backend store (e.g. registries) for checking the snapshot existence on
each preparation.
This commit solves this problem by introducing new label
`containerd.io/snapshot/cri.chain` for passing all layer digests in an image to
snapshotters and by allowing them to prepare these snapshots in parallel, which
leads to speed up the preparation.
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
For a privilege pods with PrivilegedWithoutHostDevices set to true
host device specified in the config are not provided (whereas it is done for
non privilege pods or privilege pods with PrivilegedWithoutHostDevices set
to false as all devices are included).
Add them in this case.
Fixes: 3353ab76d9 ("Add flag to overload default privileged host device behaviour")
Signed-off-by: Thibaut Collet <thibaut.collet@6wind.com>
containerd loads timeout values from config.toml and populated those
values to `timeout` package at launch. So when using `timeout` package
from shim, there are default values and config file is ignored.
So use a hardcoded value for binary IO.
Signed-off-by: Maksym Pavlenko <makpav@amazon.com>
Throughout container lifecycle, pulling image is one of the time-consuming
steps. Recently, containerd community started to tackle this issue with
stargz-based remote snapshots, as a non-core
subproject(https://github.com/containerd/stargz-snapshotter).
This snapshotter is implemented as a standard proxy plugin but it requires the
client to pass some additional information (image ref and layer digest) for each
pull operation to query layer contents on the registry. Stargz snapshotter
project provides an image handler to do this and stargz snapshot users need to
pass this handler to containerd client.
This commit enables to use stargz-based remote snapshots through CRI by passing
the handler to containerd client on pull operation.
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
Along with type(Sandbox or Container) and Sandbox name annotations
provide support for additional annotation:
- Container name
This will help us perform per container operation by comparing it
with pass through annotations (eg. pod metadata annotations from K8s)
Signed-off-by: Chethan Suresh <Chethan.Suresh@sony.com>
With go RWMutex design, no goroutine should expect to be able to
acquire a read lock until the read lock has been released, if one
goroutine call lock.
The original design is to reload cni network config on every single
Status CRI gRPC call. If one RunPodSandbox request holds read lock
to allocate IP for too long, all other RunPodSandbox/StopPodSandbox
requests will wait for the RunPodSandbox request to release read lock.
And the Status CRI call will fail and kubelet becomes NOTReady.
Reload cni network config at every single Status CRI call is not
necessary and also brings NOTReady situation. To lower the possibility
of NOTReady, CRI will reload cni network config if there is any valid fs
change events from the cni network config dir.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
The resize chan is never closed when doing exec/attach now. What's more,
`resize` is a recieved only chan so it can not be closed. Use ctx to
exit the goroutine in `handleResizing` properly.
Signed-off-by: Li Yuxuan <liyuxuan04@baidu.com>
Instead of having several dialer implementations, leave only one in
`pkg/dialer` and call it from `pkg/ttrpcutil`, `runtime/v(1|2)/shim`
which had their own
Closes#3471.
Signed-off-by: Kiril Vladimiroff <kiril@vladimiroff.org>
The pkg/store errors are duplicated errors of NotFound and AlreadyExist from
containerd's errdefs package and thus do not properly serialize when running
errdefs.ToGRPC on them. CRI runs this function on every return from a CRI
method so the conversion fails if there is a cache miss from the store caches
for containers or sandboxes. This change verifies that the errors are properly
converted to their gRPC values.
Signed-off-by: Justin Terry (VM) <juterry@microsoft.com>
In cgroup v1 container implementations, cgroupns is not used by default because
it was not available in the kernel until kernel 4.6 (May 2016), and the default
behavior will not change on cgroup v1 environments, because changing the
default will break compatibility and surprise users.
For cgroup v2, implementations are going to unshare cgroupns by default
so as to hide /sys/fs/cgroup from containers.
* Discussion: https://github.com/containers/libpod/issues/4363
* Podman PR (merged): https://github.com/containers/libpod/pull/4374
* Moby PR: https://github.com/moby/moby/pull/40174
This PR enables cgroupns for containers, but pod sandboxes are untouched
because probably there is no need to do.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
The former default runtime "io.containerd.runc.v1" won't support new features
like support for cgroup v2: containerd/containerd#3726
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Reized the I/O buffers to align with the size of the kernel buffers with fifos
and move the close aspect of the console to key off of the stdin closing.
Fixes#3738
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>