The Pod Sandbox can enter in a NotReady state if the task associated
with it no longer exists (it died, or it was killed). In this state,
the Pod network namespace could still be open, which means we can't
remove the sandbox, even if --force was used.
Signed-off-by: Claudiu Belu <cbelu@cloudbasesolutions.com>
With the introduction of Windows Server 2022, some images have been updated
to support WS2022 in their manifest list. This commit updates the test images
accordingly.
Signed-off-by: Adelina Tuvenie <atuvenie@cloudbasesolutions.com>
CRI container runtimes mount devices (set via kubernetes device plugins)
to containers by taking the host user/group IDs (uid/gid) to the
corresponding container device.
This triggers a problem when trying to run those containers with
non-zero (root uid/gid = 0) uid/gid set via runAsUser/runAsGroup:
the container process has no permission to use the device even when
its gid is permissive to non-root users because the container user
does not belong to that group.
It is possible to workaround the problem by manually adding the device
gid(s) to supplementalGroups. However, this is also problematic because
the device gid(s) may have different values depending on the workers'
distro/version in the cluster.
This patch suggests to take RunAsUser/RunAsGroup set via SecurityContext
as the device UID/GID, respectively. The feature must be enabled by
setting device_ownership_from_security_context runtime config value to
true (valid on Linux only).
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
There was recent changes to cri to bring in a Windows section containing a
security context object to the pod config. Before this there was no way to specify
a user for the pod sandbox container to run as. In addition, the security context
is a field for field mirror of the Windows container version of it, so add the
ability to specify a GMSA credential spec for the pod sandbox container as well.
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
Exclude the `security.selinux` xattr when copying content from layer
storage for image volumes. This allows for the already correct label
at the target location to be applied to the copied content, thus
enabling containers to write to volumes that they implicitly expect to be
able to write to.
- Fixescontainerd/containerd#5090
- See rancher/rke2#690
Signed-off-by: Jacob Blain Christen <jacob@rancher.com>
Refactor shim v2 to load and register plugins.
Update init shim interface to not require task service implementation on
returned service, but register as plugin if it is.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Remove build tags which are already implied by the name of the file.
Ensures build tags are used consistently
Signed-off-by: Derek McGowan <derek@mcg.dev>
systemd uses SIGRTMIN+n signals, but containerd didn't support the signals
since Go's sys/unix doesn't support them.
This change introduces SIGRTMIN+n handling by utilizing moby/sys/signal.
Fixes#5402.
https://www.freedesktop.org/software/systemd/man/systemd.html#Signals
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
Before this change, for several of the services that `WithServices`
handles, only the grpc client is supported.
Now, for instance, one can use an `images.Store` directly instead of
only an `imagesapi.StoreSlient`.
Some of the methods have been renamed to satisfy the difference between
using a grpc `<Foo>Client` vs the main interface.
I did not see a good candidate for TaskService so have left that mostly
unchanged.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
CNI plugins that need to wait for network state to converge
may want to cancel waiting when a short lived pod is deleted.
However, there is a race between when kubelet asks the runtime
to create the sandbox for the pod, and when the plugin is able
request the pod object from the apiserver. It may be the case
that the plugin receives the new pod, rather than the pod
the sandbox request was initiated for.
Passing the pod UID to the plugin allows the plugin to check
whether the pod it gets from the apiserver is actually the
pod its sandbox request was started for.
Signed-off-by: Dan Williams <dcbw@redhat.com>
Similar to other deferred cleanup operations, teardownPodNetwork should
use a different context as the original context may have expired,
otherwise CNI wouldn't been invoked, leading to leak of network
resources, e.g. IP addresses.
Signed-off-by: Quan Tian <qtian@vmware.com>
This commit adds support for the PID namespace mode TARGET
when generating a container spec.
The container that is created will be sharing its PID namespace
with the target container that was specified by ID in the namespace
options.
Signed-off-by: Thomas Hartland <thomas.george.hartland@cern.ch>
Looks like we had our own copy of the "getDevices" code already, so use
that code (which also matches the code that's used to _generate_ the spec,
so a better match).
Moving the code to a separate file, I also noticed that the _unix and _linux
code was _exactly_ the same (baring some `//nolint:` comments), so also
removing the duplicated code.
With this patch applied, we removed the dependency on the libcontainer/devices
package (leaving only libcontainer/user).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Move `pkg/cri/opts.WithoutRunMount` function to `oci.WithoutRunMount`
so that it can be used without dependency on CRI.
Also add `oci.WithoutMounts(dests ...string)` for generality.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Currently image references end up being stored in a
random order due to the way maps are iterated through
in Go. This leads to inconsistent identifiers being
resolved when a single reference is needed to identify
an image and the ordering of the references is used for
the selection.
Sort references in a consistent and ranked manner,
from higher information formats to lower.
Note: A `name + tag` reference is considered higher
information than a `name + digest` reference since a
registry may be used to resolve the digest from a
`name + tag` reference.
Signed-off-by: Derek McGowan <derek@mcg.dev>
golang has enabled RFC 6555 Fast Fallback (aka HappyEyeballs)
by default in 1.12.
It means that if a host resolves to both IPv6 and IPv4,
it will try to connect to any of those addresses and use the
working connection.
However, the implementation uses go routines to start both connections in parallel,
and this has limitations when running inside a namespace, so we try to the connections
serially, trying IPv4 first for keeping the same behaviour.
xref https://github.com/golang/go/issues/44922
Signed-off-by: Antonio Ojea <aojea@redhat.com>
This enables cases where devices exist in a subdirectory of /dev,
particularly where those device names are not portable across machines,
which makes it problematic to specify from a runtime such as cri.
Added this to `ctr` as well so I could test that the code at least
works.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This will be used instead of the cri registry config in the main config
toml.
---
Also pulls in changes from containerd/cri@d0b4eecbb3
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This change is needed for running the latest containerd inside Docker
that is not aware of the recently added caps (BPF, PERFMON, CHECKPOINT_RESTORE).
Without this change, containerd inside Docker fails to run containers with
"apply caps: operation not permitted" error.
See kubernetes-sigs/kind 2058
NOTE: The caller process of this function is now assumed to be as
privileged as possible.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Newer golangci-lint needs explicit `//` separator. Otherwise it treats
the entire line (`staticcheck deprecated ... yet`) as a name.
https://golangci-lint.run/usage/false-positives/#nolint
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
Looks like this import was not needed for the test; simplified the test
by just using the device-path (a counter would work, but for debugging,
having the list of paths can be useful).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
For some tools having the actual image name in the annotations is helpful for
debugging and auditing the workload.
Signed-off-by: Michael Crosby <michael@thepasture.io>
This was dumping untrusted output to the debug logs from user containers.
We should not dump this type of information to reduce log sizes and any
information leaks from user containers.
Signed-off-by: Michael Crosby <michael@thepasture.io>
The event monitor handles exit events one by one. If there is something
wrong about deleting task, it will slow down the terminating Pods. In
order to reduce the impact, the exit event watcher should handle exit
event separately. If it failed, the watcher should put it into backoff
queue and retry it.
Signed-off-by: Wei Fu <fuweid89@gmail.com>