This fixes the TODO of this function and also expands on how the primary pod ip
is selected. This change allows the operator to prefer ipv4, ipv6, or retain the
ordering provided by the return results of the CNI plugins.
This makes it much more flexible for ops to configure containerd and how IPs are
set on the pod.
Signed-off-by: Michael Crosby <michael@thepasture.io>
After containerd restarts, it will try to recover its sandboxes,
containers, and images. If it detects a task in the Created or
Stopped state, it will be removed. This will cause the containerd
process it hang on Windows on the t.io.Wait() call.
Calling t.io.Close() beforehand will solve this issue.
Additionally, the same issue occurs when trying to stopp a sandbox
after containerd restarts. This will solve that case as well.
Signed-off-by: Claudiu Belu <cbelu@cloudbasesolutions.com>
The CRI-plugin subscribes the image event on k8s.io namespace. By
default, the image event is created by CRI-API. However, the image can
be downloaded by containerd API on k8s.io with the customized labels.
The CRI-plugin should use patch update for `io.cri-containerd.image`
label in this case.
Fixes: #5900
Signed-off-by: Wei Fu <fuweid89@gmail.com>
The Pod Sandbox can enter in a NotReady state if the task associated
with it no longer exists (it died, or it was killed). In this state,
the Pod network namespace could still be open, which means we can't
remove the sandbox, even if --force was used.
Signed-off-by: Claudiu Belu <cbelu@cloudbasesolutions.com>
With the introduction of Windows Server 2022, some images have been updated
to support WS2022 in their manifest list. This commit updates the test images
accordingly.
Signed-off-by: Adelina Tuvenie <atuvenie@cloudbasesolutions.com>
CRI container runtimes mount devices (set via kubernetes device plugins)
to containers by taking the host user/group IDs (uid/gid) to the
corresponding container device.
This triggers a problem when trying to run those containers with
non-zero (root uid/gid = 0) uid/gid set via runAsUser/runAsGroup:
the container process has no permission to use the device even when
its gid is permissive to non-root users because the container user
does not belong to that group.
It is possible to workaround the problem by manually adding the device
gid(s) to supplementalGroups. However, this is also problematic because
the device gid(s) may have different values depending on the workers'
distro/version in the cluster.
This patch suggests to take RunAsUser/RunAsGroup set via SecurityContext
as the device UID/GID, respectively. The feature must be enabled by
setting device_ownership_from_security_context runtime config value to
true (valid on Linux only).
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Go 1.15.7 contained a security fix for CVE-2021-3115, which allowed arbitrary
code to be executed at build time when using cgo on Windows. This issue also
affects Unix users who have “.” listed explicitly in their PATH and are running
“go get” outside of a module or with module mode disabled.
This issue is not limited to the go command itself, and can also affect binaries
that use `os.Command`, `os.LookPath`, etc.
From the related blogpost (ttps://blog.golang.org/path-security):
> Are your own programs affected?
>
> If you use exec.LookPath or exec.Command in your own programs, you only need to
> be concerned if you (or your users) run your program in a directory with untrusted
> contents. If so, then a subprocess could be started using an executable from dot
> instead of from a system directory. (Again, using an executable from dot happens
> always on Windows and only with uncommon PATH settings on Unix.)
>
> If you are concerned, then we’ve published the more restricted variant of os/exec
> as golang.org/x/sys/execabs. You can use it in your program by simply replacing
This patch replaces all uses of `os/exec` with `golang.org/x/sys/execabs`. While
some uses of `os/exec` should not be problematic (e.g. part of tests), it is
probably good to be consistent, in case code gets moved around.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
There was recent changes to cri to bring in a Windows section containing a
security context object to the pod config. Before this there was no way to specify
a user for the pod sandbox container to run as. In addition, the security context
is a field for field mirror of the Windows container version of it, so add the
ability to specify a GMSA credential spec for the pod sandbox container as well.
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
Exclude the `security.selinux` xattr when copying content from layer
storage for image volumes. This allows for the already correct label
at the target location to be applied to the copied content, thus
enabling containers to write to volumes that they implicitly expect to be
able to write to.
- Fixescontainerd/containerd#5090
- See rancher/rke2#690
Signed-off-by: Jacob Blain Christen <jacob@rancher.com>
Refactor shim v2 to load and register plugins.
Update init shim interface to not require task service implementation on
returned service, but register as plugin if it is.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Remove build tags which are already implied by the name of the file.
Ensures build tags are used consistently
Signed-off-by: Derek McGowan <derek@mcg.dev>
In containerd 1.5.x, we introduced support for go modules by adding a
go.mod file in the root directory. This go.mod lists all the things
needed across the whole code base (with the exception of
integration/client which has its own go.mod). So when projects that
need to make calls to containerd API will pull in some code from
containerd/containerd, the `go mod` commands will add all the things
listed in the root go.mod to the projects go.mod file. This causes
some problems as the list of things needed to make a simple API call
is enormous. in effect, making a API call will pull everything that a
typical server needs as well as the root go.mod is all encompassing.
In general if we had smaller things folks could use, that will make it
easier by reducing the number of things that will end up in a consumers
go.mod file.
Now coming to a specific problem, the root containerd go.mod has various
k8s.io/* modules listed. Also kubernetes depends on containerd indirectly
via both moby/moby (working with docker maintainers seperately) and via
google/cadvisor. So when the kubernetes maintainers try to use latest
1.5.x containerd, they will see the kubernetes go.mod ending up depending
on the older version of kubernetes!
So if we can expose just the minimum things needed to make a client API
call then projects like cadvisor can adopt that instead of pulling in
the entire go.mod from containerd. Looking at the existing code in
cadvisor the minimum things needed would be the api/ directory from
containerd. Please see proof of concept here:
github.com/google/cadvisor/pull/2908
To enable that, in this PR, we add a go.mod file in api/ directory. we
split the Protobuild.yaml into two, one for just the things in api/
directory and the rest in the root directory. We adjust various targets
to build things correctly using `protobuild` and also ensure that we
end up with the same generated code as before as well. To ensure we
better take care of the various go.mod/go.sum files, we update the
existing `make vendor` and also add a new `make verify-vendor` that one
can run locally as well in the CI.
Ideally, we would have a `containerd/client` either as a standalone repo
or within `containerd/containerd` as a separate go module. but we will
start here to experiment with a standalone api go module first.
Also there are various follow ups we can do, for example @thaJeztah has
identified two tasks we could do after this PR lands:
github.com/containerd/containerd/pull/5716#discussion_r668821396
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
systemd uses SIGRTMIN+n signals, but containerd didn't support the signals
since Go's sys/unix doesn't support them.
This change introduces SIGRTMIN+n handling by utilizing moby/sys/signal.
Fixes#5402.
https://www.freedesktop.org/software/systemd/man/systemd.html#Signals
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
Before this change, for several of the services that `WithServices`
handles, only the grpc client is supported.
Now, for instance, one can use an `images.Store` directly instead of
only an `imagesapi.StoreSlient`.
Some of the methods have been renamed to satisfy the difference between
using a grpc `<Foo>Client` vs the main interface.
I did not see a good candidate for TaskService so have left that mostly
unchanged.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
CNI plugins that need to wait for network state to converge
may want to cancel waiting when a short lived pod is deleted.
However, there is a race between when kubelet asks the runtime
to create the sandbox for the pod, and when the plugin is able
request the pod object from the apiserver. It may be the case
that the plugin receives the new pod, rather than the pod
the sandbox request was initiated for.
Passing the pod UID to the plugin allows the plugin to check
whether the pod it gets from the apiserver is actually the
pod its sandbox request was started for.
Signed-off-by: Dan Williams <dcbw@redhat.com>
Similar to other deferred cleanup operations, teardownPodNetwork should
use a different context as the original context may have expired,
otherwise CNI wouldn't been invoked, leading to leak of network
resources, e.g. IP addresses.
Signed-off-by: Quan Tian <qtian@vmware.com>
This change splits the definition of pkg/cri/os.ResolveSymbolicLink by
platform (windows/!windows), and switches to an alternate implementation
for Windows. This aims to fix the issue described in containerd/containerd#5405.
The previous implementation which just called filepath.EvalSymlinks has
historically had issues on Windows. One of these issues we were able to
fix in Go, but EvalSymlinks's behavior is not well specified on
Windows, and there could easily be more issues in the future, so it
seems prudent to move to a separate implementation for Windows.
The new implementation uses the Windows GetFinalPathNameByHandle API,
which takes a handle to an open file or directory and some flags, and
returns the "real" name for the object. See comments in the code for
details on the implementation.
I have tested this change with a variety of mounts and everything seems
to work as expected. Functions that make incorrect assumptions on what a
Windows path can look like may have some trouble with the \\?\ path
syntax. For instance EvalSymlinks fails when given a \\?\UNC\ path. For
this reason, the resolvePath implementation modifies the returned path
to translate to the more common form (\\?\UNC\server\share ->
\\server\share).
Signed-off-by: Kevin Parsons <kevpar@microsoft.com>
This commit adds support for the PID namespace mode TARGET
when generating a container spec.
The container that is created will be sharing its PID namespace
with the target container that was specified by ID in the namespace
options.
Signed-off-by: Thomas Hartland <thomas.george.hartland@cern.ch>
It does not make sense to check if seccomp is supported by the kernel
more than once per runtime, so let's use sync.Once to speed it up.
A quick benchmark (old implementation, before this commit, after):
BenchmarkIsEnabledOld-4 37183 27971 ns/op
BenchmarkIsEnabled-4 1252161 947 ns/op
BenchmarkIsEnabledOnce-4 666274008 2.14 ns/op
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Current implementation of seccomp.IsEnabled (rooted in runc) is not
too good.
First, it parses the whole /proc/self/status, adding each key: value
pair into the map (lots of allocations and future work for garbage
collector), when using a single key from that map.
Second, the presence of "Seccomp" key in /proc/self/status merely means
that kernel option CONFIG_SECCOMP is set, but there is a need to _also_
check for CONFIG_SECCOMP_FILTER (the code for which exists but never
executed in case /proc/self/status has Seccomp key).
Replace all this with a single call to prctl; see the long comment in
the code for details.
While at it, improve the IsEnabled documentation.
NOTE historically, parsing /proc/self/status was added after a concern
was raised in https://github.com/opencontainers/runc/pull/471 that
prctl(PR_GET_SECCOMP, ...) can result in the calling process being
killed with SIGKILL. This is a valid concern, so the new code here
does not use PR_GET_SECCOMP at all.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Looks like we had our own copy of the "getDevices" code already, so use
that code (which also matches the code that's used to _generate_ the spec,
so a better match).
Moving the code to a separate file, I also noticed that the _unix and _linux
code was _exactly_ the same (baring some `//nolint:` comments), so also
removing the duplicated code.
With this patch applied, we removed the dependency on the libcontainer/devices
package (leaving only libcontainer/user).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Move `pkg/cri/opts.WithoutRunMount` function to `oci.WithoutRunMount`
so that it can be used without dependency on CRI.
Also add `oci.WithoutMounts(dests ...string)` for generality.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Currently image references end up being stored in a
random order due to the way maps are iterated through
in Go. This leads to inconsistent identifiers being
resolved when a single reference is needed to identify
an image and the ordering of the references is used for
the selection.
Sort references in a consistent and ranked manner,
from higher information formats to lower.
Note: A `name + tag` reference is considered higher
information than a `name + digest` reference since a
registry may be used to resolve the digest from a
`name + tag` reference.
Signed-off-by: Derek McGowan <derek@mcg.dev>
golang has enabled RFC 6555 Fast Fallback (aka HappyEyeballs)
by default in 1.12.
It means that if a host resolves to both IPv6 and IPv4,
it will try to connect to any of those addresses and use the
working connection.
However, the implementation uses go routines to start both connections in parallel,
and this has limitations when running inside a namespace, so we try to the connections
serially, trying IPv4 first for keeping the same behaviour.
xref https://github.com/golang/go/issues/44922
Signed-off-by: Antonio Ojea <aojea@redhat.com>
- process.Init#io could be nil
- Make sure CreateTaskRequest#Options is not empty before unmarshaling
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
This enables cases where devices exist in a subdirectory of /dev,
particularly where those device names are not portable across machines,
which makes it problematic to specify from a runtime such as cri.
Added this to `ctr` as well so I could test that the code at least
works.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This will be used instead of the cri registry config in the main config
toml.
---
Also pulls in changes from containerd/cri@d0b4eecbb3
Signed-off-by: Brian Goff <cpuguy83@gmail.com>