recent versions of libcontainer/apparmor simplified the AppArmor
check to only check if the host supports AppArmor, but no longer
checks if apparmor_parser is installed, or if we're running
docker-in-docker;
bfb4ea1b1b
> The `apparmor_parser` binary is not really required for a system to run
> AppArmor from a runc perspective. How to apply the profile is more in
> the responsibility of higher level runtimes like Podman and Docker,
> which may do the binary check on their own.
This patch copies the logic from libcontainer/apparmor, and
restores the additional checks.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This is a followup to #4699 that addresses an oversight that could cause
the CRI to relabel the host /dev/shm, which should be a no-op in most
cases. Additionally, fixes unit tests to make correct assertions for
/dev/shm relabeling.
Discovered while applying the changes for #4699 to containerd/cri 1.4:
https://github.com/containerd/cri/pull/1605
Signed-off-by: Jacob Blain Christen <jacob@rancher.com>
Address an issue originally seen in the k3s 1.3 and 1.4 forks of containerd/cri, https://github.com/rancher/k3s/issues/2240
Even with updated container-selinux policy, container-local /dev/shm
will get mounted with container_runtime_tmpfs_t because it is a tmpfs
created by the runtime and not the container (thus, container_runtime_t
transition rules apply). The relabel mitigates such, allowing envoy
proxy to work correctly (and other programs that wish to write to their
/dev/shm) under selinux.
Tested locally with:
- SELINUX=Enforcing vagrant up --provision-with=shell,selinux,test-integration
- SELINUX=Enforcing CRITEST_ARGS=--ginkgo.skip='HostIpc is true' vagrant up --provision-with=shell,selinux,test-cri
- SELINUX=Permissive CRITEST_ARGS=--ginkgo.focus='HostIpc is true' vagrant up --provision-with=shell,selinux,test-cri
Signed-off-by: Jacob Blain Christen <jacob@rancher.com>
Made a change yesterday that passed through snapshotter labels into the wrapper of
WithNewSnapshot, but it passed the entirety of the annotations into the snapshotter.
This change just filters the set that we care about down to snapshotter specific
labels.
Will probably be future changes to add some more labels for LCOW/WCOW and the corresponding
behavior for these new labels.
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
Previously there wwasn't a way to pass any labels to snapshotters as the wrapper
around WithNewSnapshot didn't have a parm to pass them in.
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
The tricks performed by ensureRemoveAll only make sense for Linux and
other Unices, so separate it out, and make ensureRemoveAll for Windows
just an alias of os.RemoveAll.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
In containerd, there is a size limit for label size (4096 chars).
Currently if an image has many layers (> (4096-39)/72 > 56),
`containerd.io/snapshot/cri.image-layers` will hit the limit of label size and
the unpack will fail.
This commit fixes this by limiting the size of the annotation.
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
The default values of masked and readonly paths are defined
in populateDefaultUnixSpec, and are used when a sandbox is
created. It is not, however, used for new containers. If
a container definition does not contain a security context
specifying masked/readonly paths, a container created from
it does not have masked and readonly paths.
This patch applies the default values to masked and
readonly paths of a new container, when any specific values
are not specified.
Fixes#1569
Signed-off-by: Yohei Ueda <yohei@jp.ibm.com>
Currently the shims only support starting the logging binary process if the
io.Creator Config does not specify Terminal: true. This means that the program
using containerd will only be able to specify FIFO io when Terminal: true,
rather than allowing the shim to fork the logging binary process. Hence,
containerd consumers face an inconsistent behavior regarding logging binary
management depending on the Terminal option.
Allowing the shim to fork the logging binary process will introduce consistency
between the running container and the logging process. Otherwise, the logging
process may die if its parent process dies whereas the container will keep
running, resulting in the loss of container logs.
Signed-off-by: Akshat Kumar <kshtku@amazon.com>
This allows development with container to be done for NRI without the need for
custom builds.
This is an experimental feature and is not enabled unless a user has a global
`/etc/nri/conf.json` config setup with plugins on the system. No NRI code will
be executed if this config file does not exist.
Signed-off-by: Michael Crosby <michael@thepasture.io>
This commit adds a config flag for allowing GC to clean layer contents up after
unpacking these contents completed, which leads to deduplication of layer
contents between the snapshotter and the contnet store.
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
This allows an admin to set the upper bounds on the category range for selinux
labels. This can be useful when handling allocation of PVs or other volume
types that need to be shared with selinux enabled on the hosts and volumes.
Signed-off-by: Michael Crosby <michael@thepasture.io>
Adds support to mount named pipes into Windows containers. This support
already exists in hcsshim, so this change just passes them through
correctly in cri. Named pipe mounts must start with "\\.\pipe\".
Signed-off-by: Kevin Parsons <kevpar@microsoft.com>
Reason: originally it was introduced to prevent the loading of the SCTP kernel module on the nodes. But iptables chain creation alone does not load the kernel module. The module would be loaded if an SCTP socket was created, but neither cri nor the portmap CNI plugin starts managing SCTP sockets if hostPort / portmappings are defined.
Signed-off-by: Laszlo Janosi <laszlo.janosi@ibm.com>
This adds a configuration knob for adding request headers to all
registry requests. It is not namespaced to a registry.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
How to test (from https://github.com/opencontainers/runc/pull/2352#issuecomment-620834524):
(host)$ sudo swapoff -a
(host)$ sudo ctr run -t --rm --memory-limit $((1024*1024*32)) docker.io/library/alpine:latest foo
(container)$ sh -c 'VAR=$(seq 1 100000000)'
An event `/tasks/oom {"container_id":"foo"}` will be displayed in `ctr events`.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
This moves most of the API calls off of the `labels` package onto the root
selinux package. This is the newer API for most selinux operations.
Signed-off-by: Michael Crosby <michael@thepasture.io>
We encountered two failing end-to-end tests after the adoption of
https://github.com/containerd/cri/pull/1470 in
https://github.com/cri-o/cri-o/pull/3749:
```
Summarizing 2 Failures:
[Fail] [sig-cli] Kubectl Port forwarding With a server listening on 0.0.0.0 that expects a client request [It] should support a client that connects,
sends DATA, and disconnects
test/e2e/kubectl/portforward.go:343
[Fail] [sig-cli] Kubectl Port forwarding With a server listening on localhost that expects a client request [It] should support a client that connects
, sends DATA, and disconnects
test/e2e/kubectl/portforward.go:343
```
Increasing the timeout to 1s fixes the issue.
Signed-off-by: Sascha Grunert <sgrunert@suse.com>
This swaps the RunningInUserNS() function that we're using
from libcontainer/system with the one in containerd/sys.
This removes the dependency on libcontainer/system, given
these were the only functions we're using from that package.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The `DualStack` option was deprecated in Go 1.12, and is now enabled by default
(through commit github.com/golang/go@efc185029bf770894defe63cec2c72a4c84b2ee9).
> The Dialer.DualStack field is now meaningless and documented as deprecated.
>
> To disable fallback, set FallbackDelay to a negative value.
The default `FallbackDelay` is 300ms; to make this more explicit, this patch
sets `FallbackDelay` to the default value.
Note that Docker Hub currently does not support IPv6 (DNS for registry-1.docker.io
has no AAAA records, so we should not hit the 300ms delay).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
use goroutines to copy the data from the stream to the TCP
connection, and viceversa, removing the socat dependency.
Quoting Lantao Liu, the logic is as follow:
When one side (either pod side or user side) of portforward
is closed, we should stop port forwarding.
When one side is closed, the io.Copy use that side as source will close,
but the io.Copy use that side as dest won't.
Signed-off-by: Antonio Ojea <antonio.ojea.garcia@gmail.com>
This changes adds `default_seccomp_profile` config switch to apply default seccomp profile when not provided by k8s.a
Signed-off-by: Maksym Pavlenko <pavlenko.maksym@gmail.com>
Should destroy the pod network if fails to setup or return invalid
net interface, especially multiple CNI configurations.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
Currently, CRI plugin passes each layer digest to remote snapshotters
sequentially, which leads to sequential snapshots preparation. But it costs
extra time especially for remote snapshotters which need to connect to the
remote backend store (e.g. registries) for checking the snapshot existence on
each preparation.
This commit solves this problem by introducing new label
`containerd.io/snapshot/cri.chain` for passing all layer digests in an image to
snapshotters and by allowing them to prepare these snapshots in parallel, which
leads to speed up the preparation.
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
For a privilege pods with PrivilegedWithoutHostDevices set to true
host device specified in the config are not provided (whereas it is done for
non privilege pods or privilege pods with PrivilegedWithoutHostDevices set
to false as all devices are included).
Add them in this case.
Fixes: 3353ab76d9 ("Add flag to overload default privileged host device behaviour")
Signed-off-by: Thibaut Collet <thibaut.collet@6wind.com>
containerd loads timeout values from config.toml and populated those
values to `timeout` package at launch. So when using `timeout` package
from shim, there are default values and config file is ignored.
So use a hardcoded value for binary IO.
Signed-off-by: Maksym Pavlenko <makpav@amazon.com>
Throughout container lifecycle, pulling image is one of the time-consuming
steps. Recently, containerd community started to tackle this issue with
stargz-based remote snapshots, as a non-core
subproject(https://github.com/containerd/stargz-snapshotter).
This snapshotter is implemented as a standard proxy plugin but it requires the
client to pass some additional information (image ref and layer digest) for each
pull operation to query layer contents on the registry. Stargz snapshotter
project provides an image handler to do this and stargz snapshot users need to
pass this handler to containerd client.
This commit enables to use stargz-based remote snapshots through CRI by passing
the handler to containerd client on pull operation.
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
Along with type(Sandbox or Container) and Sandbox name annotations
provide support for additional annotation:
- Container name
This will help us perform per container operation by comparing it
with pass through annotations (eg. pod metadata annotations from K8s)
Signed-off-by: Chethan Suresh <Chethan.Suresh@sony.com>
With go RWMutex design, no goroutine should expect to be able to
acquire a read lock until the read lock has been released, if one
goroutine call lock.
The original design is to reload cni network config on every single
Status CRI gRPC call. If one RunPodSandbox request holds read lock
to allocate IP for too long, all other RunPodSandbox/StopPodSandbox
requests will wait for the RunPodSandbox request to release read lock.
And the Status CRI call will fail and kubelet becomes NOTReady.
Reload cni network config at every single Status CRI call is not
necessary and also brings NOTReady situation. To lower the possibility
of NOTReady, CRI will reload cni network config if there is any valid fs
change events from the cni network config dir.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
The resize chan is never closed when doing exec/attach now. What's more,
`resize` is a recieved only chan so it can not be closed. Use ctx to
exit the goroutine in `handleResizing` properly.
Signed-off-by: Li Yuxuan <liyuxuan04@baidu.com>
Instead of having several dialer implementations, leave only one in
`pkg/dialer` and call it from `pkg/ttrpcutil`, `runtime/v(1|2)/shim`
which had their own
Closes#3471.
Signed-off-by: Kiril Vladimiroff <kiril@vladimiroff.org>
The pkg/store errors are duplicated errors of NotFound and AlreadyExist from
containerd's errdefs package and thus do not properly serialize when running
errdefs.ToGRPC on them. CRI runs this function on every return from a CRI
method so the conversion fails if there is a cache miss from the store caches
for containers or sandboxes. This change verifies that the errors are properly
converted to their gRPC values.
Signed-off-by: Justin Terry (VM) <juterry@microsoft.com>
In cgroup v1 container implementations, cgroupns is not used by default because
it was not available in the kernel until kernel 4.6 (May 2016), and the default
behavior will not change on cgroup v1 environments, because changing the
default will break compatibility and surprise users.
For cgroup v2, implementations are going to unshare cgroupns by default
so as to hide /sys/fs/cgroup from containers.
* Discussion: https://github.com/containers/libpod/issues/4363
* Podman PR (merged): https://github.com/containers/libpod/pull/4374
* Moby PR: https://github.com/moby/moby/pull/40174
This PR enables cgroupns for containers, but pod sandboxes are untouched
because probably there is no need to do.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
The former default runtime "io.containerd.runc.v1" won't support new features
like support for cgroup v2: containerd/containerd#3726
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Reized the I/O buffers to align with the size of the kernel buffers with fifos
and move the close aspect of the console to key off of the stdin closing.
Fixes#3738
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
1. For Windows the Hostname property is not inherited from the sandbox and must
be passed for the Workload container activations as well.
Signed-off-by: Justin Terry (VM) <juterry@microsoft.com>
Due to changes to the defaults in containerd, the CRI path to creating a
container OCI config needs to add back in the default UNIX $PATH (and
any other defaults) as that is the expected behavior from other
runtimes.
Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com>
The climan package has a command that can be registered with any urfav
cli app to generate man pages.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
- empty username means caller wants to use no credentials, typically for anonymous registry
- Fixes https://github.com/containerd/cri/issues/1249
Signed-off-by: Nishchay Kumar <mrawesomenix@gmail.com>
This adds a singleton `timeout` package that will allow services and user
to configure timeouts in the daemon. When a service wants to use a
timeout, it should declare a const and register it's default value
inside an `init()` function for that package. When the default config
is generated, we can use the `timeout` package to provide the available
timeout keys so that a user knows that they can configure.
These show up in the config as follows:
```toml
[timeouts]
"io.containerd.timeout.shim.cleanup" = 5
"io.containerd.timeout.shim.load" = 5
"io.containerd.timeout.shim.shutdown" = 3
"io.containerd.timeout.task.state" = 2
```
Timeouts in the config are specified in seconds.
Timeouts are very hard to get right and giving this power to the user to
configure things is a huge improvement. Machines can be faster and
slower and depending on the CPU or load of the machine, a timeout may
need to be adjusted.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
We are separating out the encryption code and have designed a few new
interfaces and APIs for processing content streams. This keep the core
clean of encryption code but enables not only encryption but support of
multiple content types ( custom media types ).
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This commit adds a flag to the runtime config that allows overloading of the default
privileged behaviour. When the flag is enabled on a runtime, host devices won't
be appended to the runtime spec if the container is run as privileged.
By default the flag is false to maintain the current behaviour of privileged.
Fixes#1213
Signed-off-by: Alex Price <aprice@atlassian.com>
By default the SecurityContext for Container activation can contain a Username
UID, GID. The order of precedences is username, UID, GID. If none of these
options are specified as a last resort attempt to set the ImageSpec username.
Signed-off-by: Justin Terry (VM) <juterry@microsoft.com>
Moves to the containerd/log package over logrus directly. This benefits the
traces because if using any log context such as OpenCensus on the entry gRPC
API all traces for that gRPC method will now contain the appropriate TraceID,
SpanID for easy correlation.
Signed-off-by: Justin Terry (VM) <juterry@microsoft.com>
Use a Pipe() rather than a file to pass the passphrase to the command
line tool. Pass the file descriptor to read the passphrase from as fd '3'.
Signed-off-by: Stefan Berger <stefanb@linux.ibm.com>
Rather than passing the passphrase via command line write it into
a temp. file and pass the name of the file using passphrase-file option.
Signed-off-by: Stefan Berger <stefanb@linux.ibm.com>
A call to ExecSync should only return if the client context was canceled or
exceeded. The Timeout parameter to ExecSyncRequest is now used to send SIGKILL
if the exec'd process does not exit within Timeout but all paths wait for the
exec to exit.
Signed-off-by: Justin Terry (VM) <juterry@microsoft.com>
A call to StopContainer should only return if the client context is canceled or
its deadline was exceeded. The Timeout parameter on StopContainerRequest is now
used as the time AFTER sending the stop signal before the SIGKILL is delivered.
The call will remain until the container has exited or the client context has
finished.
Signed-off-by: Justin Terry (VM) <juterry@microsoft.com>
A call to RunPodSandbox should only return timeout if the operation has timed
out because the clients context deadline was exceeded. On client cancelation
it should return gRPC Canceled otherwise it should block until the sandbox has
exited.
Signed-off-by: Justin Terry (VM) <juterry@microsoft.com>
Minor correctness. We should use the value of the const in the error message
instead of hard coding it in the string so if maxDNSSearches ever changes so
does the error.
Signed-off-by: Justin Terry (VM) <juterry@microsoft.com>
Only compose full container log path if neither of the paths is empty. Otherwise container won't start properly.
Signed-off-by: Cong Liu <conliu@google.com>
To support cross compilation for SIG* signals perfer the syscall package over
the unix package.
Signed-off-by: Justin Terry (VM) <juterry@microsoft.com>