pkg/cap has the full list of the caps (for UT, originally),
so we can drop dependency on github.com/syndtr/gocapability
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
This change is needed for running the latest containerd inside Docker
that is not aware of the recently added caps (BPF, PERFMON, CHECKPOINT_RESTORE).
Without this change, containerd inside Docker fails to run containers with
"apply caps: operation not permitted" error.
See kubernetes-sigs/kind 2058
NOTE: The caller process of this function is now assumed to be as
privileged as possible.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Newer golangci-lint needs explicit `//` separator. Otherwise it treats
the entire line (`staticcheck deprecated ... yet`) as a name.
https://golangci-lint.run/usage/false-positives/#nolint
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
Looks like this import was not needed for the test; simplified the test
by just using the device-path (a counter would work, but for debugging,
having the list of paths can be useful).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
For some tools having the actual image name in the annotations is helpful for
debugging and auditing the workload.
Signed-off-by: Michael Crosby <michael@thepasture.io>
bump version 1.3.2 for gogo/protobuf due to CVE-2021-3121 discovered
in gogo/protobuf version 1.3.1, CVE has been fixed in 1.3.2
Signed-off-by: Aditi Sharma <adi.sky17@gmail.com>
This was dumping untrusted output to the debug logs from user containers.
We should not dump this type of information to reduce log sizes and any
information leaks from user containers.
Signed-off-by: Michael Crosby <michael@thepasture.io>
The event monitor handles exit events one by one. If there is something
wrong about deleting task, it will slow down the terminating Pods. In
order to reduce the impact, the exit event watcher should handle exit
event separately. If it failed, the watcher should put it into backoff
queue and retry it.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
Use a PrefixFilter() to get only the mounts we're interested in,
which removes the need to manually filter mounts from the mountinfo
results.
Additional optimizations can be made, as:
> ... there's a little known fact that `umount(MNT_DETACH)` is actually
> recursive in Linux, IOW this function can be replaced with
> `unix.Umount(target, unix.MNT_DETACH)` (or `mount.UnmountAll(target, unix.MNT_DETACH)`
> (provided that target itself is a mount point).
e8fb2c392f (r535450446)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
containerd is responsible for creating the log but there is no code to ensure
that the log dir exists. While kubelet should have created this there can be
times where this is not the case and this can cause stuck tasks.
Signed-off-by: Michael Crosby <michael@thepasture.io>
The build tag was removed in go-selinux v1.8.0: opencontainers/selinux#132
Related: remove "apparmor" build tag: 0a9147f3aa
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
The "apparmor" build tag does not have any cgo dependency and can be removed safely.
Related: https://github.com/opencontainers/runc/issues/2704
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
When a base runtime spec is being used, admins can configure defaults for the
spec so that default ulimits or other security related settings get applied for
all containers launched.
Signed-off-by: Michael Crosby <michael@thepasture.io>
recent versions of libcontainer/apparmor simplified the AppArmor
check to only check if the host supports AppArmor, but no longer
checks if apparmor_parser is installed, or if we're running
docker-in-docker;
bfb4ea1b1b
> The `apparmor_parser` binary is not really required for a system to run
> AppArmor from a runc perspective. How to apply the profile is more in
> the responsibility of higher level runtimes like Podman and Docker,
> which may do the binary check on their own.
This patch copies the logic from libcontainer/apparmor, and
restores the additional checks.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This is a followup to #4699 that addresses an oversight that could cause
the CRI to relabel the host /dev/shm, which should be a no-op in most
cases. Additionally, fixes unit tests to make correct assertions for
/dev/shm relabeling.
Discovered while applying the changes for #4699 to containerd/cri 1.4:
https://github.com/containerd/cri/pull/1605
Signed-off-by: Jacob Blain Christen <jacob@rancher.com>
Address an issue originally seen in the k3s 1.3 and 1.4 forks of containerd/cri, https://github.com/rancher/k3s/issues/2240
Even with updated container-selinux policy, container-local /dev/shm
will get mounted with container_runtime_tmpfs_t because it is a tmpfs
created by the runtime and not the container (thus, container_runtime_t
transition rules apply). The relabel mitigates such, allowing envoy
proxy to work correctly (and other programs that wish to write to their
/dev/shm) under selinux.
Tested locally with:
- SELINUX=Enforcing vagrant up --provision-with=shell,selinux,test-integration
- SELINUX=Enforcing CRITEST_ARGS=--ginkgo.skip='HostIpc is true' vagrant up --provision-with=shell,selinux,test-cri
- SELINUX=Permissive CRITEST_ARGS=--ginkgo.focus='HostIpc is true' vagrant up --provision-with=shell,selinux,test-cri
Signed-off-by: Jacob Blain Christen <jacob@rancher.com>
Made a change yesterday that passed through snapshotter labels into the wrapper of
WithNewSnapshot, but it passed the entirety of the annotations into the snapshotter.
This change just filters the set that we care about down to snapshotter specific
labels.
Will probably be future changes to add some more labels for LCOW/WCOW and the corresponding
behavior for these new labels.
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
Previously there wwasn't a way to pass any labels to snapshotters as the wrapper
around WithNewSnapshot didn't have a parm to pass them in.
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
The tricks performed by ensureRemoveAll only make sense for Linux and
other Unices, so separate it out, and make ensureRemoveAll for Windows
just an alias of os.RemoveAll.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
In containerd, there is a size limit for label size (4096 chars).
Currently if an image has many layers (> (4096-39)/72 > 56),
`containerd.io/snapshot/cri.image-layers` will hit the limit of label size and
the unpack will fail.
This commit fixes this by limiting the size of the annotation.
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
The default values of masked and readonly paths are defined
in populateDefaultUnixSpec, and are used when a sandbox is
created. It is not, however, used for new containers. If
a container definition does not contain a security context
specifying masked/readonly paths, a container created from
it does not have masked and readonly paths.
This patch applies the default values to masked and
readonly paths of a new container, when any specific values
are not specified.
Fixes#1569
Signed-off-by: Yohei Ueda <yohei@jp.ibm.com>
Currently the shims only support starting the logging binary process if the
io.Creator Config does not specify Terminal: true. This means that the program
using containerd will only be able to specify FIFO io when Terminal: true,
rather than allowing the shim to fork the logging binary process. Hence,
containerd consumers face an inconsistent behavior regarding logging binary
management depending on the Terminal option.
Allowing the shim to fork the logging binary process will introduce consistency
between the running container and the logging process. Otherwise, the logging
process may die if its parent process dies whereas the container will keep
running, resulting in the loss of container logs.
Signed-off-by: Akshat Kumar <kshtku@amazon.com>
This allows development with container to be done for NRI without the need for
custom builds.
This is an experimental feature and is not enabled unless a user has a global
`/etc/nri/conf.json` config setup with plugins on the system. No NRI code will
be executed if this config file does not exist.
Signed-off-by: Michael Crosby <michael@thepasture.io>
This commit adds a config flag for allowing GC to clean layer contents up after
unpacking these contents completed, which leads to deduplication of layer
contents between the snapshotter and the contnet store.
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
This allows an admin to set the upper bounds on the category range for selinux
labels. This can be useful when handling allocation of PVs or other volume
types that need to be shared with selinux enabled on the hosts and volumes.
Signed-off-by: Michael Crosby <michael@thepasture.io>
Adds support to mount named pipes into Windows containers. This support
already exists in hcsshim, so this change just passes them through
correctly in cri. Named pipe mounts must start with "\\.\pipe\".
Signed-off-by: Kevin Parsons <kevpar@microsoft.com>
Reason: originally it was introduced to prevent the loading of the SCTP kernel module on the nodes. But iptables chain creation alone does not load the kernel module. The module would be loaded if an SCTP socket was created, but neither cri nor the portmap CNI plugin starts managing SCTP sockets if hostPort / portmappings are defined.
Signed-off-by: Laszlo Janosi <laszlo.janosi@ibm.com>
This adds a configuration knob for adding request headers to all
registry requests. It is not namespaced to a registry.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
How to test (from https://github.com/opencontainers/runc/pull/2352#issuecomment-620834524):
(host)$ sudo swapoff -a
(host)$ sudo ctr run -t --rm --memory-limit $((1024*1024*32)) docker.io/library/alpine:latest foo
(container)$ sh -c 'VAR=$(seq 1 100000000)'
An event `/tasks/oom {"container_id":"foo"}` will be displayed in `ctr events`.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
This moves most of the API calls off of the `labels` package onto the root
selinux package. This is the newer API for most selinux operations.
Signed-off-by: Michael Crosby <michael@thepasture.io>
We encountered two failing end-to-end tests after the adoption of
https://github.com/containerd/cri/pull/1470 in
https://github.com/cri-o/cri-o/pull/3749:
```
Summarizing 2 Failures:
[Fail] [sig-cli] Kubectl Port forwarding With a server listening on 0.0.0.0 that expects a client request [It] should support a client that connects,
sends DATA, and disconnects
test/e2e/kubectl/portforward.go:343
[Fail] [sig-cli] Kubectl Port forwarding With a server listening on localhost that expects a client request [It] should support a client that connects
, sends DATA, and disconnects
test/e2e/kubectl/portforward.go:343
```
Increasing the timeout to 1s fixes the issue.
Signed-off-by: Sascha Grunert <sgrunert@suse.com>
This swaps the RunningInUserNS() function that we're using
from libcontainer/system with the one in containerd/sys.
This removes the dependency on libcontainer/system, given
these were the only functions we're using from that package.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The `DualStack` option was deprecated in Go 1.12, and is now enabled by default
(through commit github.com/golang/go@efc185029bf770894defe63cec2c72a4c84b2ee9).
> The Dialer.DualStack field is now meaningless and documented as deprecated.
>
> To disable fallback, set FallbackDelay to a negative value.
The default `FallbackDelay` is 300ms; to make this more explicit, this patch
sets `FallbackDelay` to the default value.
Note that Docker Hub currently does not support IPv6 (DNS for registry-1.docker.io
has no AAAA records, so we should not hit the 300ms delay).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
use goroutines to copy the data from the stream to the TCP
connection, and viceversa, removing the socat dependency.
Quoting Lantao Liu, the logic is as follow:
When one side (either pod side or user side) of portforward
is closed, we should stop port forwarding.
When one side is closed, the io.Copy use that side as source will close,
but the io.Copy use that side as dest won't.
Signed-off-by: Antonio Ojea <antonio.ojea.garcia@gmail.com>
This changes adds `default_seccomp_profile` config switch to apply default seccomp profile when not provided by k8s.a
Signed-off-by: Maksym Pavlenko <pavlenko.maksym@gmail.com>
Should destroy the pod network if fails to setup or return invalid
net interface, especially multiple CNI configurations.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
Currently, CRI plugin passes each layer digest to remote snapshotters
sequentially, which leads to sequential snapshots preparation. But it costs
extra time especially for remote snapshotters which need to connect to the
remote backend store (e.g. registries) for checking the snapshot existence on
each preparation.
This commit solves this problem by introducing new label
`containerd.io/snapshot/cri.chain` for passing all layer digests in an image to
snapshotters and by allowing them to prepare these snapshots in parallel, which
leads to speed up the preparation.
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
For a privilege pods with PrivilegedWithoutHostDevices set to true
host device specified in the config are not provided (whereas it is done for
non privilege pods or privilege pods with PrivilegedWithoutHostDevices set
to false as all devices are included).
Add them in this case.
Fixes: 3353ab76d9 ("Add flag to overload default privileged host device behaviour")
Signed-off-by: Thibaut Collet <thibaut.collet@6wind.com>
containerd loads timeout values from config.toml and populated those
values to `timeout` package at launch. So when using `timeout` package
from shim, there are default values and config file is ignored.
So use a hardcoded value for binary IO.
Signed-off-by: Maksym Pavlenko <makpav@amazon.com>
Throughout container lifecycle, pulling image is one of the time-consuming
steps. Recently, containerd community started to tackle this issue with
stargz-based remote snapshots, as a non-core
subproject(https://github.com/containerd/stargz-snapshotter).
This snapshotter is implemented as a standard proxy plugin but it requires the
client to pass some additional information (image ref and layer digest) for each
pull operation to query layer contents on the registry. Stargz snapshotter
project provides an image handler to do this and stargz snapshot users need to
pass this handler to containerd client.
This commit enables to use stargz-based remote snapshots through CRI by passing
the handler to containerd client on pull operation.
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
Along with type(Sandbox or Container) and Sandbox name annotations
provide support for additional annotation:
- Container name
This will help us perform per container operation by comparing it
with pass through annotations (eg. pod metadata annotations from K8s)
Signed-off-by: Chethan Suresh <Chethan.Suresh@sony.com>
With go RWMutex design, no goroutine should expect to be able to
acquire a read lock until the read lock has been released, if one
goroutine call lock.
The original design is to reload cni network config on every single
Status CRI gRPC call. If one RunPodSandbox request holds read lock
to allocate IP for too long, all other RunPodSandbox/StopPodSandbox
requests will wait for the RunPodSandbox request to release read lock.
And the Status CRI call will fail and kubelet becomes NOTReady.
Reload cni network config at every single Status CRI call is not
necessary and also brings NOTReady situation. To lower the possibility
of NOTReady, CRI will reload cni network config if there is any valid fs
change events from the cni network config dir.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
The resize chan is never closed when doing exec/attach now. What's more,
`resize` is a recieved only chan so it can not be closed. Use ctx to
exit the goroutine in `handleResizing` properly.
Signed-off-by: Li Yuxuan <liyuxuan04@baidu.com>
Instead of having several dialer implementations, leave only one in
`pkg/dialer` and call it from `pkg/ttrpcutil`, `runtime/v(1|2)/shim`
which had their own
Closes#3471.
Signed-off-by: Kiril Vladimiroff <kiril@vladimiroff.org>
The pkg/store errors are duplicated errors of NotFound and AlreadyExist from
containerd's errdefs package and thus do not properly serialize when running
errdefs.ToGRPC on them. CRI runs this function on every return from a CRI
method so the conversion fails if there is a cache miss from the store caches
for containers or sandboxes. This change verifies that the errors are properly
converted to their gRPC values.
Signed-off-by: Justin Terry (VM) <juterry@microsoft.com>
In cgroup v1 container implementations, cgroupns is not used by default because
it was not available in the kernel until kernel 4.6 (May 2016), and the default
behavior will not change on cgroup v1 environments, because changing the
default will break compatibility and surprise users.
For cgroup v2, implementations are going to unshare cgroupns by default
so as to hide /sys/fs/cgroup from containers.
* Discussion: https://github.com/containers/libpod/issues/4363
* Podman PR (merged): https://github.com/containers/libpod/pull/4374
* Moby PR: https://github.com/moby/moby/pull/40174
This PR enables cgroupns for containers, but pod sandboxes are untouched
because probably there is no need to do.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
The former default runtime "io.containerd.runc.v1" won't support new features
like support for cgroup v2: containerd/containerd#3726
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Reized the I/O buffers to align with the size of the kernel buffers with fifos
and move the close aspect of the console to key off of the stdin closing.
Fixes#3738
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
1. For Windows the Hostname property is not inherited from the sandbox and must
be passed for the Workload container activations as well.
Signed-off-by: Justin Terry (VM) <juterry@microsoft.com>