userns.RunningInUserNS() checks if the code calling that function is
running inside a user namespace. But we need to check if the container
we will create will use a user namespace, in that case we need to
disable the sysctl too (or we would need to take the userns mapping into
account to set the IDs).
This was added in PR:
https://github.com/containerd/containerd/pull/6170/
And the param documentation says it is not enabled when user namespaces
are in use:
https://github.com/containerd/containerd/pull/6170/files#diff-91d0a4c61f6d3523b5a19717d1b40b5fffd7e392d8fe22aed7c905fe195b8902R118
I'm not sure if the intention was to disable this if containerd is
running inside a userns (rootless, if that is even supported) or just
when the pod has user namespaces.
Out of an abundance of caution, I'm keeping the userns.RunningInUserNS()
so it is still not used if containerd runs inside a user namespace.
With this patch and "enable_unprivileged_icmp = true" in the config,
running containerd as root on the host, pods with user namespaces start
just fine. Without this patch they fail with:
... failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: w
/proc/sys/net/ipv4/ping_group_range: invalid argument: unknown
Thanks a lot to Andy on the k8s slack for reporting the issue. He also
mentions he hits this with k3s on a default installation (the param
is off by default on containerd, but k3s turns that on by default it
seems). He also debugged which part of the stack was setting that
sysctl, found the PR that added this code in containerd and a workaround
(to turn the bool off).
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
All of the CRI sandbox and container specs all get assigned
almost the exact same default annotations (sandboxID, name, metadata,
container type etc.) so lets make a helper to return the right set for
a sandbox or regular workload container.
Signed-off-by: Danny Canter <danny@dcantah.dev>
This patch requests the OCI runtime to create a userns when the CRI
message includes such request.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
This allows user namespace support to progress, either by allowing
snapshotters to deal with ownership, or falling back to containerd doing
a recursive chown.
In the future, when snapshotters implement idmap mounts, they should
report the "remap-ids" capability.
Co-authored-by: Rodrigo Campos <rodrigoca@microsoft.com>
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Signed-off-by: David Leadbeater <dgl@dgl.cx>
All containers except the pause container, mount `/dev/shm" with flags
`nosuid,nodev,noexec`. So change mount options for pause container to
keep consistence.
This also helps to solve issues of failing to mount `/dev/shm` when
pod/container level user namespace is enabled.
Fixes: #6911
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Signed-off-by: Lei Wang <wllenyj@linux.alibaba.com>
This's an optimization to get rid of redundant `/dev/shm" mounts for pause container.
In `oci.defaultMounts`, there is a default `/dev/shm` mount which is redundant for
pause container.
Fixes: #6911
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Signed-off-by: Lei Wang <wllenyj@linux.alibaba.com>
CRI API has been updated to include a an optional `resources` field in the
LinuxPodSandboxConfig field, as part of the RunPodSandbox request.
Having sandbox level resource details at sandbox creation time will have
large benefits for sandboxed runtimes. In the case of Kata Containers,
for example, this'll allow for better support of SW/HW architectures
which don't allow for CPU/memory hotplug, and it'll allow for better
queue sizing for virtio devices associated with the sandbox (in the VM
case).
If this sandbox resource information is provided as part of the run
sandbox request, let's introduce a pattern where we will update the
pause container's runtiem spec to include this information in the
annotations field.
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Move `pkg/cri/opts.WithoutRunMount` function to `oci.WithoutRunMount`
so that it can be used without dependency on CRI.
Also add `oci.WithoutMounts(dests ...string)` for generality.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Newer golangci-lint needs explicit `//` separator. Otherwise it treats
the entire line (`staticcheck deprecated ... yet`) as a name.
https://golangci-lint.run/usage/false-positives/#nolint
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>