crun 1.4.3 as well as runc 1.1 both support to open bind-mounts before
dropping privileges, as they are inaccessible after switching to the
user namespace. So that is the minimum version to use with containerd
1.7.
Also, since containerd 2.0 we use idmap mounts for files mounted in the
container created by containerd (like etc/hostname, etc/hosts, etc.), so
in that case we require newer OCI runtimes too. However, as the kubelet
doesn't request idmap mounts for kube volumes, we can lower the kernel
version.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Userns requires idmap mounts or to opt-in for a slow and expensive
chown. As idmap mounts support for overlayfs was merged in 5.19, let's
add the slow_chown config for our CI.
The config is harmless to keep it in new kernels, as if idmap mounts is
supported, it will be just used. Whenever all our CI is run with kernels
>= 5.19, we can remove this setting.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
If we don't use idmap mounts, doing a chown per pod is very expensive:
it implies duplicating the container storage for the image for every pod
and the latency to start a new pod is affected too.
Let's make sure users are aware of this, by having them opt-in, for
snapshotters that we have a better solution (like overlayfs, that has
support for idmap mounts).
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Deprecate the pacakge, but suppress linting errors for now. This is to allow
backporting these changes to release branches, which may still need to transition.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This "soft" deprecates the package, but keeps the local uses of the package,
which can make backporting this to release-branches easier (we can
still move all uses in those branches as well though).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
When a shim process is unexpectedly killed in a way that was not initiated through containerd - containerd reports the pod as not ready but the containers as running. This results in kubelet repeatedly sending container kill requests that fail since containerd cannot connect to the shim.
Changes:
- In the container exit handler, treat `err: Unavailable` as if the container has already exited out
- When attempting to get a connection to the shim, if the controller isn't available assume that the shim has been killed (needs to be done since we have a separate exit handler that cleans up the reference to the shim controller - before kubelet has the chance to call StopPodSandbox)
Signed-off-by: Aditya Ramani <a_ramani@apple.com>
Remove `LimitNOFILE` from `containerd.service` to rely on the systemd v240 implicit default of `1024:524288`. On supported platforms with systemd prior to v240, packagers will patch the service with an explicit `LimitNOFILE=1024:524288`.
- `1024` soft limit is an implicit default, avoiding unexpected breakage. Software that needs a higher limit should request to raise the soft limit for its process.
- `524288` hard limit is an implicit default since systemd v240 and is adequate for most processes (_half of the historical limit from `fs.nr_open` of `1048576`_), while 4096 is the implicit default from the kernel (often too low).
- The hard limit may not exceed `fs.nr_open` (_which a value of `infinity` will resolve to_). On most systems with systemd v240 or newer, this will resolve to an excessive size of 2^30 (over 1 billion).
- When set to `infinity` (usually as the soft limit) software may experience significantly increased resource usage, resulting in a performance regression or runtime failures that are difficult to troubleshoot.
Signed-off-by: Brennan Kinney <5098581+polarathene@users.noreply.github.com>
crun 1.9 was just released with fixes and exposes idmap mounts support
via the "features" sub-command.
We use that feature to throw a clear error to users (if they request
idmap mounts and the OCI runtime doesn't support it), but also to skip
tests on CI when the OCI runtime doesn't support it.
Let's bump it so the CI runs the tests with crun.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
runc, as mandated by the runtime-spec, ignores unknown fields in the
config.json. This is unfortunate for cases where we _must_ enable that
feature or fail.
For example, if we want to start a container with user namespaces and
volumes, using the uidMappings/gidMappings field is needed so the
UID/GIDs in the volume don't end up with garbage. However, if we don't
fail when runc will ignore these fields (because they are unknown to
runc), we will just start a container without using the mappings and the
UID/GIDs the container will persist to volumes the hostUID/GID, that can
change if the container is re-scheduled by Kubernetes.
This will end up in volumes having "garbage" and unmapped UIDs that the
container can no longer change. So, let's avoid this entirely by just
checking that runc supports idmap mounts if the container we are about
to create needs them.
Please note that the "runc features" subcommand is only run when we are
using idmap mounts. If idmap mounts are not used, the subcommand is not
run and therefore this should not affect containers that don't use idmap
mounts in any way.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>