crun 1.4.3 as well as runc 1.1 both support to open bind-mounts before
dropping privileges, as they are inaccessible after switching to the
user namespace. So that is the minimum version to use with containerd
1.7.
Also, since containerd 2.0 we use idmap mounts for files mounted in the
container created by containerd (like etc/hostname, etc/hosts, etc.), so
in that case we require newer OCI runtimes too. However, as the kubelet
doesn't request idmap mounts for kube volumes, we can lower the kernel
version.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Userns requires idmap mounts or to opt-in for a slow and expensive
chown. As idmap mounts support for overlayfs was merged in 5.19, let's
add the slow_chown config for our CI.
The config is harmless to keep it in new kernels, as if idmap mounts is
supported, it will be just used. Whenever all our CI is run with kernels
>= 5.19, we can remove this setting.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
If we don't use idmap mounts, doing a chown per pod is very expensive:
it implies duplicating the container storage for the image for every pod
and the latency to start a new pod is affected too.
Let's make sure users are aware of this, by having them opt-in, for
snapshotters that we have a better solution (like overlayfs, that has
support for idmap mounts).
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Deprecate the pacakge, but suppress linting errors for now. This is to allow
backporting these changes to release branches, which may still need to transition.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This "soft" deprecates the package, but keeps the local uses of the package,
which can make backporting this to release-branches easier (we can
still move all uses in those branches as well though).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
While this is not strictly necessary as the default OCI config masks this
path, it is possible that the user disabled path masking, passed their
own list, or is using a forked (or future) daemon version that has a
modified default config/allows changing the default config.
Add some defense-in-depth by also masking out this problematic hardware
device with the AppArmor LSM.
Signed-off-by: Bjorn Neergaard <bjorn.neergaard@docker.com>
The ability to read these files may offer a power-based sidechannel
attack against any workloads running on the same kernel.
This was originally [CVE-2020-8694][1], which was fixed in
[949dd0104c496fa7c14991a23c03c62e44637e71][2] by restricting read access
to root. However, since many containers run as root, this is not
sufficient for our use case.
While untrusted code should ideally never be run, we can add some
defense in depth here by masking out the device class by default.
[Other mechanisms][3] to access this hardware exist, but they should not
be accessible to a container due to other safeguards in the
kernel/container stack (e.g. capabilities, perf paranoia).
[1]: https://nvd.nist.gov/vuln/detail/CVE-2020-8694
[2]: 949dd0104c
[3]: https://web.eece.maine.edu/~vweaver/projects/rapl/
Signed-off-by: Bjorn Neergaard <bjorn.neergaard@docker.com>
When a shim process is unexpectedly killed in a way that was not initiated through containerd - containerd reports the pod as not ready but the containers as running. This results in kubelet repeatedly sending container kill requests that fail since containerd cannot connect to the shim.
Changes:
- In the container exit handler, treat `err: Unavailable` as if the container has already exited out
- When attempting to get a connection to the shim, if the controller isn't available assume that the shim has been killed (needs to be done since we have a separate exit handler that cleans up the reference to the shim controller - before kubelet has the chance to call StopPodSandbox)
Signed-off-by: Aditya Ramani <a_ramani@apple.com>
Remove `LimitNOFILE` from `containerd.service` to rely on the systemd v240 implicit default of `1024:524288`. On supported platforms with systemd prior to v240, packagers will patch the service with an explicit `LimitNOFILE=1024:524288`.
- `1024` soft limit is an implicit default, avoiding unexpected breakage. Software that needs a higher limit should request to raise the soft limit for its process.
- `524288` hard limit is an implicit default since systemd v240 and is adequate for most processes (_half of the historical limit from `fs.nr_open` of `1048576`_), while 4096 is the implicit default from the kernel (often too low).
- The hard limit may not exceed `fs.nr_open` (_which a value of `infinity` will resolve to_). On most systems with systemd v240 or newer, this will resolve to an excessive size of 2^30 (over 1 billion).
- When set to `infinity` (usually as the soft limit) software may experience significantly increased resource usage, resulting in a performance regression or runtime failures that are difficult to troubleshoot.
Signed-off-by: Brennan Kinney <5098581+polarathene@users.noreply.github.com>
crun 1.9 was just released with fixes and exposes idmap mounts support
via the "features" sub-command.
We use that feature to throw a clear error to users (if they request
idmap mounts and the OCI runtime doesn't support it), but also to skip
tests on CI when the OCI runtime doesn't support it.
Let's bump it so the CI runs the tests with crun.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
runc, as mandated by the runtime-spec, ignores unknown fields in the
config.json. This is unfortunate for cases where we _must_ enable that
feature or fail.
For example, if we want to start a container with user namespaces and
volumes, using the uidMappings/gidMappings field is needed so the
UID/GIDs in the volume don't end up with garbage. However, if we don't
fail when runc will ignore these fields (because they are unknown to
runc), we will just start a container without using the mappings and the
UID/GIDs the container will persist to volumes the hostUID/GID, that can
change if the container is re-scheduled by Kubernetes.
This will end up in volumes having "garbage" and unmapped UIDs that the
container can no longer change. So, let's avoid this entirely by just
checking that runc supports idmap mounts if the container we are about
to create needs them.
Please note that the "runc features" subcommand is only run when we are
using idmap mounts. If idmap mounts are not used, the subcommand is not
run and therefore this should not affect containers that don't use idmap
mounts in any way.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
These utilities resulted in the platforms package to have the containerd
API as dependency. As this package is used in many parts of the code, as
well as external consumers, we should try to keep it light on dependencies,
with the potential to make it a standalone module.
These utilities were added in f3b7436b61,
which has not yet been included in a release, so skipping deprecation
and aliases for these.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This cleans up the platforms package from dependencies that are not strictly
needed. This is in preparation of making this package a separate module, which
can be shared by plugins, and containerd versions (as well as external consumers),
- Remove dependency on the errdefs package: most uses of these error
definitions were used internally, and other errors may not be useful
for external consumers as sentinel errors.
- ErrInvalidArgument may be a potential exception, although a look at
current uses of this package shows that there's no special handling of
invalid parameters vs other errors (all would boil down to "the passed
platform is invalid" (either the format, or parsing is not implemented
on a specific platform)
- Remove uses of the convenience "Platform" alias in favor of using the
upstream (from OCI spec). Consumers of this package can still use the
convenience alias, but make sure that function signatures do not imply
that it's a different type (which can cause confusion).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>