Support CRI configuration to allow for request-time rewrite rules
applicable only to the repository portion of resource paths when pulling
images. Because the rewrites are applied at request time, images
themselves will not be "rewritten" -- images as stored by CRI (and the
underlying containerd facility) will continue to present as normal.
As an example, if you use the following config for your containerd:
```toml
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
[plugins."io.containerd.grpc.v1.cri".registry]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
endpoint = ["https://registry-1.docker.io/v2"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io".rewrite]
"^library/(.*)" = "my-org/$1"
```
and then subsequently invoke `crictl pull alpine:3.13`, it will pull
content from `docker.io/my-org/alpine:3.13` but still show up as
`docker.io/library/alpine:3.13` in the `crictl images` listing.
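The effect of the rule above can be illustrated with a minimal Go sketch. It assumes the rule is an ordinary regular expression applied only to the repository portion of the path; `rewriteRepo` is a hypothetical helper written for illustration, not the function containerd uses.

```go
package main

import (
	"fmt"
	"regexp"
)

// rewriteRepo is a hypothetical helper: it applies a single rewrite rule
// (pattern -> replacement) to the repository portion of an image path.
// A repository that does not match the pattern is returned unchanged.
func rewriteRepo(repo, pattern, replacement string) (string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return "", err
	}
	return re.ReplaceAllString(repo, replacement), nil
}

func main() {
	out, err := rewriteRepo("library/alpine", "^library/(.*)", "my-org/$1")
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // my-org/alpine
}
```

Only the path used for the registry request changes; the name stored for the image is not touched by this step.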
This commit has been reworked from the original implementation. Rewrites
are now done when resolving instead of when building the request, so
that auth token scopes stored in the context properly reflect the
rewritten repository path. For the original implementation, see
06c4ea9baec2b278b8172a789bf601168292f645.
Ref: https://github.com/k3s-io/k3s/issues/11191#issuecomment-2455525773
Signed-off-by: Jacob Blain Christen <jacob@rancher.com>
Co-authored-by: Brad Davidson <brad.davidson@rancher.com>
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
use go1.23.8 as the default go version for running in CI and making
release binaries.
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
(cherry picked from commit 6f93c65f52c9e1c5e25595429fd50ce2e5da6843)
Signed-off-by: Derek McGowan <derek@mcg.dev>
- go1.23.8 (released 2025-04-01) includes security fixes to the net/http
package, as well as bug fixes to the runtime and the go command.
Ref: https://github.com/golang/go/issues?q=milestone%3AGo1.23.8+label%3ACherryPickApproved
- go1.24.2 (released 2025-04-01) includes security fixes to the net/http
package, as well as bug fixes to the compiler, the runtime, the go
command, and the crypto/tls, go/types, net/http, and testing packages.
Ref: https://github.com/golang/go/issues?q=milestone%3AGo1.24.2+label%3ACherryPickApproved
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
(cherry picked from commit 5629e9fff7de69a36f5f563d41966aa562866258)
Signed-off-by: Derek McGowan <derek@mcg.dev>
Due to the current 100% failure rate on arm64 with the current OS image, disable criu testing for now.
Signed-off-by: Phil Estes <estesp@amazon.com>
(cherry picked from commit 9ca6a7ee0aa0ea8added551dd16e00b2102fdea4)
Signed-off-by: Derek McGowan <derek@mcg.dev>
Prevent a panic in the Docker pusher pushWriter, by checking that
the pipe is non-nil before attempting to use it.
The panic was found by Moby issue #46746 (https://github.com/moby/moby/issues/46746).
With this fix the panic no longer reproduces.
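A minimal sketch of the kind of guard described above. The `pushWriter` type here is a simplified stand-in, and returning a dedicated error (`errNoPipe`) on a nil pipe is an assumption made for illustration; the actual pusher may handle this case differently.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// pushWriter is a simplified stand-in for the real type.
type pushWriter struct {
	pipe *io.PipeWriter // nil until the upload request has been set up
}

// errNoPipe is a hypothetical error used only in this sketch.
var errNoPipe = errors.New("push writer: pipe not initialized")

// Write checks the pipe before using it, so a nil pipe yields an error
// instead of a nil pointer dereference.
func (pw *pushWriter) Write(p []byte) (int, error) {
	if pw.pipe == nil {
		return 0, errNoPipe
	}
	return pw.pipe.Write(p)
}

func main() {
	pw := &pushWriter{}
	_, err := pw.Write([]byte("data"))
	fmt.Println(err) // push writer: pipe not initialized
}
```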
Signed-off-by: Cesar Talledo <cesar.talledo@docker.com>
Don't produce `reference for unknown type: application/vnd.in-toto+json`
warning logs when pushing/fetching an image containing the attestation
manifests.
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
Before this patch, calling `image.Children` on an image built with
BuildKit would produce unnecessary `encountered unknown type
application/vnd.in-toto+json; children may not be fetched` debug logs,
because the media type is neither a known layer nor a known config type.
Make `image.Children` aware of the attestation layers so that it doesn't
attempt to traverse them.
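The idea can be sketched as a filter over descriptors. The `descriptor` type and the `traversable` helper are hypothetical stand-ins for illustration, not the code in `image.Children`.

```go
package main

import "fmt"

// descriptor is a trimmed-down stand-in for an OCI descriptor.
type descriptor struct {
	MediaType string
	Digest    string
}

const mediaTypeInToto = "application/vnd.in-toto+json"

// traversable drops in-toto attestation blobs so that a tree walk does
// not try to look up children for them (and does not log about them).
func traversable(descs []descriptor) []descriptor {
	out := make([]descriptor, 0, len(descs))
	for _, d := range descs {
		if d.MediaType == mediaTypeInToto {
			continue
		}
		out = append(out, d)
	}
	return out
}

func main() {
	descs := []descriptor{
		{MediaType: "application/vnd.oci.image.config.v1+json", Digest: "sha256:aaa"},
		{MediaType: mediaTypeInToto, Digest: "sha256:bbb"},
	}
	fmt.Println(len(traversable(descs))) // 1
}
```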
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
Fix the gRPC client dialer not using the timeout passed by the
containerd client timeout option.
Commit 63b4688175 replaced the usage of deprecated `grpc.DialContext`
with `grpc.NewClient`.
However, the `dialer.ContextDialer` relied on the context deadline to
propagate the timeout:
388fb336b0/vendor/google.golang.org/grpc/clientconn.go (L216)
This assumption is now broken, because `grpc.NewClient` doesn't do any
initial connection and defers it to the first RPC usage.
This commit passes the timeout via the `MinConnectTimeout` grpc
connection param, which will be applied to **every** connection attempt
(not just the first).
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
- go1.23.7 (released 2025-03-04) includes security fixes to the net/http
package, as well as bug fixes to cgo, the compiler, and the reflect,
runtime, and syscall packages. See the Go 1.23.7 milestone on our issue
tracker for details.
- go1.24.1 (released 2025-03-04) includes security fixes to the net/http
package, as well as bug fixes to cgo, the compiler, the go command, and
the reflect, runtime, and syscall packages. See the Go 1.24.1 milestone
on our issue tracker for details.
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
We changed the default setting for `enable_unprivileged_ports` and
`enable_unprivileged_icmp` in the CRI plugin in
https://github.com/containerd/containerd/pull/9348, but missed including
this change in the release notes.
Signed-off-by: Samuel Karp <samuelkarp@google.com>
use the shim bundled with cri-cni-containerd tar rather than using
the shim present on the host machine for running e2e
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
The remote content writer proxy already has the capability to break up
large files into multiple writes, but the current API doesn't recognize
when it's about to exceed the limits and attempts to send the data over
grpc in one message instead of breaking it into multiple messages.
This changes the behavior of `Write` to automatically break up the size
of the content based on the max send message size.
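A minimal sketch of the chunking behaviour, assuming a plain byte limit per message. `writeChunked` and its `send` callback are hypothetical; the real writer also has to account for message framing overhead, which is ignored here.

```go
package main

import "fmt"

// writeChunked passes p to send in pieces of at most maxSize bytes and
// returns the number of bytes handed over successfully.
func writeChunked(p []byte, maxSize int, send func([]byte) error) (int, error) {
	if maxSize <= 0 {
		return 0, fmt.Errorf("invalid max message size %d", maxSize)
	}
	written := 0
	for len(p) > 0 {
		n := len(p)
		if n > maxSize {
			n = maxSize
		}
		if err := send(p[:n]); err != nil {
			return written, err
		}
		written += n
		p = p[n:]
	}
	return written, nil
}

func main() {
	var sizes []int
	n, err := writeChunked(make([]byte, 10), 4, func(b []byte) error {
		sizes = append(sizes, len(b))
		return nil
	})
	fmt.Println(n, err, sizes) // 10 <nil> [4 4 2]
}
```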
Signed-off-by: Jonathan A. Sternberg <jonathan.sternberg@docker.com>
(cherry picked from commit f25f36c334144d87233e06b0de90522ebd97e144)
Previously, PluginInfo was called with task options as the primary
value, resulting in opts.BinaryName being omitted. Consequently, the
containerd-shim-runc-v2 fell back to the system's runc binary in the
PATH rather than the explicitly specified one. This change inverts the
option fallback by preferring runtime options over task options,
ensuring the correct binary is used for the PluginInfo request.
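The inverted fallback can be sketched as follows. The `options` type and `pickOptions` helper are simplified stand-ins, and the binary path is made up for the example.

```go
package main

import "fmt"

// options is a simplified stand-in for the runc options message.
type options struct {
	BinaryName string
}

// pickOptions prefers runtime options and falls back to task options
// only when no runtime options are set.
func pickOptions(runtimeOpts, taskOpts *options) *options {
	if runtimeOpts != nil {
		return runtimeOpts
	}
	return taskOpts
}

func main() {
	runtimeOpts := &options{BinaryName: "/opt/custom/runc"} // hypothetical path
	taskOpts := &options{}                                  // BinaryName not set
	fmt.Println(pickOptions(runtimeOpts, taskOpts).BinaryName) // /opt/custom/runc
}
```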
Closes: https://github.com/containerd/containerd/issues/11169
Signed-off-by: Jose Fernandez <josef@netflix.com>
Reviewed-by: Erikson Tung <etung@netflix.com>
This is the fifth patch release in the 1.2.z series of runc. It
primarily fixes an issue caused by an upstream systemd bug.
There was a regression in systemd v230 which made the way we define device
rule restrictions require a systemctl daemon-reload for our transient
units. This caused issues for workloads using NVIDIA GPUs. Work around the
upstream regression by re-arranging how the unit properties are defined.
Dependency github.com/cyphar/filepath-securejoin is updated to v0.4.1,
to allow projects that vendor runc to bump it as well.
CI: fixed criu-dev compilation.
Dependency golang.org/x/net is updated to 0.33.0.
diff: opencontainers/runc@v1.2.4...v1.2.5
Signed-off-by: Austin Vazquez <macedonv@amazon.com>
Block the synchronization of registering NRI plugins during
CRI events to avoid the plugin ending up in an inconsistent
starting state after initial sync (missing pods, containers
or missed events for some pods or containers).
Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
- go1.23.5 (released 2025-01-16) includes security fixes to the
crypto/x509 and net/http packages, as well as bug fixes to the compiler,
the runtime, and the net package. See the Go 1.23.5 milestone on our
issue tracker for details.
- go1.22.11 (released 2025-01-16) includes security fixes to the
crypto/x509 and net/http packages, as well as bug fixes to the runtime.
See the Go 1.22.11 milestone on our issue tracker for details.
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
This function has been moved to prevent an unintended dependency on CDI.
Signed-off-by: Derek McGowan <derek@mcg.dev>
(cherry picked from commit bdc847f1eb535a6728b6db3f2619d2a5ed0edbb9)
Signed-off-by: Derek McGowan <derek@mcg.dev>
The CDI device injection spec opt was mistakenly added to the OCI
package which brought in an unintended dependency on CDI and its
transitive dependencies.
Signed-off-by: Derek McGowan <derek@mcg.dev>
(cherry picked from commit e20f7f4a2425c005d85855abfd4556d7b4ccbf87)
Signed-off-by: Derek McGowan <derek@mcg.dev>
The cri image service init has a bug where, after getting FSPath
for snapshotter_i, it stores it under defaultSnapshotter instead
of snapshotter_i.
Also make a few other refactors:
1. Dedup the snapshotRoot loading for defaultSnapshotter
2. Remove some unnecessary logic in RuntimePlatforms for-loop
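The keying bug described above can be sketched like this. `snapshotRoots` and the `fsPath` lookup are hypothetical stand-ins for the init code, and the paths are invented for the example.

```go
package main

import "fmt"

// snapshotRoots maps every configured snapshotter to its root path.
// fsPath stands in for however the root path of a snapshotter is found.
func snapshotRoots(names []string, fsPath func(string) string) map[string]string {
	roots := make(map[string]string, len(names))
	for _, name := range names {
		// The bug was the equivalent of using the default snapshotter's
		// name as the key here, which overwrote one entry on every
		// iteration and left the other snapshotters without a root.
		roots[name] = fsPath(name)
	}
	return roots
}

func main() {
	roots := snapshotRoots(
		[]string{"overlayfs", "native"},
		func(name string) string { return "/var/lib/containerd/" + name },
	)
	fmt.Println(len(roots), roots["native"]) // 2 /var/lib/containerd/native
}
```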
Signed-off-by: Jin Dong <djdongjin95@gmail.com>
This is the fourth patch release of the 1.2.z release branch of runc. It
includes a fix for a regression introduced in 1.2.0 related to the
default device list.
- Re-add tun/tap devices to built-in allowed devices lists.
In runc 1.2.0 we removed these devices from the default allow-list
(which were added seemingly by accident early in Docker's history) as
a precaution in order to try to reduce the attack surface of device
inodes available to most containers. At the time we thought
that the vast majority of users using tun/tap would already be
specifying what devices they need (such as by using --device with
Docker/Podman) as opposed to doing the mknod manually, and thus
there would've been no user-visible change.
Unfortunately, it seems that this regressed a noticeable number of
users (and not all higher-level tools provide easy ways to specify
devices to allow) and so this change needed to be reverted. Users
that do not need these devices are recommended to explicitly disable
them by adding deny rules in their container configuration.
diff: https://github.com/opencontainers/runc/compare/v1.2.3...v1.2.4
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
Fix issue 11228
`ctr images import --all-platforms` w/o `--local` was failing due to
`unable to initialize unpacker: no unpack platforms defined` error.
W/ `--local`, it unpacks the layers for the strict-default platform.
Now `ctr images import --all-platforms` w/o `--local` unpacks the layers
for the non-strict default platform.
This behavior still differs from `--local`.
i.e., on an arm64 host, arm/v{5,6,7} layers are unpacked too.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
These dependencies were updated to "master" in some modules we depend on,
but have no code-changes since their last release. Unfortunately, this also
causes a ripple effect, forcing all users of the containerd module to also
update these dependencies to an unreleased / un-tagged version.
Neither of these dependencies is likely to do a new release in the near future,
so exclude these versions so that we can downgrade to the current release.
For additional details, see [this PR][1] and links mentioned in it.
[1]: https://github.com/kubernetes-sigs/kustomize/pull/5830#issuecomment-2569960859
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This fixes compatibility with alpine 3.21 and file 5.46+
- Fix additional possible `xx-cc`/`xx-cargo` compatibility issue with Alpine 3.21
- Support for Alpine 3.21
- Fix `xx-verify` with `file` 5.46+
- Fix possible error taking lock in `xx-apk` in latest Alpine without `coreutils`
full diff: https://github.com/tonistiigi/xx/compare/v1.2.1...v1.6.1
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
When kubelet enables the UserNamespaceSupport feature gate, kubelet always
uses non-empty UsernsOptions to set up pods. In this case, the gVisor shim is
unable to parse runc.Option, so it is unable to start the container.
This change avoids adding IoOwner options if the UsernsOptions is
for node level. Since gVisor doesn't have the features subcommand yet, CRI status
will report that the gVisor runtime doesn't support user namespaces. So this is
a workaround to avoid a compatibility issue.
REF: #11091
Signed-off-by: Wei Fu <fuweid89@gmail.com>
This is the third patch release of the 1.2.z release branch of runc. It
primarily fixes some minor regressions introduced in 1.2.0.
- Fixed a regression in use of securejoin.MkdirAll, where multiple
runc processes racing to create the same mountpoint in a shared rootfs
would result in spurious EEXIST errors. In particular, this regression
caused issues with BuildKit.
- Fixed a regression in eBPF support for pre-5.6 kernels after upgrading
Cilium's eBPF library version to 0.16 in runc.
full diff: https://github.com/opencontainers/runc/compare/v1.2.2...v1.2.3
release notes: https://github.com/opencontainers/runc/releases/tag/v1.2.3
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 981414521baf578a313c7b7af034ade6cb92b10d)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Jin Dong <djdongjin95@gmail.com>
(cherry picked from commit 288001f68c5fd34cfbdc7284f14375a3762b8ff4)
Signed-off-by: Jin Dong <djdongjin95@gmail.com>
The containerd-shim creates pipes and passes them to the init container as
stdin, stdout, and stderr for logging purposes. By default, these pipes are
owned by the root user (UID/GID: 0/0). The init container can access them
directly through inheritance.
However, if the init container attempts to open any files pointing to these
pipes (e.g., /proc/1/fd/2, /dev/stderr), it will encounter a permission issue
since it is not the owner. To avoid this, we need to align the ownership of
the pipes with the init process.
Fixes: #10598
Signed-off-by: Wei Fu <fuweid89@gmail.com>
- go1.23.3 (released 2024-11-06) includes fixes to the linker, the
runtime, and the net/http, os, and syscall packages. See the
Go 1.23.3 milestone on our issue tracker for details.
- go1.22.9 (released 2024-11-06) includes fixes to the linker. See
the Go 1.22.9 milestone on our issue tracker for details.
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
No code change.
k8s.io/cri-api was accidentally updated to a non-stable version
v0.32.0-alpha.0 in PR 10552.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
ListPids may not pick up the sh subprocess yet when it is first run. To
make this test more resilient, retry fetching the processes if only a
single pid is found for a short time.
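A minimal sketch of such a retry loop. `waitForPids` and the fake lister in `main` are hypothetical; the attempt count and delay are arbitrary values chosen for the example, not the ones used in the test.

```go
package main

import (
	"fmt"
	"time"
)

// waitForPids polls list until it reports more than one pid or the
// attempts run out. It returns the last result seen.
func waitForPids(list func() []int, attempts int, delay time.Duration) []int {
	var pids []int
	for i := 0; i < attempts; i++ {
		pids = list()
		if len(pids) > 1 {
			return pids
		}
		time.Sleep(delay)
	}
	return pids
}

func main() {
	calls := 0
	// Fake lister: the subprocess only shows up on the third call.
	list := func() []int {
		calls++
		if calls < 3 {
			return []int{1}
		}
		return []int{1, 2}
	}
	pids := waitForPids(list, 5, 10*time.Millisecond)
	fmt.Println(len(pids), calls) // 2 3
}
```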
Signed-off-by: Derek McGowan <derek@mcg.dev>
- CRI support for user namespaces (PR 8803)
- CRI support for recursive read-only mounts (PR 9787)
- CDI is now enabled by default (PR 9621)
Co-authored-by: Samuel Karp <me@samuelkarp.com>
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
- combine consecutive "WithField" calls to "WithFields", as multiple
calls are known to be expensive.
- include a "snapshotter" field in logs to allow correlating actions
with specific snapshotters.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Prior to this commit, `/etc/containerd/config.toml` with no version
was parsed as version 3.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
A contentstore can be created on top of a read-only path and
should not fail unless there is an attempt to write into it.
Currently this fails because a new ingest directory is always
created, meaning for example that you can't create a store to
read blobs from an OCI layout without it contaminating the OCI
layout files.
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
crun is usually used on Fedora, RHEL, and similar distros.
So it makes more sense to run crun tests on Fedora.
Ubuntu jobs are removed, because inflating the number of the jobs will result
in making the flakiness rate much worse.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
While some warnings were available in earlier versions, the first
"complete" implementation was in 1.7.12 and 1.6.27.
https://github.com/containerd/containerd/issues/9312 tracks that initial
set of warnings.
Signed-off-by: Samuel Karp <samuelkarp@google.com>
See the pull request description for the rationale. Added a unit test to
demonstrate the difference this change makes.
Signed-off-by: Ray Burgemeestre <rayb@nvidia.com>
Uses the new github.com/containerd/errdefs/pkg module which is intended
to hold less stable utility functions separately from the stable
github.com/containerd/errdefs error types.
Includes temporary update to hcsshim until a release is cut there
Signed-off-by: Derek McGowan <derek@mcg.dev>
containerd launches runc, which communicates via dbus with systemd to start transient units. Thus, containerd should have an `After` dependency on `dbus.service` to prevent dbus from being shut down concurrently with containerd.
Signed-off-by: Benjamin Peterson <benjamin@engflow.com>
This reverts commit f0f1bfca07.
runc 1.1.15 appears to have increased the chances of causing OOMs for
containers with small memory limits. Revert the change in containerd
to unblock CI while the upstream runc issue is resolved.
Dependency-issue: https://github.com/opencontainers/runc/issues/4427
Signed-off-by: Samuel Karp <samuelkarp@google.com>
diff: https://github.com/opencontainers/runc/compare/v1.1.14...v1.1.15
Release notes:
- The -ENOSYS seccomp stub is now always generated for the native
architecture that runc is running on. This is needed to work around some
arguably specification-incompliant behaviour from Docker on architectures
such as ppc64le, where the allowed architecture list is set to null. This
ensures that we always generate at least one -ENOSYS stub for the native
architecture even with these weird configs. (#4391)
- On a system with an older kernel, reading /proc/self/mountinfo may skip some
entries; as a consequence, runc may not properly set mount propagation,
causing container mounts to leak onto the host mount namespace. (#2404, #4425)
- In order to fix performance issues in the "lightweight" bindfd protection
against [CVE-2019-5736], the temporary ro bind-mount of /proc/self/exe
has been removed. runc now creates a binary copy in all cases. (#4392, #2532)
Signed-off-by: Samuel Karp <samuelkarp@google.com>
This change upgrades the runner images in CI to macOS 13. macOS 12
runners are being deprecated.
See https://github.com/actions/runner-images/issues/10721 for more
information.
Signed-off-by: Austin Vazquez <macedonv@amazon.com>
Makes the pprof server a plugin and also gates by the `shim_tracing`
build tag (like otel is).
With this change, `net/http` is no longer a dependency in the shim.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This makes it so we don't need to import otelttrpc unless the shim is
compiled with the `shim_tracing` build tag.
This way otel is no longer compiled into the binary at all unless
`shim_tracing` is set.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
When an upstream client (e.g. kubelet) stops or restarts, the CRI
connection to the containerd gets interrupted which is treated as a
cancellation of context which subsequently cancels an ongoing operation,
including an image pull. This generally gets followed by containerd's
GC routine that tries to delete the prepared snapshots for the image
layer(s) corresponding to the image in the pull operation that got
cancelled. However, if the upstream client immediately retries (or
starts a new) image pull operation, containerd initiates a new image
pull and starts unpacking the image layers into snapshots. This may
create a race condition: the GC routine (corresponding to the failed
image pull operation) trying to clean up the same snapshot that the new
image pull operation is preparing, thus leading to the "parent snapshot
does not exist: not found" error.
Race Condition Scenario:
Assume an image consisting of 2 layers (L1 and L2, L1 being the bottom
layer) that are supposed to get unpacked into snapshots S1 and S2
respectively.
During an image pull operation, containerd unpacks(L1) which involves
Stat()'ing the chainID. This Stat() fails as the chainID does not
exist and Prepare(L1) gets called. Once S1 gets prepared, containerd
processes L2 - unpack(L2) which again involves Stat()'ing the chainID
which fails as the chainID for S2 does not exist which results in the
call to Prepare(L2). However, if the image pull operation gets
cancelled before Prepare(L2) is called, then the GC routine tries to
clean up S1.
When the image pull operation is retried by the upstream client,
containerd follows the same series of operations. unpack(L1) gets
called which then calls Stat(chainID) for L1. However, this time,
Stat(L1) succeeds as S1 already exists (from the previous image pull
operation) and thus containerd goes to the next iteration to
unpack(L2). Now, GC cleans up S1 and when Prepare(L2) gets called, it
returns back the "parent snapshot does not exist: not found" error.
Fix:
Removing the "Stat() + early return" fixes the race condition. Now
during the image pull operation corresponding to the client retry,
although the chainID (for L1) already exists, containerd does not
return early and goes on to Prepare(L1). Since L1 is already prepared,
it adds a new lease to S1 and then returns `ErrAlreadyExists`. This
new lease prevents GC from cleaning up S1 when containerd processes
L2 (unpack(L2) -> Prepare(L2)).
Fixes: https://github.com/containerd/containerd/issues/3787
Signed-off-by: Saket Jajoo <saketjajoo@google.com>
This PR adds the trap statement in the install runc script to clean
up the temporary files and ensure we are not leaving them.
Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
This is needed so we can build the runc shim without grpc as a
transitive dependency.
With this change the runc shim binary went from 14MB to 11MB.
The RSS from an idle shim went from about 17MB to 14MB (back around
where it was in 1.7).
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Core should not have a dependency on API types.
This was causing a transitive dependency on grpc when importing the core
snapshots package.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
pkg/oci is a general utility package with dependency chains that are
unnecessary for the shim.
The shim only actually used it for a convenience function for reading
an oci spec file.
Instead of pulling in those deps just re-implement that internally in
the shim command.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This adds trace context propagation over the grpc/ttrpc calls to a shim.
It also adds the otlp plugin to the runc shim so that it will send
traces to the configured tracer (which is inherited from containerd's
config).
It doesn't look like this is adding any real overhead to the runc shim's
memory usage, however it does add 2MB to the binary size.
As such this is gated by a build tag `shim_tracing`
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
However, when an image has multiple tags, the image originally obtained may not be the one actually specified by the user.
Starting from cri-api v0.28.0, a UserSpecifiedImage field is added to ImageSpec.
It is more appropriate to use UserSpecifiedImage.
Signed-off-by: jinda.ljd <jinda.ljd@alibaba-inc.com>
The detached mount is less likely to fail in our case, but if we see any
failure to unmount, we should just skip the removal of directories.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Using os.RemoveAll() is quite risky: if the unmount failed, we
can delete files from the container rootfs. In fact, we were doing just
that.
Let's use os.Remove() to make sure we only delete empty dirs.
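The difference can be demonstrated with a small Go sketch. The `demo` function is written for illustration only: a temporary directory that still contains a file stands in for a rootfs whose unmount failed.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// demo builds a directory that still contains a file and reports whether
// os.Remove refused to delete it and whether the file survived.
func demo() (bool, bool, error) {
	dir, err := os.MkdirTemp("", "rootfs-")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir) // cleanup of the demo's own temp dir only

	file := filepath.Join(dir, "data")
	if err := os.WriteFile(file, []byte("keep me"), 0o600); err != nil {
		return false, false, err
	}

	rmErr := os.Remove(dir) // fails because the directory is not empty
	_, statErr := os.Stat(file)
	return rmErr != nil, statErr == nil, nil
}

func main() {
	refused, survived, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(refused, survived) // true true
}
```

With os.RemoveAll() in place of os.Remove(), the same call would have deleted the file along with the directory.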
Big kudos to @mbaynton for reporting this issue with a lot of detail,
nailing it down to containerd lines of code and showing all the log
lines to understand the big picture.
Fixes: #10704
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Overlayfs needs to do an idmap mount of each layer and the cleanup
function just unmounts and deletes the directories. However, when the
resource is busy, the umount fails.
Let's make the unmount detached so the unmount will eventually be done
when it's not busy anymore. Also, making it detached solves the issues with
the unmount failing because it is busy.
Big kudos to @mbaynton for reporting this issue with a lot of detail,
nailing it down to containerd lines of code and showing all the log
lines to understand the big picture.
Fixes: #10704
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Bumps the golang-x group with 1 update in the / directory: [golang.org/x/mod](https://github.com/golang/mod).
Updates `golang.org/x/mod` from 0.20.0 to 0.21.0
- [Commits](https://github.com/golang/mod/compare/v0.20.0...v0.21.0)
---
updated-dependencies:
- dependency-name: golang.org/x/mod
dependency-type: direct:production
update-type: version-update:semver-minor
dependency-group: golang-x
...
Signed-off-by: dependabot[bot] <support@github.com>
Motivation:
For pod-level user namespaces, it's impossible to force the container runtime
to join an existing network namespace after creating a new user namespace.
According to the capabilities section in [user_namespaces(7)][1], a network
namespace created by containerd is owned by the root user namespace. When the
container runtime (like runc or crun) creates a new user namespace, it becomes
a child of the root user namespace. Processes within this child user namespace
are not permitted to access resources owned by the parent user namespace.
If the network namespace is not owned by the new user namespace, the container
runtime will fail to mount /sys due to the [sysfs: Restrict mounting sysfs][2]
patch.
Referencing the [cap_capable][3] function in Linux, a process can access a
resource if:
* The resource is owned by the process's user namespace, and the process has
the required capability.
* The resource is owned by a child of the process's user namespace, and the
owner's user namespace was created by the process's UID.
In the context of pod-level user namespaces, the CRI plugin delegates the
creation of the network namespace to the container runtime when running the
pause container. After the pause container is initialized, the CRI plugin pins
the pause container's network namespace into `/run/netns` and then executes
the `CNI_ADD` command over it.
However, if the pause container is terminated during the pinning process, the
CRI plugin might encounter a PID cycle, leading to the `CNI_ADD` command
operating on an incorrect network namespace.
Moreover, rolling back the `RunPodSandbox` API is complex due to the delegation
of network namespace creation. As highlighted in issue #10363, the CRI plugin
can lose IP information after a containerd restart, making it challenging to
maintain robustness in the RunPodSandbox API.
Solution:
Allow containerd to create a new user namespace and then create the network
namespace within that user namespace. This way, the CRI plugin can force the
container runtime to join both the user namespace and the network namespace.
Since the network namespace is owned by the newly created user namespace,
the container runtime will have the necessary permissions to mount `/sys` on
the container's root filesystem. As a result, delegation of network namespace
creation is no longer needed.
NOTE:
* The CRI plugin does not need to pin the newly created user namespace as it
does with the network namespace, because the kernel allows retrieving a user
namespace reference via [ioctl_ns(2)][4]. As a result, the podsandbox
implementation can obtain the user namespace using the `netnsPath` parameter.
[1]: <https://man7.org/linux/man-pages/man7/user_namespaces.7.html>
[2]: <7dc5dbc879>
[3]: <2c85ebc57b/security/commoncap.c (L65)>
[4]: <https://man7.org/linux/man-pages/man2/ioctl_ns.2.html>
Signed-off-by: Wei Fu <fuweid89@gmail.com>
- https://github.com/golang/go/issues?q=milestone%3AGo1.23.1+label%3ACherryPickApproved
- full diff: https://github.com/golang/go/compare/go1.23.0...go1.23.1
These minor releases include 3 security fixes following the security policy:
- go/parser: stack exhaustion in all Parse* functions
Calling any of the Parse functions on Go source code which contains
deeply nested literals can cause a panic due to stack exhaustion.
This is CVE-2024-34155 and Go issue https://go.dev/issue/69138.
- encoding/gob: stack exhaustion in Decoder.Decode
Calling Decoder.Decode on a message which contains deeply nested
structures can cause a panic due to stack exhaustion.
This is a follow-up to CVE-2022-30635.
Thanks to Md Sakib Anwar of The Ohio State University for reporting
this issue.
This is CVE-2024-34156 and Go issue https://go.dev/issue/69139.
- go/build/constraint: stack exhaustion in Parse
Calling Parse on a "// +build" build tag line with deeply nested
expressions can cause a panic due to stack exhaustion.
This is CVE-2024-34158 and Go issue https://go.dev/issue/69141.
View the release notes for more information:
https://go.dev/doc/devel/release#go1.23.1
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This issue was caused by a race between init exits and new exec process
tracking inside the shim. The test operates by controlling the time
between when the shim invokes "runc exec" and when the actual "runc
exec" is triggered. This allows validating that races for shim state
tracking between pre- and post-start of the exec process do not exist.
Relates to https://github.com/containerd/containerd/issues/10589
Signed-off-by: Samuel Karp <samuelkarp@google.com>
This commit rewrites and simplifies a lot of this logic to reduce its
complexity, and also handles the case where the container doesn't have
its own pid-namespace, which means that we're not guaranteed to receive
the init exit last.
This is achieved by replacing `s.pendingExecs` with `s.runningExecs`,
for which both (previously) pending and de facto running execs are
considered.
The new exit handling logic can be summed up by:
- when we receive an init exit, stash it in `s.containerInitExit`,
and if a container's init process has exited, refuse new execs.
- (if the container does not have its own pidns) kill all running
processes (if the container has a private pid-namespace, then all
processes will be dead already).
- wait for the container's running exec count (which includes execs
which have been started but might still early exit) to get to 0.
- publish the stashed away init exit.
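The ordering described in the list above can be modelled with a small Go sketch. The `tracker` type and its methods are hypothetical and much simpler than the shim's state; in particular, the step that kills remaining processes in a shared pid-namespace is left out.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// tracker is a simplified, hypothetical model of per-container state.
type tracker struct {
	mu            sync.Mutex
	runningExecs  int   // execs that were started and have not exited yet
	initExit      *int  // stashed init exit status, nil while init runs
	initPublished bool  // true once the stashed init exit was published
	published     []int // exit statuses published so far
}

var errInitExited = errors.New("container init process has exited")

// startExec registers a new exec and refuses it once init has exited.
func (tr *tracker) startExec() error {
	tr.mu.Lock()
	defer tr.mu.Unlock()
	if tr.initExit != nil {
		return errInitExited
	}
	tr.runningExecs++
	return nil
}

// execExited publishes an exec exit and, if it was the last running
// exec, the stashed init exit as well.
func (tr *tracker) execExited(status int) {
	tr.mu.Lock()
	defer tr.mu.Unlock()
	tr.published = append(tr.published, status)
	tr.runningExecs--
	tr.maybePublishInit()
}

// initExited stashes the init exit; it is published only when no
// running execs remain.
func (tr *tracker) initExited(status int) {
	tr.mu.Lock()
	defer tr.mu.Unlock()
	tr.initExit = &status
	tr.maybePublishInit()
}

// maybePublishInit must be called with tr.mu held.
func (tr *tracker) maybePublishInit() {
	if tr.initExit != nil && !tr.initPublished && tr.runningExecs == 0 {
		tr.published = append(tr.published, *tr.initExit)
		tr.initPublished = true
	}
}

func main() {
	tr := &tracker{}
	_ = tr.startExec()
	tr.initExited(137)
	fmt.Println(tr.startExec()) // container init process has exited
	tr.execExited(0)
	fmt.Println(tr.published) // [0 137]
}
```

The init exit is always the last status published, regardless of the order in which the exits were observed.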
Signed-off-by: Laura Brehm <laurabrehm@hey.com>
diff: https://github.com/opencontainers/runc/compare/v1.1.13...v1.1.14
Release Notes:
- Fix CVE-2024-45310, a low-severity attack that allowed
maliciously configured containers to create empty files and directories on
the host.
- Add support for Go 1.23.
- Revert "allow overriding VERSION value in Makefile" and add EXTRA_VERSION.
- rootfs: consolidate mountpoint creation logic.
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
The runc task state machine prevents execs from being created after the
init process has exited, but there are no guards against starting a
created exec after the init process has exited. That leaves a small
window for starting an exec to race our handling of the init process
exiting. Normally this is not an issue in practice: the kernel will
atomically kill all processes in a PID namespace when its "init" process
terminates, and will not allow new processes to fork(2) into the PID
namespace afterwards. Therefore the racing exec is guaranteed by the
kernel to not be running after the init process terminates. On the other
hand, when the container does not have a private PID namespace (i.e. the
container's init process is not the "init" process of the container's
PID namespace), the kernel does not automatically kill other container
processes on init exit and will happily allow runc to start an exec
process at any time. It is the runc shim's responsibility to clean up
the container when the init process exits in this situation by killing
all the container's remaining processes. Block execs from being started
after the container's init process has exited to prevent the processes
from leaking, and to avoid violating the task service's assumption that
an exec can be running iff the init process is also running.
Signed-off-by: Cory Snider <csnider@mirantis.com>
After these changes, to add a Darwin bind-mount implementation one only needs to:
* Adjust the HasBindMounts definition in mount.go
* Provide an implementation in mount_darwin.go
There was no consensus in https://github.com/containerd/containerd/pull/8789 on adding a dependency on bindfs, which seems to be the only working solution for bind-mounts on Darwin as of today; that is why the actual implementation is not added in the current PR.
As a bonus, Linux FUSE-related code was moved to a separate file and possibly could be reused on FreeBSD, though this needs testing.
Signed-off-by: Marat Radchenko <marat@slonopotamus.org>
It's not true that `s.mu` needs to be held when calling
`handleProcessExit`, and indeed hasn't been the case for a
while – see 892dc54bd2.
Signed-off-by: Laura Brehm <laurabrehm@hey.com>
During removal of the container a stat value might be reported as zero; in this case the calculation could end up with an extremely large number. If the cumulative stat decreases, report zero.
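A minimal sketch of the clamping, assuming the stats are unsigned
counters; `cumulativeDelta` is a hypothetical helper name and not the
function changed by this commit.

```go
package main

import "fmt"

// cumulativeDelta returns cur-prev, or 0 when the counter went backwards
// (for example because cur reads as zero while the container is removed).
// Plain unsigned subtraction would wrap around to a huge value instead.
func cumulativeDelta(prev, cur uint64) uint64 {
	if cur < prev {
		return 0
	}
	return cur - prev
}

func main() {
	fmt.Println(cumulativeDelta(100, 250)) // normal increase
	fmt.Println(cumulativeDelta(100, 0))   // counter dropped: clamp to zero
}
```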
Signed-off-by: James Sturtevant <jstur@microsoft.com>
The main reason is to improve the comment about pidfd in Go 1.23+.
While at it:
- avoid slice manipulation as we only need count;
- avoid repeating "/proc/self/fd".
Updates: #10345.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Update the local content plugin to register itself in a consistent way
as other plugins. This also allows the separate package to define its
own configuration more cleanly.
Signed-off-by: Derek McGowan <derek@mcg.dev>
The OpenSSF scorecard is complaining about these two dependencies being
installed without a patch version specified;
Warn: goCommand not pinned by hash: script/setup/install-dev-tools:27
Warn: goCommand not pinned by hash: script/setup/install-dev-tools:28
While the error indicates it expects a hash, it looks like it's fine
with other modules in the same file, the difference being that those
specify a full version, including path version, e.g.;
919beb1cf7/script/setup/install-dev-tools (L26)
This patch updates `protoc-gen-go` and `protoc-gen-go-grpc` to the latest
patch release for the specified versions.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
`disable_cgroup` was implemented in containerd/cri PR 970 (Nov 2018)
for supporting very early versions of Usernetes on cgroup v1 hosts,
when most distros were still not ready to support cgroup v2.
This configuration is no longer needed, as cgroup v2 delegation is
now supported on almost all distros.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
As per https://github.com/golang/go/issues/60529, printf-like calls with
non-constant format strings and no args give an error in govet.
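For illustration, the two usual ways to satisfy the check are sketched
below; the function names are hypothetical and not taken from the
containerd code touched here.

```go
package main

import (
	"errors"
	"fmt"
)

// describe builds an error from a message that is only known at run time.
// Calling fmt.Errorf(msg) directly is what go vet reports, because msg is a
// non-constant format string; passing it as an argument avoids that.
func describe(msg string) error {
	return fmt.Errorf("%s", msg)
}

// describePlain skips formatting entirely when no verbs are needed.
func describePlain(msg string) error {
	return errors.New(msg)
}

func main() {
	fmt.Println(describe("pull is 100% done"))
	fmt.Println(describePlain("pull is 100% done"))
}
```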
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
Commit 8437c567d8 migrated the use of the
userns package to the github.com/moby/sys/user module.
After further discussion with maintainers, it was decided to move the
userns package to a separate module, as it has no direct relation with
"user" operations (other than having "user" in its name).
This patch migrates our code to use the new module.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
When running the test on Ubuntu focal (kernel version 5.4), the
symlink for pidfd is anon_inode:[pidfd].
Updates: #10345
Signed-off-by: Shengjing Zhu <zhsj@debian.org>
Old shims do not implement containerd.task.v3.Task, but it can be
useful to use a new ctr with an older shim especially during upgrade
scenarios.
Signed-off-by: Samuel Karp <samuelkarp@google.com>
The v2 shim interface supports grouping, so a single shim can manage
multiple tasks. Prior to this change, the `shim state` command could
only query the state of the primary task (task that shares the same ID
as the shim).
Signed-off-by: Samuel Karp <samuelkarp@google.com>
runc v1.1.13 introduced an option to customize the version (as printed by the
`--version` flag) through a `VERSION` Make variable / environment variable
(see [1]).
This variable collided with the `VERSION` environment variable used by
containerd for the same purpose, which led to `runc` binaries being built
using the version of containerd;
runc --version
runc version 1.7.20
commit: v1.1.13-0-g58aa9203
...
This patch unsets the `VERSION` variable to prevent it from being
inherited and to bring back the previous behavior.
Before this patch:
docker build -t containerd-test -f contrib/Dockerfile.test .
docker run -it --rm --env VERSION=1.7.20 containerd-test sh -c 'script/setup/install-runc && /usr/local/sbin/runc --version'
# ....
HEAD is now at 58aa9203 VERSION: release 1.1.13
go build -trimpath "-buildmode=pie" -tags "seccomp" -ldflags "-X main.gitCommit=v1.1.13-0-g58aa9203 -X main.version=1.7.20 " -o runc .
install -D -m0755 runc /usr/local/sbin/runc
/go/src/github.com/containerd/containerd
runc version 1.7.20
commit: v1.1.13-0-g58aa9203
spec: 1.0.2-dev
go: go1.22.5
libseccomp: 2.5.4
With this patch:
docker build -t containerd-test -f contrib/Dockerfile.test .
docker run -it --rm --env VERSION=1.7.20 containerd-test sh -c 'script/setup/install-runc && /usr/local/sbin/runc --version'
# ....
HEAD is now at 58aa9203 VERSION: release 1.1.13
go build -trimpath "-buildmode=pie" -tags "seccomp" -ldflags "-X main.gitCommit=v1.1.13-0-g58aa9203 -X main.version=v1.1.13 " -o runc .
install -D -m0755 runc /usr/local/sbin/runc
/go/src/github.com/containerd/containerd
runc version v1.1.13
commit: v1.1.13-0-g58aa9203
spec: 1.0.2-dev
go: go1.22.5
libseccomp: 2.5.4
[1]: 6f4d975c40
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
There are a couple directories that get created under the default
state directory ("/run/containerd") even when containerd is configured
to use a different location for its state directory. Create the default
state directory even if containerd is configured to use a different
state directory location. This ensures pkg/shim and pkg/fifo won't create
the default state directory with incorrect permissions when calling
os.MkdirAll for their respective subdirectories.
Signed-off-by: Erikson Tung <etung@netflix.com>
Similar to container removal, the stop of a container should be a noop if
the container has not been found.
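A rough sketch of the idea, using a hypothetical in-memory `store` and
a hypothetical `errNotFound` sentinel rather than containerd's own
error types:

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel; containerd has its own error types.
var errNotFound = errors.New("not found")

// store is a hypothetical in-memory container store.
type store map[string]bool

// get returns a wrapped errNotFound when the container is unknown.
func (s store) get(id string) error {
	if !s[id] {
		return fmt.Errorf("container %q: %w", id, errNotFound)
	}
	return nil
}

// stopContainer treats a missing container as already stopped.
func stopContainer(s store, id string) error {
	if err := s.get(id); err != nil {
		if errors.Is(err, errNotFound) {
			return nil // nothing to stop
		}
		return err
	}
	// The actual stop logic would go here.
	return nil
}

func main() {
	s := store{"running": true}
	fmt.Println(stopContainer(s, "running"))
	fmt.Println(stopContainer(s, "missing"))
}
```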
Found during: https://github.com/kubernetes-sigs/cri-tools/pull/1536
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Similar to sandbox removal, the stop of a sandbox should be a noop if
the sandbox has not been found.
Found during: https://github.com/kubernetes-sigs/cri-tools/pull/1535
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
The userns package in libcontainer was integrated into the moby/sys/user
module at commit [3778ae603c706494fd1e2c2faf83b406e38d687d][1].
This patch deprecates the containerd fork of that package, and adds it as
an alias for the moby/sys/user/userns package.
[1]: 3778ae603c
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Some CRI config properties had removal postponed until v2.1 in
https://github.com/containerd/containerd/pull/9966. Update the
associated deprecation warnings to match the new removal version.
Signed-off-by: Samuel Karp <samuelkarp@google.com>
The /var/lib/containerd/io.containerd.grpc.v1.introspection/uuid file
stores a UUID to identify the particular containerd daemon responding to
requests. The file should either exist with a UUID, or not exist.
However, it has been observed that the file can be truncated with 0
bytes, which will then fail to be parsed as a valid UUID.
As a defensive practice, detect a 0-length file and overwrite with a new
UUID rather than failing.
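A sketch of the defensive check, using only the standard library.
`uuidFromContent` and `newUUID` are hypothetical helpers; reading and
writing the file, and validating non-empty content, are left out.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newUUID returns a random (version 4) UUID string.
func newUUID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // version 4
	b[8] = (b[8] & 0x3f) | 0x80 // RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

// uuidFromContent returns the UUID stored in data, or a fresh one when data
// is empty (the truncated-file case). regenerated tells the caller that the
// new value still has to be written back to disk.
func uuidFromContent(data []byte) (id string, regenerated bool, err error) {
	if len(data) == 0 {
		id, err = newUUID()
		return id, err == nil, err
	}
	return string(data), false, nil
}

func main() {
	id, regenerated, err := uuidFromContent(nil)
	fmt.Println(id, regenerated, err)
}
```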
Fixes: https://github.com/containerd/containerd/issues/10491
Signed-off-by: Samuel Karp <samuelkarp@google.com>
This PR ignores a new pidfd file descriptor that is introduced in
gotip (future 1.23) and should not be considered when detecting fd leaks.
Fixes #10345
Signed-off-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
There are a couple of spots where we know exactly how large
the destination buffer should be, so pre-size these to
avoid any reallocs to a higher capacity.
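As a generic illustration of the pattern (the `keysOf` helper is
hypothetical and not one of the call sites changed here):

```go
package main

import "fmt"

// keysOf copies the keys of m into a slice whose capacity is set up front,
// so append never has to grow the backing array.
func keysOf(m map[string]int) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	return keys
}

func main() {
	keys := keysOf(map[string]int{"a": 1, "b": 2, "c": 3})
	fmt.Println(len(keys), cap(keys))
}
```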
Signed-off-by: Danny Canter <danny@dcantah.dev>
This patch release includes just a fix to skip userns tests on hosts that
don't support the feature. See:
https://github.com/kubernetes-sigs/cri-tools/releases/tag/v1.30.1
This is needed for CI to work fine when we update to runc 1.2 (not yet
released). It is also a blocker for the final runc release to make sure
it works in all known downstreams. This makes it work fine here :)
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
slow_chown is safe to add on all kernels, and when running in old
kernels (as some CI distros on purpose are), we want the expensive
fallback.
Vagrant setup and others use this script to config containerd. This
fixes userns tests with runc 1.2.0-rc.2 when running with old kernels.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
When runc 1.2.0 is released, it will expose support for userns and
therefore the critest suite will run those tests. The thing is, runc
needs to be able to traverse the path to mount the rootfs on itself.
Let's just mark the paths from the BDIR upwards with +x permissions, so
the tests run fine. Containerd already makes sure that the paths below
(the ones it creates) have the right permissions and for the right
group, etc.
I've tested with runc 1.2.0-rc.2 and CI fails without this patch; with
this patch it works just fine.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
This functionality is not directly related to containerd and could move
to external package at some point.
Signed-off-by: Derek McGowan <derek@mcg.dev>
commit 149ca6880a updated the hcsshim
module to v0.12.4, but did not add a commit to also update the runhcs
binary version.
full diff: https://github.com/microsoft/hcsshim/compare/v0.12.3...v0.12.4
These versions are decoupled since 15b13fb3ea
to allow updating the binary version without updating the module, in cases
where the module doesn't require updates.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
A nil CRIImplementation field can cause a nil pointer dereference and
panic during startup recovery.
Prior to this change, the nri.API struct would have a nil cri
(CRIImplementation) field after nri.NewAPI until nri.Register was
called. Register is called mid-way through initialization of the CRI
plugin, but recovery for containers occurs prior to that. Container
recovery includes establishing new exit monitors for existing containers
that were discovered. When a container exits, NRI plugins are given the
opportunity to be notified about the lifecycle event, and this is done
by accessing that CRIImplementation field inside the nri.API. If a
container exits prior to nri.Register being called, access to the
CRIImplementation field can cause a panic.
Here's the call-path:
* The CRI plugin starts running
[here](ae71819c4f/pkg/cri/server/service.go (L222))
* It then [calls into](ae71819c4f/pkg/cri/server/service.go (L227))
`recover()` to recover state from previous runs of containerd
* `recover()` then attempts to recover all containers through
[`loadContainer()`](ae7d74b9e2/internal/cri/server/restart.go (L175))
* When `loadContainer()` finds a container that is still running, it waits
for the task (internal containerd object) to exit and sets up
[exit monitoring](ae7d74b9e2/internal/cri/server/restart.go (L391))
* Any exit that then happens must be
[handled](ae7d74b9e2/internal/cri/server/events.go (L145))
* Handling an exit includes
[deleting the Task](ae7d74b9e2/internal/cri/server/events.go (L188))
and specifying [`nri.WithContainerExit`](ae7d74b9e2/internal/cri/nri/nri_api_linux.go (L348))
to [notify](ae7d74b9e2/internal/cri/nri/nri_api_linux.go (L356))
any subscribed NRI plugins
* NRI plugins need to know information about the pod (not just the sandbox),
so before a plugin is notified the NRI API package
[queries the Sandbox Store](ae7d74b9e2/internal/cri/nri/nri_api_linux.go (L232))
through the CRI implementation
* The `cri` implementation member field in the `nri.API` struct is set as part of the
[`Register()`](ae7d74b9e2/internal/cri/nri/nri_api_linux.go (L66)) method
* The `nri.Register()` method is only called
[much further down in the CRI `Run()` method](ae71819c4f/pkg/cri/server/service.go (L279))
Signed-off-by: Samuel Karp <samuelkarp@google.com>
Implement calls to the fsverity kernel module, allowing containerd to
enable fsverity on blob data in the content store. This causes fsverity
to verify the integrity of blob data when the blob is read.
Signed-off-by: James Jenkins <James.Jenkins@ibm.com>
The behavior of this function is quite counter-intuitive, as it preserves
the delimiter in the result, and its use for external consumers would be
very limited.
Spec.Digest no longer uses this function, and it appears that BuildKit is
currently the only (publicly visible) external consumer of it.
This patch deprecates the function.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The behavior of this function is quite counter-intuitive, as it preserves
the delimiter in the result. This function should probably have been an
internal function, as its use for external consumers would be very limited,
but let's at least document the (surprising) behavior for those that are
considering using it.
It appears that BuildKit is currently the only (publicly visible) external
consumer of this function; I am planning to inline its functionality in
Spec.Digest() and to deprecate this function so that it can be removed.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
These were straight concatenations of strings; reduce some allocations by
removing fmt.Sprintf for this.
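The kind of change meant here, shown on a hypothetical `joinKey`
helper rather than on the actual call sites:

```go
package main

import "fmt"

// joinKey builds "<ns>/<name>" with plain concatenation.
// The equivalent fmt.Sprintf("%s/%s", ns, name) goes through the formatting
// machinery and typically allocates more.
func joinKey(ns, name string) string {
	return ns + "/" + name
}

func main() {
	fmt.Println(joinKey("k8s.io", "sandbox-1"))
}
```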
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The boltdb instance in metadata is only used for getting transactions
and can also be overridden via the context to have a wider control of the
transaction boundary. Using the transactor interface allows callers of
metadata to have more control of the transaction lifecycle.
Since boltdb must be fsync'ed on commit, operations which perform many
database operations can be costly and slow. While providing transactor
via context can be used to group together operations, it does not
provide a way to manage the commit fsyncs more globally.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Update the dependency and the indirect golang.org/x/net version to align
with containerd itself, and to prevent a vulnerability being detected.
We should keep the versions <= versions used by containerd 1.7 to prevent
forcing users of containerd 1.7 in combination with the latest version
of the API module from having to update all their dependencies, but
this update should likely be fine (and aligns with 1.7).
Before this:
Scanning your code and 254 packages across 15 dependent modules for known vulnerabilities...
=== Symbol Results ===
Vulnerability #1: GO-2024-2687
HTTP/2 CONTINUATION flood in net/http
More info: https://pkg.go.dev/vuln/GO-2024-2687
Module: golang.org/x/net
Found in: golang.org/x/net@v0.21.0
Fixed in: golang.org/x/net@v0.23.0
Example traces found:
#1: events/task_fieldpath.pb.go:85:20: events.TaskIO.Field calls fmt.Sprint, which eventually calls http2.ConnectionError.Error
#2: events/task_fieldpath.pb.go:85:20: events.TaskIO.Field calls fmt.Sprint, which eventually calls http2.ErrCode.String
#3: events/task_fieldpath.pb.go:85:20: events.TaskIO.Field calls fmt.Sprint, which eventually calls http2.FrameHeader.String
#4: events/task_fieldpath.pb.go:85:20: events.TaskIO.Field calls fmt.Sprint, which eventually calls http2.FrameType.String
#5: events/task_fieldpath.pb.go:85:20: events.TaskIO.Field calls fmt.Sprint, which eventually calls http2.Setting.String
#6: events/task_fieldpath.pb.go:85:20: events.TaskIO.Field calls fmt.Sprint, which eventually calls http2.SettingID.String
#7: events/task_fieldpath.pb.go:85:20: events.TaskIO.Field calls fmt.Sprint, which eventually calls http2.StreamError.Error
#8: services/content/v1/content_ttrpc.pb.go:272:35: content.ttrpccontentClient.Write calls ttrpc.Client.NewStream, which eventually calls http2.chunkWriter.Write
#9: events/task_fieldpath.pb.go:85:20: events.TaskIO.Field calls fmt.Sprint, which eventually calls http2.connError.Error
#10: events/task_fieldpath.pb.go:85:20: events.TaskIO.Field calls fmt.Sprint, which eventually calls http2.duplicatePseudoHeaderError.Error
#11: events/task_fieldpath.pb.go:85:20: events.TaskIO.Field calls fmt.Sprint, which eventually calls http2.headerFieldNameError.Error
#12: events/task_fieldpath.pb.go:85:20: events.TaskIO.Field calls fmt.Sprint, which eventually calls http2.headerFieldValueError.Error
#13: events/task_fieldpath.pb.go:85:20: events.TaskIO.Field calls fmt.Sprint, which eventually calls http2.pseudoHeaderError.Error
#14: events/task_fieldpath.pb.go:85:20: events.TaskIO.Field calls fmt.Sprint, which eventually calls http2.writeData.String
Your code is affected by 1 vulnerability from 1 module.
This scan also found 0 vulnerabilities in packages you import and 3
vulnerabilities in modules you require, but your code doesn't appear to call
these vulnerabilities.
Use '-show verbose' for more details.
After this:
govulncheck ./...
Scanning your code and 251 packages across 13 dependent modules for known vulnerabilities...
=== Symbol Results ===
No vulnerabilities found.
Your code is affected by 0 vulnerabilities.
This scan also found 0 vulnerabilities in packages you import and 3
vulnerabilities in modules you require, but your code doesn't appear to call
these vulnerabilities.
Use '-show verbose' for more details.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Unfortunately, this is a rather large diff, but perhaps worth a one-time
"rip off the bandaid" for v2. This patch removes the use of "gocontext"
as an alias for stdlib's "context", and uses "cliContext" for uses of
cli.Context.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The ParentIDs array in the Snapshot type is populated in the reverse order, i.e. the
immediate parent is at the 0th index and the oldest parent is at the last index. It can be
seen here:
https://github.com/containerd/containerd/blob/main/core/snapshots/storage/bolt.go#L492
When applying these layers, the parent layer at the last index should be applied first and
the parent layer at the 0th index should be applied last. However, the comment above the
Snapshot type says the exact opposite thing. This commit fixes that comment.
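In other words, a consumer has to walk ParentIDs backwards. A small
sketch, with a hypothetical `applyOrder` helper and made-up IDs:

```go
package main

import "fmt"

// applyOrder returns the parent IDs in the order the layers should be
// applied: oldest ancestor first, immediate parent last. The input follows
// the ParentIDs convention (immediate parent at index 0).
func applyOrder(parentIDs []string) []string {
	out := make([]string, 0, len(parentIDs))
	for i := len(parentIDs) - 1; i >= 0; i-- {
		out = append(out, parentIDs[i])
	}
	return out
}

func main() {
	fmt.Println(applyOrder([]string{"parent", "grandparent", "base"}))
}
```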
Signed-off-by: Amit Barve <ambarve@microsoft.com>
While the hook is intended to be used with logrus, we don't need to have
the direct import; use the aliases provided by the containerd/log module
instead.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Now that we're on runc v1.1.13, we no longer need to pin the
go version for runc to go1.21.
This reverts commit fef78c1024.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
full diff: https://github.com/opencontainers/runc/compare/v1.1.12...v1.1.13
Release notes:
* If building with Go 1.22.x, make sure to use 1.22.4 or a later version.
* Support go 1.22.4+.
* runc list: fix race with runc delete.
* Fix set nofile rlimit error.
* libct/cg/fs: fix setting rt_period vs rt_runtime.
* Fix a debug msg for user ns in nsexec.
* script/*: fix gpg usage wrt keyboxd.
* CI fixes and misc backports.
* Fix codespell warnings.
* Silence security false positives from golang/net.
* libcontainer: allow containers to make apps think fips is enabled/disabled for testing.
* allow overriding VERSION value in Makefile.
* Vagrantfile.fedora: bump Fedora to 39.
* ci/cirrus: rm centos stream 8.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Since Go 1.20, math/rand does not need explicit seeding:
https://go.dev/doc/go1.20#minor_library_changes
Go <= 1.19 is no longer supported due to EOL.
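For reference, code along these lines no longer needs a seeding call;
`pick` is a hypothetical helper used only for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pick returns a pseudo-random index in [0, n). No rand.Seed call is needed:
// since Go 1.20 the global generator is seeded randomly at program start.
func pick(n int) int {
	return rand.Intn(n)
}

func main() {
	fmt.Println(pick(10))
}
```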
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Debian has started building packages with user namespaces
to disable network access and similar isolation features. The
containerd package executes a unit test that fails in that
scenario, see https://bugs.debian.org/1070411
The code contains a conditional on whether it is running in a
user namespace. This commit expands the unit test to cover
this behavior; it was previously untested.
The easiest way to reproduce this issue is to prefix the test
invocation with 'unshare -nr go test [...]'
Signed-off-by: Reinhard Tartler <siretart@gmail.com>
ctr currently silently ignores several flags by default (without --local) and
the user can't know which flags are supported until they see the code.
This commit fixes ctr to return an explicit error when it finds an unsupported
flag.
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
Avoid running tests when a plugin fails to load and return the init
error from the plugin. This prevents the test failing later with an
unhelpful error and attempting to find the actual error in the daemon
logs.
Signed-off-by: Derek McGowan <derek@mcg.dev>
When no port is specified, allow falling back from 443 to 80 when
http is specified along with a TLS configuration.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Includes fix for a symlink race on remove.
Updates 1.21 to 1.21.11 for runc install which also includes the
symlink fix.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Commit 3c8469a782 removed uses of the api
types.Platform type from public interfaces, instead using the type from
the OCI image spec.
For convenience, it also introduced an alias in the platforms package.
While this alias allows packages that already import containerd's
platforms package (now a separate module), it may also cause confusion
(it's not clear that it's an alias for the OCI type), and for packages
that do not depend on containerd's platforms package / module may now
be resulting in an extra dependency.
Let's remove the use of this alias, and instead use the OCI type directly.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Before this, during a call to the docker resolver, we would generate
span wrappers for each HTTPRequest correctly, however, as the docker
resolver reaches out to the docker authorizer, it could create HTTP
requests (for fetching tokens) that would not be wrapped in any span.
This can result in rather confusing traces, e.g. something like:
remotes.docker.resolver.HTTPRequest
HTTP HEAD (fetch index, fails with 401)
HTTP GET (fetch token)
remotes.docker.resolver.HTTPRequest
HTTP HEAD (fetch index)
remotes.docker.resolver.HTTPRequest
HTTP GET (fetch manifest)
By adding a span into the FetchToken, this trace becomes a little easier
to consume:
remotes.docker.resolver.HTTPRequest
HTTP HEAD (fetch index, fails with 401)
remotes.docker.resolver.FetchToken
HTTP GET (fetch token)
remotes.docker.resolver.HTTPRequest
HTTP HEAD (fetch index)
remotes.docker.resolver.HTTPRequest
HTTP GET (fetch manifest)
Signed-off-by: Justin Chadwell <me@jedevc.com>
sandbox address should be in the form of
<ttrpc|grpc>+<unix|vsock|hvsock>://<uds-path|vsock-cid:vsock-port|uds-path:hvsock-port>
for example: ttrpc+hvsock:///run/test.hvsock:1024
or: grpc+vsock://1111111:1024
and the Stdin/Stdout/Stderr will add a `streaming_id` as a parameter of the url
result form is:
<ttrpc|grpc>+<unix|vsock|hvsock>://<uds-path|vsock-cid:vsock-port|uds-path:hvsock-port>?streaming_id=<stream-id>
for example ttrpc+hvsock:///run/test.hvsock:1024?streaming_id=111111
or grpc+vsock://1111111:1024?streaming_id=222222
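A sketch of how such addresses could be built and taken apart with
net/url. The helper names are hypothetical and this is not the parsing
code added by this commit.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// withStreamingID appends the streaming_id query parameter to a sandbox
// address of the form <proto>+<transport>://<address>.
func withStreamingID(addr, id string) (string, error) {
	u, err := url.Parse(addr)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("streaming_id", id)
	u.RawQuery = q.Encode()
	return u.String(), nil
}

// splitAddr returns the protocol (ttrpc or grpc), the transport (unix, vsock
// or hvsock) and the streaming_id, if any, of a sandbox address.
func splitAddr(addr string) (proto, transport, streamID string, err error) {
	u, err := url.Parse(addr)
	if err != nil {
		return "", "", "", err
	}
	proto, transport, ok := strings.Cut(u.Scheme, "+")
	if !ok {
		return "", "", "", fmt.Errorf("scheme %q is not <proto>+<transport>", u.Scheme)
	}
	return proto, transport, u.Query().Get("streaming_id"), nil
}

func main() {
	s, err := withStreamingID("grpc+vsock://1111111:1024", "222222")
	fmt.Println(s, err)
	fmt.Println(splitAddr("ttrpc+hvsock:///run/test.hvsock:1024?streaming_id=111111"))
}
```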
Signed-off-by: Abel Feng <fshb1988@gmail.com>
Go 1.22.3 release includes bug fixes for the core net/http package.
Full release notes: https://go.dev/doc/devel/release#go1.22.minor
Signed-off-by: Austin Vazquez <macedonv@amazon.com>
When a set of layers are provided to the unpacker, then the unpacker
should still fetch them regardless of whether they will be used for
unpack. The image handler filters are responsible for removing content
which is not intended to be fetched. Currently there is no way to use an
unpacker and also fetch all platforms.
Signed-off-by: Derek McGowan <derek@mcg.dev>
The remote sandbox controller may restart; the Wait call should be retried
if it fails with a gRPC disconnection error.
Signed-off-by: Abel Feng <fshb1988@gmail.com>
This also fixes the following warnings:
```
WARN [config_reader] The configuration option `run.skip-dirs` is deprecated, please use `issues.exclude-dirs`.
WARN [lintersdb] The name "vet" is deprecated. The linter has been renamed to: govet.
```
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
Fun times: In grpc 1.63 grpc.Dial and a few of the options we use (WithBlock) are
deprecated in favor of the no-IO variant NewClient. The uses in the integration tests
should be easy to swap however as they don't use WithBlock anyways, so that's what this
change aims to do. This also removes some context.WithTimeout's as I don't see anywhere
the context is actually used in Dial if you don't also specify WithBlock (and it's
especially not used now with NewClient as it doesn't even take in a context).
Signed-off-by: Danny Canter <danny@dcantah.dev>
Currently the metadata snapshotter is not consistently adding keys to a
lease when an already-exists error is returned. When a lease is provided, any
already exists errors should add the relevant key to the lease. It is
not expected that clients must explicitly lease a key after calling
Prepare/Commit.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Pulls in kubernetes-sigs/cri-tools PR 1344 (`KEP-3857: Recursive Read-only (RRO) mounts`)
to test PR 9787
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Since go.mod got updated to go1.22, 1.22 is the minimum version to build
containerd. Even if 1.21.9 is the version present on the host, the go
command will build using the 1.22.0 go version.
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
This just replaces some type casts to check whether a few dial errors are
a specific syscall with the stdlib's errors.As/errors.Is pals.
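The replacement might look roughly like this. The helpers are
hypothetical, and `sampleDialError` only imitates the shape of an
error returned by a failed dial.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"syscall"
)

// isConnRefused reports whether err wraps ECONNREFUSED anywhere in its
// chain, without manually type-asserting each layer.
func isConnRefused(err error) bool {
	return errors.Is(err, syscall.ECONNREFUSED)
}

// dialOp returns the Op of the first *net.OpError in the chain, or "".
func dialOp(err error) string {
	var opErr *net.OpError
	if errors.As(err, &opErr) {
		return opErr.Op
	}
	return ""
}

// sampleDialError imitates the nested error a failed dial returns.
func sampleDialError() error {
	inner := &net.OpError{
		Op:  "dial",
		Net: "unix",
		Err: os.NewSyscallError("connect", syscall.ECONNREFUSED),
	}
	return fmt.Errorf("connecting to shim: %w", inner)
}

func main() {
	err := sampleDialError()
	fmt.Println(isConnRefused(err), dialOp(err))
}
```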
Signed-off-by: Danny Canter <danny@dcantah.dev>
This commit gets rid of the TODO by moving the check to use the
pluginInfo() infrastructure.
The check is only enforced for shims that return info that can be read
as type runtime.Features. For shims that don't provide that, we just
ignore it, as those shims might not be affected by this.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Allow the api to stay at the same v1 go package name and keep using a
1.x version number. This indicates the API is still at 1.x and allows
sharing proto types with containerd 1.6 and 1.7 releases.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Syself Autopilot is a managed Kubernetes solution, added at the end since it's a commercial adopter.
Signed-off-by: Lucas Rattz <lucas.rattz@syself.com>
/usr/sbin/runc is confined with "runc" profile[1] introduced in AppArmor
v4.0.0. This change breaks stopping of containers, because the profile
assigned to containers doesn't accept signals from the "runc" peer.
AppArmor >= v4.0.0 is currently part of Ubuntu Mantic (23.10) and later.
The issue is reproducible both with nerdctl and ctr clients. In the case
of ctr, the --apparmor-default-profile flag has to be specified,
otherwise the container processes would inherit the runc profile, which
behaves as unconfined, and so the subsequent runc process invoked to
stop it would be able to signal it.
Test commands:
root@cloudimg:~# nerdctl run -d --name foo nginx:latest
3d1e74bfe6e7b2912d9223050ae8a81a8f4b73de0846e6d9c956c1e411cdd95a
root@cloudimg:~# nerdctl stop foo
FATA[0000] 1 errors:
unknown error after kill: runc did not terminate successfully: exit status 1: unable to signal init: permission denied
: unknown
or
root@cloudimg:~# ctr pull docker.io/library/nginx:latest
...
root@cloudimg:~# ctr run -d --apparmor-default-profile ctr-default docker.io/library/nginx:latest foo
root@cloudimg:~# ctr task kill foo
ctr: unknown error after kill: runc did not terminate successfully: exit status 1: unable to signal init: permission denied
: unknown
Relevant syslog messages (with long lines wrapped):
Apr 23 22:03:12 cloudimg kernel: audit:
type=1400 audit(1713909792.064:262): apparmor="DENIED"
operation="signal" class="signal" profile="nerdctl-default"
pid=13483 comm="runc" requested_mask="receive"
denied_mask="receive" signal=quit peer="runc"
or
Apr 23 22:05:32 cloudimg kernel: audit:
type=1400 audit(1713909932.106:263): apparmor="DENIED"
operation="signal" class="signal" profile="ctr-default"
pid=13574 comm="runc" requested_mask="receive"
denied_mask="receive" signal=quit peer="runc"
This change extends the default profile with rules that allow receiving
signals from processes that run confined with either runc or crun
profile (crun[2] is an alternative OCI runtime that's also confined in
AppArmor >= v4.0.0, see [1]). It is backward compatible because the peer
value is a regular expression (AARE) so the referenced profile doesn't
have to exist for this profile to successfully compile and load.
[1] https://gitlab.com/apparmor/apparmor/-/commit/2594d936
[2] https://github.com/containers/crun
Signed-off-by: Tomáš Virtus <nechtom@gmail.com>
Fix containerd/nerdctl issue 2730
> [Rootless] `nerdctl rm` fails when AppArmor is loaded:
> `error="unknown error after kill: runc did not terminate successfully: exit status 1:
> unable to signal init: permission denied\n: unknown"`
Caused by:
> kernel: audit: type=1400 audit(1713840662.766:122): apparmor="DENIED" operation="signal" class="signal"
> profile="nerdctl-default" pid=366783 comm="runc" requested_mask="receive" denied_mask="receive" signal=kill
> peer="/usr/local/bin/rootlesskit"
The issue is known to happen on Ubuntu 23.10 and 24.04 LTS.
Doesn't seem to happen on Ubuntu 22.04 LTS.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
This is a non-functional change, that fixes the following typos:
* Snashotter -> Snapshotter
* expectSnapshotter -> expectedSnapshotter
* expectErr -> expectedErr
* exiting-runtime -> existing-runtime
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
go1.21.9 (released 2024-04-03) includes a security fix to the net/http
package, as well as bug fixes to the linker, and the go/types and
net/http packages. See the Go 1.21.9 milestone for more details;
https://github.com/golang/go/issues?q=milestone%3AGo1.21.9+label%3ACherryPickApproved
These minor releases include 1 security fix following the security policy:
- http2: close connections when receiving too many headers
Maintaining HPACK state requires that we parse and process all HEADERS
and CONTINUATION frames on a connection. When a request's headers exceed
MaxHeaderBytes, we don't allocate memory to store the excess headers but
we do parse them. This permits an attacker to cause an HTTP/2 endpoint
to read arbitrary amounts of header data, all associated with a request
which is going to be rejected. These headers can include Huffman-encoded
data which is significantly more expensive for the receiver to decode
than for an attacker to send.
Set a limit on the amount of excess header frames we will process before
closing a connection.
Thanks to Bartek Nowotarski (https://nowotarski.info/) for reporting this issue.
This is CVE-2023-45288 and Go issue https://go.dev/issue/65051.
View the release notes for more information:
https://go.dev/doc/devel/release#go1.22.2
- https://github.com/golang/go/issues?q=milestone%3AGo1.21.9+label%3ACherryPickApproved
- full diff: https://github.com/golang/go/compare/go1.21.8...go1.21.9
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Some of the snapshotters that allow you to change their root location
were already doing this, this just makes all of them follow the same
pattern.
Signed-off-by: Danny Canter <danny@dcantah.dev>
This makes use of pkg/sys's IgnoringEintr function
to clean up some of the redundant eintr loops we
had lying around.
Signed-off-by: Danny Canter <danny@dcantah.dev>
We have quite a few pieces of code lying around containerd
that all loop and ignore eintr as they make syscalls directly
(or use a unix/syscall wrapper) because there's no stdlib
equivalent. This adds a small utility to pkg/sys that we can
use for all of these spots.
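Such a helper can be sketched as below. The names and the use of
errors.Is are assumptions made for illustration; the utility in
pkg/sys may differ in both.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// ignoringEINTR calls fn until it returns something other than EINTR.
func ignoringEINTR(fn func() error) error {
	for {
		err := fn()
		if !errors.Is(err, syscall.EINTR) {
			return err
		}
	}
}

// interruptedNTimes returns a fake syscall that fails with EINTR n times
// before succeeding, plus a counter of how often it was called.
func interruptedNTimes(n int) (func() error, *int) {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return syscall.EINTR
		}
		return nil
	}, &calls
}

func main() {
	fn, calls := interruptedNTimes(2)
	err := ignoringEINTR(fn)
	fmt.Println(err, *calls)
}
```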
Signed-off-by: Danny Canter <danny@dcantah.dev>
This includes migrating from cdi.GetRegistry() to cdi.Configure() and
using top-level cdi Refresh and InjectDevices functions as applicable.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
We are currently in the process of developing a feature to facilitate guest image pulling
on confidential-containers, and we would be grateful for containerd's support in this endeavor.
It would greatly assist our efforts if containerd could provide the pause image name and
add it into the annotations.
Fixes: #9418
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Use the Syncfs wrapper function defined in the golang.org/x/sys/unix
package instead of manually wrapping it in doSyncFs.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Currently the transfer service doesn't handle plain HTTP connections.
This commit fixes this issue by propagating
`(core/remotes/docker/config).HostOptions.DefaultScheme` from client to the
transfer service.
This commit also fixes ctr to use this feature for "--plain-http" flag.
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
Fixes #10013. It seems we can end up in a spot where the sandbox store still
has a listing for a pod, whereas containerd's underlying store has removed it.
It might be better to shield the caller (k8s) from these transient errors.
Signed-off-by: Danny Canter <danny@dcantah.dev>
Set 'DisableSliceFlagSeparator = true'
urfave/cli/v2 uses ',' as default string slice separator.
That means '--mount type=bind,src=/src,des=/des,options=rbind:rw'
will be taken as four bind mount options.
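The effect can be illustrated without the CLI library. This is only a sketch
that mimics comma splitting with the standard library; `splitSliceFlag` is a
hypothetical helper, not a urfave/cli function:

```go
package main

import (
	"fmt"
	"strings"
)

// splitSliceFlag mimics how a string-slice flag value is turned into
// elements: with the separator disabled the raw value stays intact,
// otherwise it is split on ','.
func splitSliceFlag(value string, disableSeparator bool) []string {
	if disableSeparator {
		return []string{value}
	}
	return strings.Split(value, ",")
}

func main() {
	spec := "type=bind,src=/src,des=/des,options=rbind:rw"
	fmt.Println(len(splitSliceFlag(spec, false)), "values when splitting on ','")
	fmt.Println(len(splitSliceFlag(spec, true)), "value with the separator disabled")
}
```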
Fixes: #10003
Signed-off-by: baijia <baijia.wr@antgroup.com>
In order to make sure that we don't publish task exit events for init
processes before we do for execs in that container, we added logic to
`processExits` in 892dc54bd2 to skip these
and let the pending exec's `handleStarted` closure process them.
However, the conditional logic in `processExits` added was faulty - we
should only defer processing of exit events related to init processes,
not other execs. Due to this missing condition,
892dc54bd2 introduced a bug where, if
there are many concurrent execs for the same container/init pid, exec
exits are skipped and then never published, resulting in hanging
clients.
This commit adds the missing logic to `processExits`.
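The intended condition can be sketched as follows. This is a simplified model
and not the shim code itself: the `exit` type, the `shouldDefer` helper and the
shape of the pending-exec map are assumptions made for illustration:

```go
package main

import "fmt"

// exit is a simplified process exit record.
type exit struct {
	containerID string
	isInit      bool
}

// shouldDefer reports whether processing of an exit must be left to the
// pending exec's start handler. Only init exits are deferred, and only
// while the same container still has execs that are being started.
func shouldDefer(e exit, pendingExecs map[string]int) bool {
	return e.isInit && pendingExecs[e.containerID] > 0
}

func main() {
	pending := map[string]int{"c1": 2}
	fmt.Println(shouldDefer(exit{containerID: "c1", isInit: true}, pending))
	fmt.Println(shouldDefer(exit{containerID: "c1", isInit: false}, pending))
}
```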
Signed-off-by: Laura Brehm <laurabrehm@hey.com>
This allows arm64 to pull armhf images.
Before this change the transfer service would reject pulls for armhf on
an arm64 machine, or indeed any such platform variant mismatches.
I would argue that it's a bit weird for the transfer service to reject a
pull at all since there are legitimate reasons to want to pull images
for other architectures, however that's a more philosophical change.
In the case where I ran into this, I have an arm64 machine running
an armhf containerd in an armhf container (for running some basic sanity
checks during packaging).
Tests started failing once `ctr` was moved to use the transfer service
by default.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This TODO was added in 9e6db71954, at which time
the reference package was part of the docker/distribution (registry) repository.
The reference package has moved to a standalone module, which has been in use
since 4923470902, so this should no longer be a
concern.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This package is only used internally in the cri package, which is an internal
package, so we can make the utility internal as well.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This package is only used internally in the cri package, which is an internal
package, so we can make the utility internal as well.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
commit 10c7f03b3b updated google.golang.org/protobuf
to v1.33.0, which addresses CVE-2024-24786, however a follow-up post on the
Golang security list issued a warning that the v1.33.0 update introduced a
breaking change, causing compatibility with github.com/golang/protobuf to be
broken;
> A small correction: This vulnerability applies when the UnmarshalOptions.DiscardUnknown
> option is set (as well as when unmarshaling into any message which contains a
> google.protobuf.Any). There is no UnmarshalUnknown option.
>
> In addition, version 1.33.0 of google.golang.org/protobuf inadvertently
> introduced an incompatibility with the older github.com/golang/protobuf
> module. (https://github.com/golang/protobuf/issues/1596) Users of the older
> module should update to github.com/golang/protobuf@v1.5.4.
Containerd itself does not appear to be using this code, but consumers may be,
so update the github.com/golang/protobuf to restore compatibility.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
For the first version of containerd's "Forensic Container Checkpointing"
support the error message if the CRIU binary is not found was
deliberately wrong to not break Kubernetes e2e_node tests.
Now that the e2e_node tests have been adapted, containerd can return the
correct error message.
Signed-off-by: Adrian Reber <areber@redhat.com>
This connects the new CRI ContainerCheckpoint RPC to the existing
internal checkpoint functions. With this commit it is possible
to checkpoint a container in Kubernetes using the Forensic Container
Checkpointing KEP (#2008):
# curl -X POST "https://localhost:10250/checkpoint/namespace/podId/container"
Which will result in containerd creating a checkpoint in the location
specified by Kubernetes (usually /var/lib/kubelet/checkpoints).
This is a Linux only feature because CRIU only exists on Linux.
Rewritten with the help of Phil Estes.
Signed-off-by: Phil Estes <estesp@gmail.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
CimFS layers don't need to create a new scratch VHD per image. The scratch VHDs used with CimFS are empty so
we can just create one base VHD and one differencing VHD and copy it for every scratch snapshot.
(Note that UVM VHDs are still unique per image because the VHD information is embedded in the UVM BCD during
import)
Signed-off-by: Amit Barve <ambarve@microsoft.com>
Currently transfer service isn't aware of configurations of hosts directory and
ctr's `--hosts-dir` doesn't work.
This commit fixes this issue by using `config.ConfigureHosts` instead of
`docker.ConfigureDefaultRegistries`.
This commit also fixes ctr to use this feature for "--hosts-dir" flag.
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
Split service proxy from service plugin.
Make introspection service easier for clients to use.
Update service proxy to support grpc and ttrpc.
Signed-off-by: Derek McGowan <derek@mcg.dev>
For a given container, as long as the init process is the init process
of that PID namespace, we always receive the exits for execs before we
receive them for the init process.
It's important that we uphold this invariant for the outside world by
always emitting a TaskExit event for a container's exec before we emit
one for the init process because this is the expected behavior from
callers, and changing this creates issues - such as Docker, which will
delete the container after receiving a TaskExit for the init process,
and then not be able to handle the exec's exit after having deleted
the container (see: https://github.com/containerd/containerd/issues/9719).
Since 5cd6210ad0, if an exec is starting
at the same time that an init exits, if the exec is an "early exit"
i.e. we haven't emitted a TaskStart for it/put it in `s.running` by the
time we receive its exit, we notify concurrent calls to `s.Start()` of
the exit and continue processing exits, which will cause us to process
the Init's exit before the exec, and emit it, which we don't want to do.
This commit introduces a map `s.pendingExecs` to keep track of the
number of pending execs keyed by container, which allows us to skip
processing exits for inits if there are pending execs, and instead
have the closure returned by `s.preStart` handle the init exit after
emitting the exec's exit.
Signed-off-by: Laura Brehm <laurabrehm@hey.com>
NRI is still newer and mostly used by CRI plugin. Keep the package in
internal to allow for interfaces as the project matures.
Signed-off-by: Derek McGowan <derek@mcg.dev>
This commit fixes the duplicate copy and configure steps for
the Windows powershell scripts.
Fixes #9887
It also adds the architecture as a variable in preparation for
the ARM64 support that is coming.
Signed-off-by: Anthony Nandaa <profnandaa@gmail.com>
so that the CRI service doesn't have to get the sandbox controller every time it
needs to call the sandbox controller API.
Signed-off-by: Abel Feng <fshb1988@gmail.com>
A downstream library (s3) needs a read seeker to be able to do its own multipart upload.
See: https://github.com/moby/buildkit/pull/4551
Signed-off-by: Adrien Delorme <azr@users.noreply.github.com>
Since kubernetes 1.30, the kubelet will query the runtime handlers
features and only start pods with userns if the runtime handler used for
that pod supports it.
Let's expose the user namespace support to the kubelet.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
These are standard environment variables described by the otel spec in
https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/.
The old config options are removed
Also since otel will by default try to connect to https://localhost:4318
if no endpoint is set, this will also just disable the otlp plugin when
there is no endpoint so we don't have otel continuously trying to
connect to the default endpoint, littering the logs with connection
failure messages and collecting traces that won't go anywhere.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Fixes #9806
go-grpc-prometheus is deprecated. The new location it was moved to also introduced
an entirely new api, but afaict this matches what we have at the moment.
Signed-off-by: Danny Canter <danny@dcantah.dev>
As `setupSandboxFiles` is done in the sandbox controller, it is difficult
to know here whether the sandbox controller has finished and where the host
path is. Make sure the host path exists before adding it to the Linux container
mounts; otherwise, the container would get some unnecessary mounts.
Signed-off-by: Zhang Tianyang <burning9699@gmail.com>
See kubernetes/enhancements issue 3857 (PR 3858).
Replaces PR 9713 `cri: make read-only mounts recursively read-only`
Unlike PR 9713, this PR does not automatically upgrade RO mounts to RRO.
Test depends on:
- kubernetes-sigs/cri-tools PR 1344
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Adds a plugin type for container monitor.
Rename the task monitor type to avoid confusion.
Add config migration for new plugin types to pass existing migration
tests.
Signed-off-by: Derek McGowan <derek@mcg.dev>
It's used to check new release containerd can parse metric data from existing
shim created by previous release.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
Followed the Migration Guide at https://cli.urfave.org/migrate-v1-to-v2/
The major changes not pointed out in the migration guide are:
- context.Args() no longer produces a []string, so context.Args().Slice()
is substituted
- All cli.Global***** are deprecated (the migration guide is somewhat
unclear on this)
Signed-off-by: Derek Nola <derek.nola@suse.com>
Vendor in urfave cli/v2
Signed-off-by: Derek Nola <derek.nola@suse.com>
Fix NewStringSlice calls
Signed-off-by: Derek Nola <derek.nola@suse.com>
On startup, collection may finish so fast that `gcTimeSum` is `0`. In this case
the algorithm turns into an infinite loop that simply consumes CPU, because the
timer fires without any interval.
Use `5ms` as a fallback, which gives an interval of `245ms` for that case.
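The arithmetic behind the `245ms` figure can be sketched as below. The formula
(interval = avg/threshold - avg) and the pause threshold of `0.02` are
assumptions inferred from the numbers in this message; `nextInterval` is a
hypothetical helper, not the scheduler's actual function:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// fallbackGCTime is used when the measured average GC time is zero.
const fallbackGCTime = 5 * time.Millisecond

// nextInterval computes how long to wait before the next collection so that
// GC takes up at most pauseThreshold of the total time. Without the
// fallback, a zero average would yield a zero interval and a busy loop.
func nextInterval(avgGCTime time.Duration, pauseThreshold float64) time.Duration {
	if avgGCTime <= 0 {
		avgGCTime = fallbackGCTime
	}
	avg := float64(avgGCTime)
	return time.Duration(math.Round(avg/pauseThreshold - avg))
}

func main() {
	fmt.Println(nextInterval(0, 0.02))
}
```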
Closes: https://github.com/containerd/containerd/issues/5089
Signed-off-by: Kirill A. Korinsky <kirill@korins.ky>
Schema 1 (`application/vnd.docker.distribution.manifest.v1+prettyjws`) has been
officially deprecated since containerd v1.7 (PR 6884).
We have planned to remove the support for Schema 1 in containerd v2.0, but this
removal may still surprise some users.
So, in containerd v2.0 we will just disable it by default.
The support for Schema 1 can be still enabled by setting an environment variable
`CONTAINERD_ENABLE_DEPRECATED_PULL_SCHEMA_1_IMAGE=1`, however, this workaround
will be completely removed in containerd v2.1.
Schema 2 was introduced in Docker 1.10 (Feb 2016), so most users should
have been already using Schema 2 or OCI.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Propagate the deprecation list to CRI runtime conditions.
The propagated conditions are visible via `crictl info`,
but not visible via `kubectl get nodes -o yaml` yet, although
the CRI API says "These conditions will be exposed to users to help
them understand the status of the system".
https://github.com/kubernetes/cri-api/blob/v0.29.1/pkg/apis/runtime/v1/api.proto#L1505-L1509
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Print deprecation warnings on any ctr command, as users won't notice the
deprecations until we actually remove the deprecated features.
The warnings can be suppressed by setting
`CONTAINERD_SUPPRESS_DEPRECATION_WARNINGS=1`.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
The links GitHub Action workflow is failing after a refactor to move the
CRI package from pkg/cri to internal/cri. The note which contained the
link is no longer needed as CRI plugin has been internal since v1.5.
Signed-off-by: Austin Vazquez <macedonv@amazon.com>
Packages related to transfer and unpacking provide core interfaces which
use other core interfaces and part of common functionality.
Signed-off-by: Derek McGowan <derek@mcg.dev>
We added support for userns but we weren't showing it in the
podSandboxStatus.
Let's just show the whole nsOpts, so we don't forget in the future
either if something else inside there changes.
Please note that this will expose the content of nsOpts.TargetId that we
weren't exposing before. But that seemed like a bug to me.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Maybe this is better?
The metadata store is in the best place to handle events directly after
the database has been updated. This prevents every user of the image
store interface from having to know whether or not they are responsible
for publishing events and avoid double events if the grpc local service
is used.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Only newer versions of strace support the `--detach-on` option and
time durations given as human-readable strings.
With strace 4.x, use `-b` in place of `--detach-on`,
and inject the delay as an integer number of microseconds.
Signed-off-by: Zoe <hi@zoe.im>
This commit adds an extra (optional) step for the Windows
installation/set-up to include the containerd binaries in
the $env:Path so that later executions, especially of `ctr.exe`,
do not require specifying the full path.
It also further fixes the previous steps to be absolute and
also work with re-installations and upgrades.
Signed-off-by: Anthony Nandaa <profnandaa@gmail.com>
Previously, resolveImports would apply a glob filter if
the path contained any '*', or otherwise convert relative
paths to absolute. This meant that it was impossible to
specify globs with paths relative to the main config file.
This commit first resolves relative to absolute paths, then
applies the glob filter (if any). A test case is added to ensure
that this now works as expected.
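A rough sketch of the new ordering, assuming Unix-style paths. The function
names and signatures here are illustrative and do not reproduce the real
resolveImports:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// absImport resolves importPath relative to the directory of the main
// config file, leaving absolute paths untouched.
func absImport(configPath, importPath string) string {
	if filepath.IsAbs(importPath) {
		return importPath
	}
	return filepath.Join(filepath.Dir(configPath), importPath)
}

// resolveImports first makes every import absolute and only then expands
// globs, so a relative pattern is matched next to the main config file.
func resolveImports(configPath string, imports []string) ([]string, error) {
	var out []string
	for _, imp := range imports {
		abs := absImport(configPath, imp)
		if !strings.Contains(abs, "*") {
			out = append(out, abs)
			continue
		}
		matches, err := filepath.Glob(abs)
		if err != nil {
			return nil, err
		}
		out = append(out, matches...)
	}
	return out, nil
}

func main() {
	fmt.Println(absImport("/etc/containerd/config.toml", "conf.d/*.toml"))
}
```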
Signed-off-by: Angelos Kolaitis <neoaggelos@gmail.com>
We also need an additional check to avoid setting both the error and
response which can create a race where they can arrive in the receiving
thread in either order.
If we hit an error, we don't need to send the response.
> There is a condition where the registry (unexpectedly, not to spec)
> returns 201 or 204 on the put before the body is fully written. I would
> expect that the http library would issue close and could fall into a
> deadlock here. We could just read respC and call setResponse. In that
> case ErrClosedPipe would get returned and Commit shouldn't be called
> anyway.
Signed-off-by: Justin Chadwell <me@jedevc.com>
If sending two messages from goroutine X:
a <- 1
b <- 2
And receiving them in goroutine Y:
select {
case <- a:
case <- b:
}
Either branch of the select can trigger first - so when we call
.setError and .Close next to each other, we don't know whether the done
channel will close first or the error channel will receive first - so
sometimes, we get an incorrect error message.
We resolve this by not sending both signals - instead, we can have
.setError *imply* .Close, by having the pushWriter call .Close on
itself, after receiving an error.
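A stripped-down model of this pattern is sketched below. The `writer` type and
its methods are illustrative only and are much simpler than the real
pushWriter:

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"sync"
)

var errRequestFailed = errors.New("request failed")

// writer is a minimal stand-in with an error channel and a done channel.
type writer struct {
	errC      chan error
	done      chan struct{}
	closeOnce sync.Once
}

func newWriter() *writer {
	return &writer{errC: make(chan error), done: make(chan struct{})}
}

// Close marks the writer as done; it is safe to call more than once.
func (w *writer) Close() {
	w.closeOnce.Do(func() { close(w.done) })
}

// setError delivers the error but does not close done itself, so the
// receiver never sees two competing signals. After Close it is a no-op.
func (w *writer) setError(err error) {
	select {
	case w.errC <- err:
	case <-w.done:
	}
}

// wait is the receiving side: an error implies Close, performed here.
func (w *writer) wait() error {
	select {
	case err := <-w.errC:
		w.Close()
		return err
	case <-w.done:
		return io.ErrClosedPipe
	}
}

// closed reports whether Close has been called.
func (w *writer) closed() bool {
	select {
	case <-w.done:
		return true
	default:
		return false
	}
}

func main() {
	w := newWriter()
	go w.setError(errRequestFailed)
	fmt.Println(w.wait(), w.closed())
}
```

Because only one signal is ever sent per failure, the receiver cannot observe
the close before the error.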
Signed-off-by: Justin Chadwell <me@jedevc.com>
If a writer continually asks to be reset then it should always succeed -
it should be the responsibility of the underlying content.Writer to
stop producing ErrReset after some amount of time and to instead return
the underlying issue - which pushWriter already does today, using the
doWithRetries function.
doWithRetries already has a separate cap for retries of 6 requests (5
retries after the original failure), and it seems like this would be
previously overridden by content.Copy's max number of 5 attempts, hiding
the original error.
Signed-off-by: Justin Chadwell <me@jedevc.com>
If we get io.ErrClosedPipe in pushWriter.Write, there are three possible
scenarios:
- The request has failed, we need to attempt a reset, so we can expect a
new pipe incoming on pipeC.
- The request has failed, we don't need to attempt a reset, so we can
expect an incoming error on errC.
- Something else externally has called Close, so we can expect the done
channel to be closed.
This patch ensures that we block for as long as possible (while still
handling each of the above cases, so we avoid hanging), to make sure
that we properly return an appropriate error message each time.
Signed-off-by: Justin Chadwell <me@jedevc.com>
If Close is called externally before a request is attempted, then we
will accidentally attempt to send to a closed channel, causing a panic.
To avoid this, we can check to see if Close has been called, using a
done channel. If this channel is ever done, we drop any incoming errors,
requests or pipes - we don't need them, since we're done.
Signed-off-by: Justin Chadwell <me@jedevc.com>
io.Pipe produces a PipeReader and a PipeWriter - a close on the write
side, causes an error on both the read and write sides, while a close on
the read side causes an error on only the read side. Previously, we
explicitly prohibited closing from the read side.
However, http.Request.Body requires that "calling Close should unblock a
Read waiting for input". Our reader will not do this - calling close
becomes a no-op. This can cause a deadlock because client.Do may never
terminate in some circumstances.
We need the Reader side to close its side of the pipe as well, which it
already does using the go standard library - otherwise, we can hang
forever, writing to a pipe that will never be closed.
Allowing the requester to close the body should be safe - we never reuse
the same reader between requests, as the result of body() will never be
reused by the guarantees of the standard library.
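The behavior required by http.Request.Body can be demonstrated with a plain
io.Pipe. This is a standalone sketch and `readUnblockedByClose` is a
hypothetical helper, not code from this change:

```go
package main

import (
	"fmt"
	"io"
	"time"
)

// readUnblockedByClose starts a Read on the read side of a pipe that will
// never receive data, then closes the read side. It reports whether the
// Read returned within two seconds and which error it returned.
func readUnblockedByClose() (unblocked bool, err error) {
	pr, pw := io.Pipe()
	defer pw.Close()

	result := make(chan error, 1)
	go func() {
		_, readErr := pr.Read(make([]byte, 1))
		result <- readErr
	}()

	// Closing the reader is what the HTTP client does when it gives up
	// on a request body.
	pr.Close()

	select {
	case err = <-result:
		return true, err
	case <-time.After(2 * time.Second):
		return false, nil
	}
}

func main() {
	fmt.Println(readUnblockedByClose())
}
```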
Signed-off-by: Justin Chadwell <me@jedevc.com>
If we find that DNSConfig is provided and empty (not nil), we should not
replace it with the host's resolv.conf.
Also adds tests.
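The nil versus empty distinction can be sketched like this. The `dnsConfig`
type and `resolvConf` helper are hypothetical, and rendering an empty config as
empty content is an assumption made for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// dnsConfig is a simplified stand-in for the CRI DNS settings.
type dnsConfig struct {
	Servers  []string
	Searches []string
	Options  []string
}

// resolvConf returns the host's resolv.conf content only when no DNS config
// was provided at all (nil). A provided config is always rendered, even if
// it is empty.
func resolvConf(cfg *dnsConfig, hostResolvConf string) string {
	if cfg == nil {
		return hostResolvConf
	}
	var b strings.Builder
	if len(cfg.Searches) > 0 {
		fmt.Fprintf(&b, "search %s\n", strings.Join(cfg.Searches, " "))
	}
	for _, s := range cfg.Servers {
		fmt.Fprintf(&b, "nameserver %s\n", s)
	}
	if len(cfg.Options) > 0 {
		fmt.Fprintf(&b, "options %s\n", strings.Join(cfg.Options, " "))
	}
	return b.String()
}

func main() {
	host := "nameserver 10.0.0.1\n"
	fmt.Printf("%q\n", resolvConf(nil, host))
	fmt.Printf("%q\n", resolvConf(&dnsConfig{}, host))
}
```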
Signed-off-by: Tim Hockin <thockin@google.com>
Prior to this commit, `readOnly` volumes were not recursively read-only and
could result in compromise of data;
e.g., even if `/mnt` was mounted as read-only, its submounts such as
`/mnt/usbstorage` were not read-only.
This commit utilizes runc's "rro" bind mount option to make read-only bind
mounts literally read-only. The "rro" bind mount option is implemented by
calling `mount_setattr(2)` with `MOUNT_ATTR_RDONLY` and `AT_RECURSIVE`.
The "rro" bind mount option requires kernel >= 5.12, with runc >= 1.1 or
a compatible runtime such as crun >= 1.4.
When the "rro" bind mount option is not available, containerd falls back
to the legacy non-recursive read-only mounts by default.
The behavior is configurable via `/etc/containerd/config.toml`:
```toml
version = 2
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
# treat_ro_mounts_as_rro ("Enabled"|"IfPossible"|"Disabled")
# treats read-only mounts as recursive read-only mounts.
# An empty string means "IfPossible".
# "Enabled" requires Linux kernel v5.12 or later.
# This configuration does not apply to non-volume mounts such as "/sys/fs/cgroup".
treat_ro_mounts_as_rro = ""
```
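How the three modes could map to a decision is sketched below. `useRRO` is a
hypothetical helper, and returning an error for "Enabled" on an unsupported
host is an assumption, not a description of the actual implementation:

```go
package main

import (
	"errors"
	"fmt"
)

var errRRONotSupported = errors.New("recursive read-only mounts are not supported on this host")

// useRRO decides whether a read-only mount should be made recursively
// read-only, given the configured mode and whether the kernel and runtime
// support the "rro" option.
func useRRO(mode string, supported bool) (bool, error) {
	switch mode {
	case "", "IfPossible":
		// Fall back to legacy non-recursive read-only mounts if needed.
		return supported, nil
	case "Enabled":
		if !supported {
			return false, errRRONotSupported
		}
		return true, nil
	case "Disabled":
		return false, nil
	default:
		return false, fmt.Errorf("invalid treat_ro_mounts_as_rro value %q", mode)
	}
}

func main() {
	fmt.Println(useRRO("", true))
	fmt.Println(useRRO("Enabled", false))
}
```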
Replaces:
- kubernetes/enhancements issue 3857
- kubernetes/enhancements PR 3858
Note: this change does not affect non-CRI clients such as ctr, nerdctl, and Docker/Moby.
RRO mounts have been supported since nerdctl v0.14 (containerd/nerdctl PR 511)
and Docker v25 (moby/moby PR 45278).
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
The file was replaced with the "Please update your bookmark" page on
Apr 1, 2022 (PR 6758).
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
The log package was kept because hcsshim had a dependency. This was
removed in https://github.com/microsoft/hcsshim/pull/1898. So, its not
required to maintain the containerd/containerd/log package anymore.
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
The new `PluginInfo()` call can be used for introspecting the details
of the runtime plugin.
```console
$ ctr plugins inspect-runtime --runtime=io.containerd.runc.v2 --runc-binary=runc
{
"Name": "io.containerd.runc.v2",
"Version": {
"Version": "v2.0.0-beta.0-XX-gXXXXXXXXX.m",
"Revision": "v2.0.0-beta.0-XX-gXXXXXXXXX.m"
},
"Options": {
"binary_name": "runc"
},
"Features": {
"ociVersionMin": "1.0.0",
"ociVersionMax": "1.1.0-rc.2",
...,
},
"Annotations": null
}
```
The shim binary has to support `-info` flag, see `runtime/v2/README.md`
Replaces PR 8509 (`api/services/task: add RuntimeInfo()`)
Co-authored-by: Derek McGowan <derek@mcg.dev>
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
We should set the sandbox CreatedAt a first time when we create the sandbox
struct, and then set it a second time after the container has started.
Before this commit, we only set the sandbox CreatedAt after the container
started, so if network creation failed, the sandbox time was the
default time, which is 269 years ago. We therefore need to set the sandbox
CreatedAt up front, even if an error occurs before the container starts.
Signed-off-by: zzzzzzzzzy9 <zhang.yu58@zte.com.cn>
This commit addresses issue #7318 by introducing events broadcasting
to the current implementation. The integration/container_event_test.go
is extended to demonstrate the broadcasting capabilities
of two simultaneous connected clients.
Signed-off-by: Yury Gargay <yury.gargay@gmail.com>
Signed-off-by: krglosse <krglosse@us.ibm.com>
do not alter original slice
Signed-off-by: krglosse <krglosse@us.ibm.com>
Update core/mount/temp.go
makes sense, thank you!
Co-authored-by: Derek McGowan <derek@mcg.dev>
Signed-off-by: KodieGlosserIBM <39170759+KodieGlosserIBM@users.noreply.github.com>
do not copy mount structure unless conditional is met and adding a test case for it
Signed-off-by: krglosse <krglosse@us.ibm.com>
copy option slice when removing the element instead of giving the element an empty string
remove unneeded block
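A small sketch of removing an option without touching the caller's slice. The
`withoutOption` helper is hypothetical and, unlike the change described here,
always copies:

```go
package main

import "fmt"

// withoutOption returns a new slice with every occurrence of drop removed.
// The input slice is left untouched, so callers sharing it are unaffected.
func withoutOption(opts []string, drop string) []string {
	out := make([]string, 0, len(opts))
	for _, o := range opts {
		if o != drop {
			out = append(out, o)
		}
	}
	return out
}

func main() {
	orig := []string{"ro", "rbind", "noatime"}
	fmt.Println(withoutOption(orig, "rbind"), orig)
}
```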
Signed-off-by: krglosse <krglosse@us.ibm.com>
simplify
Signed-off-by: krglosse <krglosse@us.ibm.com>
A media type string passed via `WithMediaType()` was not propagated
to a descriptor returned by `FetchByDigest()`.
Follow-up to PR 8744
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Ensure migration picks up defaults and correct ordering from the plugin
configuration. Ensures that the migration matches the behavior of the
default output and how the configuration will be loaded.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Merge slices while checking for equal values rather than always
appending. Remove setting Import to prevent migrations from setting
incorrect configuration Imports.
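The merge behavior can be sketched as follows; `mergeUnique` is a hypothetical
helper for string slices and not the function used by the config code:

```go
package main

import "fmt"

// mergeUnique appends the values from src that are not already present in
// dst, keeping the existing order instead of always appending.
func mergeUnique(dst, src []string) []string {
	for _, s := range src {
		found := false
		for _, d := range dst {
			if d == s {
				found = true
				break
			}
		}
		if !found {
			dst = append(dst, s)
		}
	}
	return dst
}

func main() {
	fmt.Println(mergeUnique([]string{"a", "b"}, []string{"b", "c"}))
}
```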
Signed-off-by: Derek McGowan <derek@mcg.dev>
Remove default unpack configuration to prevent duplication of
configuration from toml decoder appending to the default. When no unpack
configuration is provided, use the default.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Prior to this commit, `--local=false` had to be explicitly specified to
opt-in to the transfer service
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
actions/upload-artifact@v4 marks artifacts as immutable. Thus, tests
which use matrix should have a unique artifact name while using
upload-artifact github action
Ref: https://github.com/actions/upload-artifact/releases/tag/v4.0.0
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
This change simplifies the CRI plugin dependencies by not requiring the
CRI image plugin to depend on any other CRI components. Since other CRI
plugins depend on the image plugin, this prevents a dependency
cycle for CRI configurations on a base plugin.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Updates the CRI image service to own image related configuration and
separate it from the runtime configuration.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Prepares the CRI image service for splitting CRI into multiple plugins.
Also prepares for config migration which will spread across multiple
different plugins.
Signed-off-by: Derek McGowan <derek@mcg.dev>
The interface that combines both content.InfoProvider and
content.Provider was duplicated in multiple places - create one directly
in `content` package and use it instead.
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
This dependency was removed in 2af6db672e, but
was re-introduced in commit 2fab240f21.
Now that golang.org/x/tools also stopped using this dependency, removing
this use will remove the package from our dependency tree.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Also refactor tests to use the t.Run and run each test concurrently in a
separate namespace.
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
Persist manifest/indexes distribution source labels as annotations in
the index.json. This could allow the importer to fetch the missing blobs
from the external repository.
These can't really be persisted directly in blob descriptors because
that would alter the digests.
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
Allow importing/exporting archives which don't have all the referenced
blobs. This allows exporting/importing an image with only some of the
platforms available locally while still persisting the full index.
> The blobs directory MAY be missing referenced blobs, in which case the missing blobs SHOULD be fulfilled by an external blob store.
https://github.com/opencontainers/image-spec/blob/v1.0/image-layout.md#blobs
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
Fixes ae7021300
In ae7021300 the WritePidFile and WriteAddress functions were
changed to use AtomicFile instead of os.CreateFile. However,
AtomicFile creates a temporary file and then changes its permissions
with os.Chmod which alters the previously observed behavior of
os.CreateFile which takes the system's umask into account.
This means that on Linux-based systems these files suddenly
became world writable (#9363). The address file has since been
removed, but pid-file was still created as world writable. This
commit explicitly requests 0644 permissions as even on systems
without default umask of 0022 there is no reason to have these
two files world writable.
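The pattern can be sketched as below, assuming a Unix-like system. The helper
names are hypothetical and this is not the AtomicFile implementation; the point
is that os.Chmod ignores the umask, so the mode has to be requested explicitly:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// writeFileAtomic writes data to a temporary file in the target directory,
// sets its permissions explicitly and renames it into place.
func writeFileAtomic(path string, data []byte, perm os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly after a successful rename

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// Unlike file creation, Chmod does not take the umask into account.
	if err := os.Chmod(tmp.Name(), perm); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// pidFileMode writes a pid file into a scratch directory with 0644
// permissions and returns the mode that ended up on disk.
func pidFileMode() (os.FileMode, error) {
	dir, err := os.MkdirTemp("", "pidfile-demo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "containerd.pid")
	pid := []byte(strconv.Itoa(os.Getpid()))
	if err := writeFileAtomic(path, pid, 0o644); err != nil {
		return 0, err
	}
	info, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	return info.Mode().Perm(), nil
}

func main() {
	mode, err := pidFileMode()
	fmt.Println(mode, err)
}
```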
Signed-off-by: Jaroslav Jindrak <dzejrou@gmail.com>
Before this, containerd would always print an error when shutting down;
ERRO[2023-12-07T14:35:00.070333131Z] failed to close plugin error="context canceled" id=io.containerd.internal.v1.tracing
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Fix CI badge links to filter only merge_group events as main branch
now uses merge_group to push changes. This will give a more accurate
representation of the status of CI rather than showing the complete
status of CI workflow.
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
Scratch images don't necessarily have the /etc/group file, so we shouldn't
fail if opening/parsing it is not needed, i.e. if all the groups to add are numeric.
Signed-off-by: Djordje Lukic <djordje.lukic@docker.com>
It's to ensure the data integrity during unexpected power failure.
Background:
Since release 1.3, on Linux systems, containerd unpacks and writes files into
overlayfs snapshot directly. It doesn’t involve any mount-umount operations
so that the performance of pulling image has been improved.
As we know, the umount syscall for overlayfs will force kernel to flush
all the dirty pages into disk. Without umount syscall, the files’ data relies
on kernel’s writeback threads or filesystem's commit setting (for
instance, ext4 filesystem).
The files in a committed snapshot can be lost after unexpected power failure.
However, the snapshot has been committed and the metadata also has been
fsynced. There is data inconsistency between snapshot metadata and files
in that snapshot.
We, containerd, received several issues about data loss after unexpected
power failure.
* https://github.com/containerd/containerd/issues/5854
* https://github.com/containerd/containerd/issues/3369#issuecomment-1787334907
Solution:
* Option 1: SyncFs after unpack
Linux platform provides [syncfs][syncfs] syscall to synchronize just the
filesystem containing a given file.
* Option 2: Fsync directories recursively and fsync on regular file
The fsync doesn't support symlink/block device/char device files. We
need to use fsync the parent directory to ensure that entry is
persisted.
However, based on [xfstest-dev][xfstest-dev], there is no case to ensure
fsync-on-parent can persist the special file's metadata, for example,
uid/gid, access mode.
Check out [generic/690][generic/690]: Syncing the parent dir can persist a
symlink. But for f2fs, it needs a special mount option. And it doesn't say
that uid/gid can be persisted. All the details are behind the
implementation.
> NOTE: All the related test cases have `_flakey_drop_and_remount` in
[xfstest-dev].
Based on discussion about [Documenting the crash-recovery guarantees of Linux file systems][kernel-crash-recovery-data-integrity],
we can't rely on Fsync-on-parent.
* Option 1 is the winner
This patch uses option 1.
These are the test results based on [test-tool][test-tool].
All the networking traffic created by pull is local.
* Image: docker.io/library/golang:1.19.4 (992 MiB)
* Current: 5.446738579s
* WIOS=21081, WBytes=1329741824, RIOS=79, RBytes=1197056
* Option 1: 6.239686088s
* WIOS=34804, WBytes=1454845952, RIOS=79, RBytes=1197056
* Option 2: 1m30.510934813s
* WIOS=42143, WBytes=1471397888, RIOS=82, RBytes=1209344
* Image: docker.io/tensorflow/tensorflow:latest (1.78 GiB, ~32590 Inodes)
* Current: 8.852718042s
* WIOS=39417, WBytes=2412818432, RIOS=2673, RBytes=335987712
* Option 1: 9.683387174s
* WIOS=42767, WBytes=2431750144, RIOS=89, RBytes=1238016
* Option 2: 1m54.302103719s
* WIOS=54403, WBytes=2460528640, RIOS=1709, RBytes=208237568
Option 1 will increase `wios`. So, `image_pull_with_sync_fs` is an
option in the CRI plugin.
[syncfs]: <https://man7.org/linux/man-pages/man2/syncfs.2.html>
[xfstest-dev]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git>
[generic/690]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/tests/generic/690?h=v2023.11.19>
[kernel-crash-recovery-data-integrity]: <https://lore.kernel.org/linux-fsdevel/1552418820-18102-1-git-send-email-jaya@cs.utexas.edu/>
[test-tool]: <a17fb2010d/contrib/syncfs/containerd/main_test.go (L51)>
Signed-off-by: Wei Fu <fuweid89@gmail.com>
There are many Kubernetes clusters running on ARM64. Enabling the ARM64 runner
is a commitment to officially support the ARM64 platform.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
All components need to use a consistent `semconv` version or OTel
will emit errors about "cannot merge resource due to conflicting Schema URL".
Switch to the appropriate semconv version, which requires dropping
usage of `httpconv`. Instead, the upstream HTTP client hooks are
used directly. (The lower-level functions are no longer exported by
OTel.)
Signed-off-by: Milas Bowman <milas.bowman@docker.com>
go1.21.5 (released 2023-12-05) includes security fixes to the go command,
and the net/http and path/filepath packages, as well as bug fixes to the
compiler, the go command, the runtime, and the crypto/rand, net, os, and
syscall packages. See the Go 1.21.5 milestone on our issue tracker for
details:
- https://github.com/golang/go/issues?q=milestone%3AGo1.21.5+label%3ACherryPickApproved
- full diff: https://github.com/golang/go/compare/go1.21.4...go1.21.5
from the security mailing:
[security] Go 1.21.5 and Go 1.20.12 are released
Hello gophers,
We have just released Go versions 1.21.5 and 1.20.12, minor point releases.
These minor releases include 3 security fixes following the security policy:
- net/http: limit chunked data overhead
A malicious HTTP sender can use chunk extensions to cause a receiver
reading from a request or response body to read many more bytes from
the network than are in the body.
A malicious HTTP client can further exploit this to cause a server to
automatically read a large amount of data (up to about 1GiB) when a
handler fails to read the entire body of a request.
Chunk extensions are a little-used HTTP feature which permit including
additional metadata in a request or response body sent using the chunked
encoding. The net/http chunked encoding reader discards this metadata.
A sender can exploit this by inserting a large metadata segment with
each byte transferred. The chunk reader now produces an error if the
ratio of real body to encoded bytes grows too small.
Thanks to Bartek Nowotarski for reporting this issue.
This is CVE-2023-39326 and Go issue https://go.dev/issue/64433.
- cmd/go: go get may unexpectedly fallback to insecure git
Using go get to fetch a module with the ".git" suffix may unexpectedly
fallback to the insecure "git://" protocol if the module is unavailable
via the secure "https://" and "git+ssh://" protocols, even if GOINSECURE
is not set for said module. This only affects users who are not using
the module proxy and are fetching modules directly (i.e. GOPROXY=off).
Thanks to David Leadbeater for reporting this issue.
This is CVE-2023-45285 and Go issue https://go.dev/issue/63845.
- path/filepath: retain trailing \ when cleaning paths like \\?\c:\
Go 1.20.11 and Go 1.21.4 inadvertently changed the definition of the
volume name in Windows paths starting with \\?\, resulting in
filepath.Clean(\\?\c:\) returning \\?\c: rather than \\?\c:\ (among
other effects). The previous behavior has been restored.
This is an update to CVE-2023-45283 and Go issue https://go.dev/issue/64028.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
go1.21.4 (released 2023-11-07) includes security fixes to the path/filepath
package, as well as bug fixes to the linker, the runtime, the compiler, and
the go/types, net/http, and runtime/cgo packages. See the Go 1.21.4 milestone
on our issue tracker for details:
- https://github.com/golang/go/issues?q=milestone%3AGo1.21.4+label%3ACherryPickApproved
- full diff: https://github.com/golang/go/compare/go1.21.3...go1.21.4
from the security mailing:
[security] Go 1.21.4 and Go 1.20.11 are released
Hello gophers,
We have just released Go versions 1.21.4 and 1.20.11, minor point releases.
These minor releases include 2 security fixes following the security policy:
- path/filepath: recognize `\??\` as a Root Local Device path prefix.
On Windows, a path beginning with `\??\` is a Root Local Device path equivalent
to a path beginning with `\\?\`. Paths with a `\??\` prefix may be used to
access arbitrary locations on the system. For example, the path `\??\c:\x`
is equivalent to the more common path c:\x.
The filepath package did not recognize paths with a `\??\` prefix as special.
Clean could convert a rooted path such as `\a\..\??\b` into
the root local device path `\??\b`. It will now convert this
path into `.\??\b`.
`IsAbs` did not report paths beginning with `\??\` as absolute.
It now does so.
VolumeName now reports the `\??\` prefix as a volume name.
`Join(`\`, `??`, `b`)` could convert a seemingly innocent
sequence of path elements into the root local device path
`\??\b`. It will now convert this to `\.\??\b`.
This is CVE-2023-45283 and https://go.dev/issue/63713.
- path/filepath: recognize device names with trailing spaces and superscripts
The `IsLocal` function did not correctly detect reserved names in some cases:
- reserved names followed by spaces, such as "COM1 ".
- "COM" or "LPT" followed by a superscript 1, 2, or 3.
`IsLocal` now correctly reports these names as non-local.
This is CVE-2023-45284 and https://go.dev/issue/63713.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
If a snapshot removal fails (during garbage collection), the entire garbage collection operation is
cancelled. This is problematic because once cleanup of any snapshot fails, no other snapshots will be cleaned
and the disk usage will just keep increasing.
The solution is to return snapshot removal errors wrapped as "ErrFailedPrecondition" errors. The garbage
collector continues cleanup if the error is of this type.
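A minimal sketch of the skip-and-continue behaviour described above. The sentinel `errFailedPrecondition` is a local stand-in for the errdefs error, and `removeSnapshot` and `collect` are hypothetical names used only for illustration; this is not the actual garbage collector code.

```go
package main

import (
	"errors"
	"fmt"
)

// errFailedPrecondition is a local stand-in for the errdefs sentinel;
// it is declared here only to keep the sketch self-contained.
var errFailedPrecondition = errors.New("failed precondition")

// removeSnapshot is a hypothetical removal function. It fails for keys in
// the broken set and wraps the failure as a failed-precondition error.
func removeSnapshot(key string, broken map[string]bool) error {
	if broken[key] {
		return fmt.Errorf("remove snapshot %q: %w", key, errFailedPrecondition)
	}
	return nil
}

// collect tries to remove every key. Removals that fail with a
// failed-precondition error are skipped; any other error aborts the run.
func collect(keys []string, broken map[string]bool) (removed, skipped []string, err error) {
	for _, k := range keys {
		rerr := removeSnapshot(k, broken)
		switch {
		case rerr == nil:
			removed = append(removed, k)
		case errors.Is(rerr, errFailedPrecondition):
			skipped = append(skipped, k)
		default:
			return removed, skipped, rerr
		}
	}
	return removed, skipped, nil
}

func main() {
	removed, skipped, err := collect([]string{"a", "b", "c"}, map[string]bool{"b": true})
	fmt.Println(removed, skipped, err)
}
```

With this shape, one stuck snapshot no longer prevents the remaining snapshots from being cleaned up.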
Signed-off-by: Amit Barve <ambarve@microsoft.com>
The runtime-spec just merged this PR:
https://github.com/opencontainers/runtime-spec/pull/1224
This means that it is now possible to request idmap mounts by specifying
"idmap" or "ridmap" in the mount options, without any mappings.
Let's add a check to see if they are requested in that way too.
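As a rough illustration, such a check could look like the sketch below. The function name `requestsIDMap` is hypothetical, and this is not the code added by this commit.

```go
package main

import "fmt"

// requestsIDMap reports whether the mount options ask for an idmapped
// mount via the bare "idmap" or "ridmap" option (no explicit mappings).
func requestsIDMap(options []string) bool {
	for _, o := range options {
		if o == "idmap" || o == "ridmap" {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(requestsIDMap([]string{"rbind", "ro", "idmap"}))
	fmt.Println(requestsIDMap([]string{"rbind", "ro"}))
}
```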
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
For backward compatibility, we should get runtimeInfo from sandbox in
db, or get it from the sandbox container in db.
Note that this is a temporary solution and we will remove the Container field in
Sandbox in cri cache, and replace it with a SandboxInsantance of type
containerd.Sandbox interface.
Signed-off-by: Abel Feng <fshb1988@gmail.com>
This is mostly to work around an issue with gRPC based shims after containerd
restart. If a shim dies while containerd is also down/restarting, on reboot
grpc.DialContext with our current set of DialOptions will make us wait for
100 seconds per shim even if the socket no longer exists or has no listener.
Signed-off-by: Danny Canter <danny@dcantah.dev>
This change removes the hard-coded containerd endpoint
for the CRI test and uses the address in the config, which
honors the CLI flag.
Signed-off-by: Yibo Zhuang <yibzhuang@gmail.com>
This commit fixes the dialer function to make sure that
"npipe://" prefix is trimmed, just like the way it is done
in the Unix counterpart, `./dialer_unix.go:50`
This will also unblock some downstream work going on in
buildkit; setting up integration tests to run on Windows.
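The prefix handling can be sketched as follows. `trimPipePrefix` is a hypothetical helper name and the pipe path is only an example value; the real dialer may be structured differently.

```go
package main

import (
	"fmt"
	"strings"
)

// trimPipePrefix strips a leading "npipe://" scheme from an address and
// leaves addresses without that prefix untouched.
func trimPipePrefix(address string) string {
	return strings.TrimPrefix(address, "npipe://")
}

func main() {
	// Example address only; the pipe name is an assumption.
	fmt.Println(trimPipePrefix("npipe:////./pipe/containerd-containerd"))
}
```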
Signed-off-by: Anthony Nandaa <profnandaa@gmail.com>
Propagate parent distribution source labels to each of its children even
if they're not missing. This allows cross-repo mounting of blobs when the
child content has a different distribution source label from its
parent manifest/index. This can happen when different parts of an image
were fetched from different sources.
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
Close the connection if there is no more data. This fixes a false alert raised by the image
pull progress reporter.
```
dst = OpenWriter (--> Content Store)
src = Fetch
Open (--> Registry)
Mark it as active request
Copy(dst, src) (--> Keep updating total received bytes)
^
| (Active Request > 0, but total received bytes won't be updated)
v
defer src.Close()
content.Commit(dst)
```
Before migrating to the transfer service, the CRI plugin doesn't limit global
concurrent downloads for ImagePulls. Each ImagePull request has 3 concurrent
goroutines to download blobs and 1 goroutine to unpack blobs. On filesystems like
ext4 [1][1], the fsync from content.Commit may sync unrelated dirty pages
to disk. When the host is under IO pressure, content.Commit
can take a long time and block other goroutines. If httpreadseeker
doesn't close the connection after io.EOF, this connection will be
considered active. The pull progress reporter then reports that no
bytes have been transferred and cancels the ImagePull.
The original 1-minute timeout[2][2] is from a kubelet setting. Since the CRI plugin
can't limit the total concurrent downloads, this patch raises the timeout from 1 minute
to 5 minutes to prevent unexpected cancellation.
[1]: https://lwn.net/Articles/842385/
[2]: https://github.com/kubernetes/kubernetes/blob/release-1.23/pkg/kubelet/config/flags.go#L45-L48
Signed-off-by: Wei Fu <fuweid89@gmail.com>
When the progress reporter wakes up just after a new active request has been
filed, no bytes have been read yet. If timeout / 2 is less than
minPullProgressReportInternal, it's easy to raise a false alert.
We should remove the minPullProgressReportInternal limit.
Fixes: #8024
Signed-off-by: Wei Fu <fuweid89@gmail.com>
Adds debug message per layer unpacking and adds duration field to
the existing image unpacking debug message.
Signed-off-by: Austin Vazquez <macedonv@amazon.com>
Upgrade OpenTelemetry core libs to v1.19.0 and contrib (for gRPC
tracing) to v0.45.0.
The OpenTelemetry internal module structure/dependency graph is
rather complex, and recently some parts (e.g. metrics) have
graduated to "stable" from "unstable", so this upgrade is important
to unblock downstream projects to be able to use newer versions of
the OpenTelemetry libraries, as they can cause compatibility issues
due to internal/peer dependency changes otherwise.
Hopefully, future updates won't be as problematic, such that projects
using containerd as a dependency will be able to use newer versions
of the libraries in a compatible fashion.
Signed-off-by: Milas Bowman <milas.bowman@docker.com>
These logs were already using structured logs, so include "id" as a field,
which also prevents the id being quoted (and escaped when printing);
time="2023-11-15T11:30:23.745574884Z" level=info msg="loading plugin \"io.containerd.internal.v1.shutdown\"..." runtime=io.containerd.runc.v2 type=io.containerd.internal.v1
time="2023-11-15T11:30:23.745612425Z" level=info msg="loading plugin \"io.containerd.ttrpc.v1.pause\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
time="2023-11-15T11:30:23.745620884Z" level=info msg="loading plugin \"io.containerd.event.v1.publisher\"..." runtime=io.containerd.runc.v2 type=io.containerd.event.v1
time="2023-11-15T11:30:23.745625925Z" level=info msg="loading plugin \"io.containerd.ttrpc.v1.task\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
Also updated some changed `WithError().WithField()` calls, to prevent some
overhead.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
To break the cyclic dependency between the cri plugin and the podsandbox plugin,
we define a new plugin type, SandboxesServicePlugin. When cri inits
its own client, it adds all the controllers by getting them from
the SandboxesServicePlugin.
When the podsandbox controller inits its client, it will not Require the
SandboxesServicePlugin.
Signed-off-by: Abel Feng <fshb1988@gmail.com>
- full diff: https://github.com/opencontainers/runc/compare/v1.1.9...v1.1.10
- release notes: https://github.com/opencontainers/runc/releases/tag/v1.1.10
This is the tenth (and most likely final) patch release in the 1.1.z
release branch of runc. It mainly fixes a few issues in cgroups, and a
umask-related issue in tmpcopyup.
- Add support for `hugetlb.<pagesize>.rsvd` limiting and accounting.
Fixes the issue of postgres failing when hugepage limits are set.
- Fixed permissions of a newly created directories to not depend on the value
of umask in tmpcopyup feature implementation.
- libcontainer: cgroup v1 GetStats now ignores missing `kmem.limit_in_bytes`
(fixes the compatibility with Linux kernel 6.1+).
- Fix a semi-arbitrary cgroup write bug when given a malicious hugetlb
configuration. This issue is not a security issue because it requires a
malicious config.json, which is outside of our threat model.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
TestUpgrade downloads the latest binaries of the previous release,
uses them to set up pods, and then uses the current release to recover the
existing pods.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
This is effectively a revert of 2ac9968401, which
switched from os/exec to the golang.org/x/sys/execabs package to mitigate
security issues (mainly on Windows) with lookups resolving to binaries in the
current directory.
from the go1.19 release notes https://go.dev/doc/go1.19#os-exec-path
> ## PATH lookups
>
> Command and LookPath no longer allow results from a PATH search to be found
> relative to the current directory. This removes a common source of security
> problems but may also break existing programs that depend on using, say,
> exec.Command("prog") to run a binary named prog (or, on Windows, prog.exe) in
> the current directory. See the os/exec package documentation for information
> about how best to update such programs.
>
> On Windows, Command and LookPath now respect the NoDefaultCurrentDirectoryInExePath
> environment variable, making it possible to disable the default implicit search
> of “.” in PATH lookups on Windows systems.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The PR https://github.com/containerd/containerd/pull/8198 fixed this for CRI but missed clearing the commandline in the forked SB server. This simply adds that back in.
Signed-off-by: James Sturtevant <jsturtevant@gmail.com>
Protobuf will automatically put the files generated for a v2 module into
a v2 directory. Move them to their correct location after running the
protobuild.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Upgrade google.golang.org/grpc to v1.58.3 in preparation for
upgrading OTel, which has a dependency on the latest version.
See also: containerd/containerd#9281.
Signed-off-by: Milas Bowman <milas.bowman@docker.com>
When the HTTP fallback is used, the scheme changes from HTTPS to HTTP
which can cause a mismatch on redirect, causing the authorizer to get
stripped out. Since the redirect host must match the redirect host in
this case, credentials are only sent to the same origin host that
returned the redirect.
This fixes an issue for a push getting a 401 unauthorized on the PUT
request even though credentials are available.
Signed-off-by: Derek McGowan <derek@mcg.dev>
The Server rpc in introspection service is extended to expose
deprecation warnings based on observed feature use in containerd.
Signed-off-by: Samuel Karp <samuelkarp@google.com>
This package enumerates the known deprecations in the current version of
containerd. New deprecations should be added here, and old ones
removed.
Signed-off-by: Samuel Karp <samuelkarp@google.com>
The TLS fallback should only be used when the protocol is ambiguous due
to provided TLS configurations and defaulting to http. Do not add TLS
configurations when defaulting to http. When the port is 80 or will be
defaulted to 80, there is no protocol ambiguity and TLS fallback should
not be used.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Avoid calling out to the client to get a sandbox controller and instead
setup the list of controllers on initialization. This fixes a test
failure which does not set the client.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Enhance cri/server/image/imagefs_info.go:ImageFsInfo() to support
snapshotter per runtime. Now `ImageFsInfoResponse.ImageFilesystems` may
contain multiple entries.
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
full diff: https://github.com/golang/text/compare/v0.13.0...v0.17.0
This fixes the same CVE as go1.21.3 and go1.20.10;
- net/http: rapid stream resets can cause excessive work
A malicious HTTP/2 client which rapidly creates requests and
immediately resets them can cause excessive server resource consumption.
While the total number of requests is bounded to the
http2.Server.MaxConcurrentStreams setting, resetting an in-progress
request allows the attacker to create a new request while the existing
one is still executing.
HTTP/2 servers now bound the number of simultaneously executing
handler goroutines to the stream concurrency limit. New requests
arriving when at the limit (which can only happen after the client
has reset an existing, in-flight request) will be queued until a
handler exits. If the request queue grows too large, the server
will terminate the connection.
This issue is also fixed in golang.org/x/net/http2 v0.17.0,
for users manually configuring HTTP/2.
The default stream concurrency limit is 250 streams (requests)
per HTTP/2 connection. This value may be adjusted using the
golang.org/x/net/http2 package; see the Server.MaxConcurrentStreams
setting and the ConfigureServer function.
This is CVE-2023-39325 and Go issue https://go.dev/issue/63417.
This is also tracked by CVE-2023-44487.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
server: prohibit more than MaxConcurrentStreams handlers from running at once
(CVE-2023-44487).
In addition to this change, applications should ensure they do not leave running
tasks behind related to the RPC before returning from method handlers, or should
enforce appropriate limits on any such work.
- https://github.com/grpc/grpc-go/compare/v1.57.0...v1.57.1
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
ShimManager.Start() will call loadShim() to get the existing shim if SandboxID
is specified for a container, but shimTask.PID() is called in loadShim,
which will call Connect() of the Task API with the ID of a task that is not
created yet (containerd is getting the shim and Task API address to call
Create, so the task is not created yet).
In this commit we change the logic of loadShim() to get the shim without calling
Connect() with the ID of the not-yet-created container.
Signed-off-by: Abel Feng <fshb1988@gmail.com>
When calling the sandbox controller to create a sandbox, we change the param from
the sandbox id to the whole sandbox object to give all information to the controller,
so that the sandbox controller does not rely on the sandbox store anymore.
This decouples the sandbox controller plugin inside
containerd, and it is necessary for remote sandbox controller plugins, as
they are not able to get the sandbox from the sandbox store.
Signed-off-by: Abel Feng <fshb1988@gmail.com>
Make containerd extensible to support more sandbox controllers
registered into containerd by config.
We change the default sandbox controller plugin's name from "local" to "shim"
to make sure we can get the controller by the plugin name it registered into
containerd.
Signed-off-by: Abel Feng <fshb1988@gmail.com>
As we are going to support more kinds of sandboxers, we have to tell
containerd which sandboxer is used to manipulate a specific sandbox.
Signed-off-by: Abel Feng <fshb1988@gmail.com>
LOOP_CONFIGURE is a new ioctl that is a lot faster than
the LOOP_SET_FD+LOOP_SET_STATUS64 calls
Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
Before snapshotter per runtime, CRI only supported a global snapshotter,
so a snapshot could be uniquely identified by `snapshot_key`. With snapshotter
per runtime enabled, there may be multiple snapshotters used by CRI, so only
(snapshotter_id, snapshot_key) can uniquely identify a snapshot.
Also extends CRI/store/snapshot/Store to support multiple snapshotters.
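The composite key idea can be sketched with a struct-keyed map. The type and method names below (`snapshotKey`, `store`, `add`, `get`) are hypothetical and do not mirror the actual CRI snapshot store API.

```go
package main

import "fmt"

// snapshotKey identifies a snapshot by snapshotter and key together,
// since the key alone is no longer unique across snapshotters.
type snapshotKey struct {
	Snapshotter string
	Key         string
}

// store is a hypothetical in-memory index of snapshot sizes.
type store struct {
	sizes map[snapshotKey]uint64
}

func newStore() *store {
	return &store{sizes: make(map[snapshotKey]uint64)}
}

// add records the size of a snapshot under its composite key.
func (s *store) add(snapshotter, key string, size uint64) {
	s.sizes[snapshotKey{Snapshotter: snapshotter, Key: key}] = size
}

// get looks up a snapshot by its composite key.
func (s *store) get(snapshotter, key string) (uint64, bool) {
	size, ok := s.sizes[snapshotKey{Snapshotter: snapshotter, Key: key}]
	return size, ok
}

func main() {
	s := newStore()
	// The same snapshot key under two snapshotters stays distinct.
	s.add("overlayfs", "sha256:abc", 10)
	s.add("devmapper", "sha256:abc", 20)
	fmt.Println(s.get("overlayfs", "sha256:abc"))
	fmt.Println(s.get("devmapper", "sha256:abc"))
}
```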
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
We, containerd, suggest that users use binary plugins or RPC-based plugins.
Since the go plugin has too many restrictions, I'm not sure how many users
use the go plugin to extend the core functionality in production.
Given that we put a lot of effort into making external plugins
better, I suggest deprecating the go-plugin type plugin in v2.0 and removing it
in v2.1.
REF: https://github.com/containerd/containerd/pull/556
Signed-off-by: Wei Fu <fuweid89@gmail.com>
When a credential handler is provided but no basic auth credentials
are provided, handle the error specifically rather than treating
the credentials as not implemented. This allows a clearer error to
be provided to users rather than a confusing not implemented error
or generic unauthorized error.
Add unit tests for the basic auth case.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Remove containerd specific parts of the plugin package to prepare its
move out of the main repository. Separate the plugin registration
singleton into a separate package.
Separating out the plugin package and registration makes it easier to
implement external plugins without creating a dependency loop.
Signed-off-by: Derek McGowan <derek@mcg.dev>
The plugins package defines the plugins used by containerd.
Move all the types and properties to this package.
Signed-off-by: Derek McGowan <derek@mcg.dev>
This is a followup to #5890.
The containerd-shim process depends on the mount package to init the rootfs
for a container. For a container with user namespace enabled, the mount
package needs to fork a child process to get a brand-new user namespace.
However, there are two reapers in one process (described by the
following list) and there are race-condition cases.
1. mount package
2. sys.Reaper as the global one, which watches all the SIGCHLD.
=== [kill(2)][kill] the wrong process ===
Currently, we use a pipe to ensure that the child process is alive. However,
the pipe file descriptor can be held by another process, in which case the child
process cannot exit by itself. We should use [kill(2)][kill] to ensure the
child process exits. But we might kill the wrong process if the child process
has been reaped by containerd-shim and the PID has been reused by another
process.
=== [waitid(2)][waitid] on the wrong child process ===
```
containerd-shim process:
Goroutine 1(GetUsernsFD): Goroutine 2(Reaper)
1. Ready to wait for child process X
2. Received SIGCHLD from X
3. Reaped the zombie child process X
(X has been reused by other child process)
4. Wait on process X
The goroutine 1 will be stuck until the process X has been terminated.
```
=== open `/proc/X/ns/user` on the wrong child process ===
There is also pid-reused risk between opening `/proc/$pid/ns/user` and
writing `/proc/$pid/u[g]id_map`.
```
containerd-shim process:
Goroutine 1(GetUsernsFD): Goroutine 2(Reaper)
1. Fork child process X
2. Write /proc/X/uid_map,gid_map
3. Received SIGCHLD from X
4. Reaped the zombie child process X
(X has been reused by other process)
5. Open /proc/X/ns/user file as usernsFD
The usernsFD links to the wrong X!!!
```
In order to fix the race-condition, we should use [CLONE_PIDFD][clone2] (Since
Linux v5.2).
When we fork child process `X`, the kernel will return a process file
descriptor `X_PIDFD` referencing child process `X`. With the pidfd, we can
use [pidfd_send_signal(2)][pidfd_send_signal] (Since Linux v5.1)
to send signal(0) to ensure the child process `X` is alive. If `X` has
terminated and its PID has been recycled for another process,
pidfd_send_signal fails with the error ESRCH.
Therefore, we can open the `/proc/X/{ns/user,uid_map,gid_map}` file
descriptors first and then use pidfd_send_signal to check that the process
is still alive. If so, we can ensure the file descriptors are valid and
reference the child process `X`. Even if the `X` PID has been reused
after the pidfd_send_signal call, the file descriptors are still valid.
```code
X, pidfd = clone2(CLONE_PIDFD)
usernsFD = open /proc/X/ns/user
uidmapFD = open /proc/X/uid_map
gidmapFD = open /proc/X/gid_map
pidfd_send_signal pidfd, signal(0)
return err if no such process
== When we arrive here, we can ensure usernsFD/uidmapFD/gidmapFD are correct
== even if X has been reused after pidfd_send_signal call.
update uid/gid mapping by uidmapFD/gidmapFD
return usernsFD
```
And the [waitid(2)][waitid] also supports pidfd type (Since Linux 5.4).
We can use pidfd type waitid to ensure we are waiting for the correct
process. All the PID related race-condition issues can be resolved by
pidfd.
```bash
➜ mount git:(followup-idmapped) pwd
/home/fuwei/go/src/github.com/containerd/containerd/mount
➜ mount git:(followup-idmapped) sudo go test -test.root -run TestGetUsernsFD -count=1000 -failfast -p 100 ./...
PASS
ok github.com/containerd/containerd/mount 3.446s
```
[kill]: <https://man7.org/linux/man-pages/man2/kill.2.html>
[clone2]: <https://man7.org/linux/man-pages/man2/clone.2.html>
[pidfd_send_signal]: <https://man7.org/linux/man-pages/man2/pidfd_send_signal.2.html>
[waitid]: <https://man7.org/linux/man-pages/man2/waitid.2.html>
Signed-off-by: Wei Fu <fuweid89@gmail.com>
The point of this test is to see that we successfully can get all of
the pids running in the container and they match the number expected,
but for Windows this concept is a bit different. Windows containers
essentially go through the usermode boot phase of the operating system,
and have quite a few processes and system services running outside of
the "init" process you specify. Because of this, there's not a great
way to say "there should only be N processes running" like we can ensure
for Linux. So, on Windows check that we're at least greater than one.
Signed-off-by: Danny Canter <danny@dcantah.dev>
go1.21.3 (released 2023-10-10) includes a security fix to the net/http package.
See the Go 1.21.3 milestone on our issue tracker for details:
https://github.com/golang/go/issues?q=milestone%3AGo1.21.3+label%3ACherryPickApproved
full diff: https://github.com/golang/go/compare/go1.21.2...go1.21.3
From the security mailing:
[security] Go 1.21.3 and Go 1.20.10 are released
Hello gophers,
We have just released Go versions 1.21.3 and 1.20.10, minor point releases.
These minor releases include 1 security fixes following the security policy:
- net/http: rapid stream resets can cause excessive work
A malicious HTTP/2 client which rapidly creates requests and
immediately resets them can cause excessive server resource consumption.
While the total number of requests is bounded to the
http2.Server.MaxConcurrentStreams setting, resetting an in-progress
request allows the attacker to create a new request while the existing
one is still executing.
HTTP/2 servers now bound the number of simultaneously executing
handler goroutines to the stream concurrency limit. New requests
arriving when at the limit (which can only happen after the client
has reset an existing, in-flight request) will be queued until a
handler exits. If the request queue grows too large, the server
will terminate the connection.
This issue is also fixed in golang.org/x/net/http2 v0.17.0,
for users manually configuring HTTP/2.
The default stream concurrency limit is 250 streams (requests)
per HTTP/2 connection. This value may be adjusted using the
golang.org/x/net/http2 package; see the Server.MaxConcurrentStreams
setting and the ConfigureServer function.
This is CVE-2023-39325 and Go issue https://go.dev/issue/63417.
This is also tracked by CVE-2023-44487.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
go1.21.2 (released 2023-10-05) includes one security fix to the cmd/go package,
as well as bug fixes to the compiler, the go command, the linker, the runtime,
and the runtime/metrics package. See the Go 1.21.2 milestone on our issue
tracker for details:
https://github.com/golang/go/issues?q=milestone%3AGo1.21.2+label%3ACherryPickApproved
full diff: https://github.com/golang/go/compare/go1.21.1...go1.21.2
From the security mailing:
[security] Go 1.21.2 and Go 1.20.9 are released
Hello gophers,
We have just released Go versions 1.21.2 and 1.20.9, minor point releases.
These minor releases include 1 security fixes following the security policy:
- cmd/go: line directives allows arbitrary execution during build
"//line" directives can be used to bypass the restrictions on "//go:cgo_"
directives, allowing blocked linker and compiler flags to be passed during
compilation. This can result in unexpected execution of arbitrary code when
running "go build". The line directive requires the absolute path of the file in
which the directive lives, which makes exploiting this issue significantly more
complex.
This is CVE-2023-39323 and Go issue https://go.dev/issue/63211.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Both pigz and igzip can be disabled via environment variables.
If disabled, calling exec.LookPath and logging a "not found" message,
even at the debug level, doesn't make much sense.
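A hedged sketch of the ordering this implies: consult the disable variable first and only then search PATH. The helper names `isDisabled` and `detectCommand` are hypothetical, the environment variable name in `main` is an assumption, and treating the value as a boolean via `strconv.ParseBool` is also an assumption about the format.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
)

// isDisabled interprets an environment variable value as a boolean.
// Empty or unparsable values count as "not disabled".
func isDisabled(value string) bool {
	disabled, err := strconv.ParseBool(value)
	return err == nil && disabled
}

// detectCommand returns the path of the named binary, unless the given
// environment variable disables it. When disabled, PATH is never searched,
// so there is nothing to log about a missing binary.
func detectCommand(name, disableEnv string) (string, bool) {
	if isDisabled(os.Getenv(disableEnv)) {
		return "", false
	}
	path, err := exec.LookPath(name)
	if err != nil {
		return "", false
	}
	return path, true
}

func main() {
	// The variable name below is an assumption used for illustration.
	path, ok := detectCommand("pigz", "CONTAINERD_DISABLE_PIGZ")
	fmt.Println(path, ok)
}
```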
Signed-off-by: Kazuyoshi Kato <kaz@fly.io>
Intel ISA-L is Intel's open source (BSD) library that outperforms both
gzip and pigz. This commit checks and uses igzip if available.
Signed-off-by: Kazuyoshi Kato <kaz@fly.io>
Windows Containers have a default path already configured at bootup. WithDefaultPathEnv overwrites this with a unix path
Signed-off-by: charitykathure <kathurecharity505@gmail.com>
`MountedFrom` was prefixed with the whole target repository instead of
just the registry hostname.
Also adjust the test cases to use the registry hostname.
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
Pass the passed in context into some nested function calls, wrap
errors instead of %+v, and change some tests to strictly just test
for an error and not an exact error.
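To illustrate why wrapping matters, the sketch below contrasts `%w` with `%+v` formatting. All names (`errNotFound`, `lookup`, `flatten`, `isNotFound`) are hypothetical and not taken from the changed code.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel error.
var errNotFound = errors.New("not found")

// lookup wraps the sentinel with %w, which keeps the error chain intact.
func lookup(id string) error {
	return fmt.Errorf("lookup %q: %w", id, errNotFound)
}

// flatten formats the error with %+v, which keeps only the text and
// drops the chain.
func flatten(err error) error {
	return fmt.Errorf("lookup failed: %+v", err)
}

// isNotFound reports whether errNotFound is anywhere in the chain.
func isNotFound(err error) bool {
	return errors.Is(err, errNotFound)
}

func main() {
	err := lookup("abc")
	fmt.Println(isNotFound(err))
	fmt.Println(isNotFound(flatten(err)))
}
```

Tests that only assert that an error occurred stay valid when the message text changes, which fits the relaxed test checks mentioned above.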
Signed-off-by: Danny Canter <danny@dcantah.dev>
This silences govulncheck detecting
https://pkg.go.dev/vuln/GO-2023-1988.
containerd does not directly use x/net
Signed-off-by: Kern Walster <walster@amazon.com>
When an endpoint is configured for http and has a tls configuration,
always try the tls connection and fall back to http when the tls
connection fails because it received an http response. This fixes an issue
with default localhost endpoints which get defaulted to http with
insecure tls also configured but are using tls.
Signed-off-by: Derek McGowan <derek@mcg.dev>
After pr #8617, create handler of containerd-shim-runc-v2 will
call handleStarted() to record the init process and handle its exit.
Init process wouldn't quit so early in normal circumstances. But if
this scenario occurs, handleStarted() will call
handleProcessExit(), which will cause deadlock because create() had
acquired s.mu, and handleProcessExit() will try to lock it again.
So, I added a parameter muLocked to handleStarted to indicate whether
or not s.mu is currently locked, and thus deciding whether or not to
lock it when calling handleProcessExit.
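The pattern can be sketched as below. The `service` type and its methods are simplified, hypothetical stand-ins for the shim code and only show how a `muLocked` flag avoids re-locking a non-reentrant `sync.Mutex`.

```go
package main

import (
	"fmt"
	"sync"
)

// service is a simplified stand-in for the shim service.
type service struct {
	mu     sync.Mutex
	exited []int
}

// handleStarted records an early exit. muLocked tells it whether the
// caller already holds s.mu, so the mutex is never locked twice.
func (s *service) handleStarted(pid int, alreadyExited, muLocked bool) {
	if !alreadyExited {
		return
	}
	if !muLocked {
		s.mu.Lock()
		defer s.mu.Unlock()
	}
	s.exited = append(s.exited, pid)
}

// create holds s.mu for its whole body, so it passes muLocked=true.
func (s *service) create(pid int, alreadyExited bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.handleStarted(pid, alreadyExited, true)
}

// exitedCount returns the number of recorded exits.
func (s *service) exitedCount() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.exited)
}

func main() {
	s := &service{}
	s.create(1, true)              // caller holds the lock
	s.handleStarted(2, true, false) // caller does not hold the lock
	fmt.Println(s.exitedCount())
}
```

Passing `false` from `create` in this sketch would block forever on the second `Lock`, which is the deadlock described above.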
Fix: #9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
Previous code has already called `getContainer()`, just pass it into
`s.getContainerPids` to reduce unnecessary lock and map lookup.
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
`NewCRIService()` may easily fail and its error has to be ignored
unless the CRI plugin is in the `required_plugins` list.
Now this has to be called before `RegisterReadiness()`, as
PR 9153 "Require plugins to succeed after registering readiness"
was merged on 2023-09-29.
Fix issue 9163: `[Regression in main (2023-09-29)]: containerd-rootless.sh doesn't start up`
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
This migrates uses of github.com/opencontainers/runc/libcontainer/user
to the new github.com/moby/sys/user module, which was extracted from
runc at commit [opencontainers/runc@a3a0ec48c4].
This is the initial release of the module, which is a straight copy, but
some changes may be made in the next release (such as fixing camel-casing
in some fields and functions, e.g. Uid -> UID).
[opencontainers/runc@a3a0ec48c4]: a3a0ec48c4
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
When readiness is registered on initialization, the plugin must not
fail. When such a plugin fails, containerd will hang on the readiness
condition.
Signed-off-by: Derek McGowan <derek@mcg.dev>
crun 1.4.3 as well as runc 1.1 both support opening bind-mounts before
dropping privileges, as they are inaccessible after switching to the
user namespace. So that is the minimum version to use with containerd
1.7.
Also, since containerd 2.0 we use idmap mounts for files mounted in the
container created by containerd (like etc/hostname, etc/hosts, etc.), so
in that case we require newer OCI runtimes too. However, as the kubelet
doesn't request idmap mounts for kube volumes, we can lower the kernel
version.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Userns requires idmap mounts or to opt-in for a slow and expensive
chown. As idmap mounts support for overlayfs was merged in 5.19, let's
add the slow_chown config for our CI.
The config is harmless to keep on newer kernels: if idmap mounts are
supported, they will simply be used. Whenever all our CI is run with kernels
>= 5.19, we can remove this setting.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
If we don't use idmap mounts, doing a chown per pod is very expensive:
it implies duplicating the container storage for the image for every pod
and the latency to start a new pod is affected too.
Let's make sure users are aware of this by having them opt in for
snapshotters where we have a better solution (like overlayfs, which
supports idmap mounts).
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Deprecate the package, but suppress linting errors for now. This is to allow
backporting these changes to release branches, which may still need to transition.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This "soft" deprecates the package, but keeps the local uses of the package,
which can make backporting this to release-branches easier (we can
still move all uses in those branches as well though).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
While this is not strictly necessary as the default OCI config masks this
path, it is possible that the user disabled path masking, passed their
own list, or is using a forked (or future) daemon version that has a
modified default config/allows changing the default config.
Add some defense-in-depth by also masking out this problematic hardware
device with the AppArmor LSM.
Signed-off-by: Bjorn Neergaard <bjorn.neergaard@docker.com>
The ability to read these files may offer a power-based sidechannel
attack against any workloads running on the same kernel.
This was originally [CVE-2020-8694][1], which was fixed in
[949dd0104c496fa7c14991a23c03c62e44637e71][2] by restricting read access
to root. However, since many containers run as root, this is not
sufficient for our use case.
While untrusted code should ideally never be run, we can add some
defense in depth here by masking out the device class by default.
[Other mechanisms][3] to access this hardware exist, but they should not
be accessible to a container due to other safeguards in the
kernel/container stack (e.g. capabilities, perf paranoia).
[1]: https://nvd.nist.gov/vuln/detail/CVE-2020-8694
[2]: 949dd0104c
[3]: https://web.eece.maine.edu/~vweaver/projects/rapl/
Signed-off-by: Bjorn Neergaard <bjorn.neergaard@docker.com>
When a shim process is unexpectedly killed in a way that was not initiated through containerd - containerd reports the pod as not ready but the containers as running. This results in kubelet repeatedly sending container kill requests that fail since containerd cannot connect to the shim.
Changes:
- In the container exit handler, treat `err: Unavailable` as if the container has already exited out
- When attempting to get a connection to the shim, if the controller isn't available assume that the shim has been killed (needs to be done since we have a separate exit handler that cleans up the reference to the shim controller - before kubelet has the chance to call StopPodSandbox)
Signed-off-by: Aditya Ramani <a_ramani@apple.com>
Remove `LimitNOFILE` from `containerd.service` to rely on the systemd v240 implicit default of `1024:524288`. On supported platforms with systemd prior to v240, packagers will patch the service with an explicit `LimitNOFILE=1024:524288`.
- `1024` soft limit is an implicit default, avoiding unexpected breakage. Software that needs a higher limit should request to raise the soft limit for its process.
- `524288` hard limit is an implicit default since systemd v240 and is adequate for most processes (_half of the historical limit from `fs.nr_open` of `1048576`_), while 4096 is the implicit default from the kernel (often too low).
- The hard limit may not exceed `fs.nr_open` (_which a value of `infinity` will resolve to_). On most systems with systemd v240 or newer, this will resolve to an excessive size of 2^30 (over 1 billion).
- When set to `infinity` (usually as the soft limit) software may experience significantly increased resource usage, resulting in a performance regression or runtime failures that are difficult to troubleshoot.
Signed-off-by: Brennan Kinney <5098581+polarathene@users.noreply.github.com>
crun 1.9 was just released with fixes and exposes idmap mounts support
via the "features" sub-command.
We use that feature to throw a clear error to users (if they request
idmap mounts and the OCI runtime doesn't support it), but also to skip
tests on CI when the OCI runtime doesn't support it.
Let's bump it so the CI runs the tests with crun.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
runc, as mandated by the runtime-spec, ignores unknown fields in the
config.json. This is unfortunate for cases where we _must_ enable that
feature or fail.
For example, if we want to start a container with user namespaces and
volumes, using the uidMappings/gidMappings field is needed so the
UID/GIDs in the volume don't end up with garbage. However, if we don't
fail when runc will ignore these fields (because they are unknown to
runc), we will just start a container without using the mappings and the
UID/GIDs the container persists to volumes will be the host UID/GID, which
can change if the container is re-scheduled by Kubernetes.
This will end up in volumes having "garbage" and unmapped UIDs that the
container can no longer change. So, let's avoid this entirely by just
checking that runc supports idmap mounts if the container we are about
to create needs them.
Please note that the "runc features" subcommand is only run when we are
using idmap mounts. If idmap mounts are not used, the subcommand is not
run and therefore this should not affect containers that don't use idmap
mounts in any way.
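A rough sketch of such a check is shown below. It only parses a hand-written sample document rather than invoking the runtime, and the JSON field names (`linux.mountExtensions.idmap.enabled`) reflect my reading of the runtime-spec "features" document; treat them, and the `supportsIDMap` helper, as assumptions rather than the code in this change.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// features mirrors only the part of a runtime's "features" JSON needed here.
// The field names are an assumption based on the runtime-spec features
// document, not copied from containerd.
type features struct {
	Linux *struct {
		MountExtensions *struct {
			IDMap *struct {
				Enabled *bool `json:"enabled"`
			} `json:"idmap"`
		} `json:"mountExtensions"`
	} `json:"linux"`
}

// supportsIDMap reports whether the features document advertises idmap
// mounts. A missing field is treated as "not supported", which is the safe
// answer for runtimes that predate the field.
func supportsIDMap(raw []byte) (bool, error) {
	var f features
	if err := json.Unmarshal(raw, &f); err != nil {
		return false, fmt.Errorf("parsing runtime features: %w", err)
	}
	if f.Linux == nil || f.Linux.MountExtensions == nil ||
		f.Linux.MountExtensions.IDMap == nil ||
		f.Linux.MountExtensions.IDMap.Enabled == nil {
		return false, nil
	}
	return *f.Linux.MountExtensions.IDMap.Enabled, nil
}

func main() {
	// Hand-written sample input, not captured from a real runtime.
	sample := []byte(`{"linux":{"mountExtensions":{"idmap":{"enabled":true}}}}`)
	ok, err := supportsIDMap(sample)
	fmt.Println(ok, err)

	// A document without the field: fail early instead of silently
	// starting a container without the mappings.
	ok, err = supportsIDMap([]byte(`{}`))
	fmt.Println(ok, err)
}
```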
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
These utilities resulted in the platforms package having the containerd
API as a dependency. As this package is used in many parts of the code, as
well as external consumers, we should try to keep it light on dependencies,
with the potential to make it a standalone module.
These utilities were added in f3b7436b61,
which has not yet been included in a release, so skipping deprecation
and aliases for these.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This cleans up the platforms package from dependencies that are not strictly
needed. This is in preparation of making this package a separate module, which
can be shared by plugins, and containerd versions (as well as external consumers):
- Remove dependency on the errdefs package: most uses of these error
definitions were used internally, and other errors may not be useful
for external consumers as sentinel errors.
- ErrInvalidArgument may be a potential exception, although a look at
current uses of this package shows that there's no special handling of
invalid parameters vs other errors (all would boil down to "the passed
platform is invalid" (either the format, or parsing is not implemented
on a specific platform))
- Remove uses of the convenience "Platform" alias in favor of using the
upstream (from OCI spec). Consumers of this package can still use the
convenience alias, but make sure that function signatures do not imply
that it's a different type (which can cause confusion).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
When the kubelet sends the uid/gid mappings for a mount, just pass them
down to the OCI runtime.
OCI runtimes support this since runc 1.2 and crun 1.8.1.
And whenever we add mounts (container mounts or image spec volumes) and
userns are requested by the kubelet, we use those mappings in the mounts
so the mounts are idmapped correctly. If no userns is used, we don't
send any mappings which just keeps the current behavior.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
The `cri-containerd-*.tar.gz` release bundles have been deprecated
since containerd v1.6.
These bundles are no longer created in the CI, however, the
corresponding Makefile targets are still kept, as they are still used by
external CIs.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
This adds a new WithLabel function, which allows setting a single label on
a lease, without having to first construct an intermediate map[string]string.
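A self-contained sketch of the idea, using local stand-in types (the `Lease`, `Opt`, `WithLabels` and `newLease` definitions below are illustrative assumptions and may not match the actual leases package):

```go
package main

import "fmt"

// Lease and Opt are local stand-ins for this sketch.
type Lease struct {
	ID     string
	Labels map[string]string
}

type Opt func(*Lease) error

// WithLabels merges a whole map of labels into the lease.
func WithLabels(labels map[string]string) Opt {
	return func(l *Lease) error {
		if l.Labels == nil {
			l.Labels = map[string]string{}
		}
		for k, v := range labels {
			l.Labels[k] = v
		}
		return nil
	}
}

// WithLabel sets a single label, so callers do not need to build an
// intermediate map[string]string for the common one-label case.
func WithLabel(key, value string) Opt {
	return func(l *Lease) error {
		if l.Labels == nil {
			l.Labels = map[string]string{}
		}
		l.Labels[key] = value
		return nil
	}
}

// newLease is a hypothetical constructor that applies the options in order.
func newLease(id string, opts ...Opt) (Lease, error) {
	l := Lease{ID: id}
	for _, o := range opts {
		if err := o(&l); err != nil {
			return Lease{}, err
		}
	}
	return l, nil
}

func main() {
	l, err := newLease("example", WithLabel("purpose", "demo"))
	fmt.Println(l.Labels["purpose"], err)
}
```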
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
- don't define a type, but just an ad-hoc struct
- use a single slice with test-cases; this allows IDE's to pick up the
table as a test-table (which allows (re-)running individual tests)
- make use of testify's assert.Equal to compare the results, instead
of a DIY loop over the expected values.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
It says: The prefix path **must be absolute, have all symlinks resolved, and cleaned**. But those requirements are violated in lots of places.
What happens when it is given a non-canonicalized path is that `mountinfo.GetMounts` will not find mounts.
The trivial case is:
```
$ mkdir a && ln -s a b && mkdir b/c b/d && mount --bind b/c b/d && cat /proc/mounts | grep -- '[ab]/d'
/dev/sdd3 /home/user/a/d ext4 rw,noatime,discard 0 0
```
We asked to bind-mount b/c to b/d, but ended up with mount in a/d.
So, mount table always contains canonicalized mount points, and it is an error to look for non-canonicalized paths in it.
Signed-off-by: Marat Radchenko <marat@slonopotamus.org>
The ip.To16() function returns non-nil if `ip` is any kind
of IP address, including IPv4. To look for IPv6 specifically,
use ip.To4() == nil.
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
go1.21.1 (released 2023-09-06) includes four security fixes to the cmd/go,
crypto/tls, and html/template packages, as well as bug fixes to the compiler,
the go command, the linker, the runtime, and the context, crypto/tls,
encoding/gob, encoding/xml, go/types, net/http, os, and path/filepath packages.
See the Go 1.21.1 milestone on our issue tracker for details:
https://github.com/golang/go/issues?q=milestone%3AGo1.21.1+label%3ACherryPickApproved
full diff: https://github.com/golang/go/compare/go1.21.0...go1.21.1
From the security mailing:
[security] Go 1.21.1 and Go 1.20.8 are released
Hello gophers,
We have just released Go versions 1.21.1 and 1.20.8, minor point releases.
These minor releases include 4 security fixes following the security policy:
- cmd/go: go.mod toolchain directive allows arbitrary execution
The go.mod toolchain directive, introduced in Go 1.21, could be leveraged to
execute scripts and binaries relative to the root of the module when the "go"
command was executed within the module. This applies to modules downloaded using
the "go" command from the module proxy, as well as modules downloaded directly
using VCS software.
Thanks to Juho Nurminen of Mattermost for reporting this issue.
This is CVE-2023-39320 and Go issue https://go.dev/issue/62198.
- html/template: improper handling of HTML-like comments within script contexts
The html/template package did not properly handle HTML-like "<!--" and "-->"
comment tokens, nor hashbang "#!" comment tokens, in <script> contexts. This may
cause the template parser to improperly interpret the contents of <script>
contexts, causing actions to be improperly escaped. This could be leveraged to
perform an XSS attack.
Thanks to Takeshi Kaneko (GMO Cybersecurity by Ierae, Inc.) for reporting this
issue.
This is CVE-2023-39318 and Go issue https://go.dev/issue/62196.
- html/template: improper handling of special tags within script contexts
The html/template package did not apply the proper rules for handling occurrences
of "<script", "<!--", and "</script" within JS literals in <script> contexts.
This may cause the template parser to improperly consider script contexts to be
terminated early, causing actions to be improperly escaped. This could be
leveraged to perform an XSS attack.
Thanks to Takeshi Kaneko (GMO Cybersecurity by Ierae, Inc.) for reporting this
issue.
This is CVE-2023-39319 and Go issue https://go.dev/issue/62197.
- crypto/tls: panic when processing post-handshake message on QUIC connections
Processing an incomplete post-handshake message for a QUIC connection caused a panic.
Thanks to Marten Seemann for reporting this issue.
This is CVE-2023-39321 and CVE-2023-39322 and Go issue https://go.dev/issue/62266.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Check that the target tag to be created using `ctr image tag`
is valid and does not contain any forbidden characters.
Signed-off-by: Akhil Mohan <makhil@vmware.com>
This patch introduces idmapped mounts support for
container rootfs.
The idmapped mounts support was merged in Linux kernel 5.12
torvalds/linux@7d6beb7.
This functionality addresses the chown overhead for containers that
use user namespaces.
The changes are based on experimental patchset published by
Mauricio Vásquez #4734.
The current version reimplements support for idmapped mounts in Go.
Performance measurement results:
Image     idmapped mount   recursive chown
BusyBox   00.135           04.964
Ubuntu    00.171           15.713
Fedora    00.143           38.799
Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
Signed-off-by: Artem Kuzin <artem.kuzin@huawei.com>
Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>
Signed-off-by: Ilya Hanov <ilya.hanov@huawei-partners.com>
Previously, only fuse-overlayfs supported the "--remap-labels" option.
Since idmapped mounts landed in Linux kernel v5.12, it is
possible to use them with overlayfs via the mount_setattr() system call.
The changes are based on experimental patchset published by
Mauricio Vásquez #4734.
Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
Signed-off-by: Artem Kuzin <artem.kuzin@huawei.com>
Signed-off-by: Ilya Hanov <ilya.hanov@huawei-partners.com>
Previously remapping of a snapshotter has been done using
recursive chown.
Commit
31a6449734 added support
for the "remap-ids" capability, which allows snapshotter internals to do
the remapping when idmapped mounts are supported, avoiding a recursive
chown and the creation of a new remapped snapshot.
Signed-off-by: Ilya Hanov <ilya.hanov@huawei-partners.com>
Modify the parameter `-Path` to reference a folder, so that `Copy-Item` creates the destination folder.
Remove "-Container:$false", which flattens the folder hierarchy.
Signed-off-by: VERNOU Cédric <1659796+vernou@users.noreply.github.com>
This brings over the enhancement from a506630e57.
We don't expect the systemd state to change while containerd is running,
so we can use a `sync.Once` for this, to prevent stat'ing each time.
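A minimal sketch of the pattern. The `/run/systemd/system` probe is the conventional systemd check and is assumed here; the variable names and the `probeCount` counter are illustrative only.

```go
package main

import (
	"fmt"
	"os"
	"sync"
)

var (
	systemdOnce    sync.Once
	systemdRunning bool
	probeCount     int // only here to show that the probe runs once
)

// isRunningSystemd caches the result of the filesystem probe, so repeated
// calls do not stat the directory again.
func isRunningSystemd() bool {
	systemdOnce.Do(func() {
		probeCount++
		fi, err := os.Lstat("/run/systemd/system")
		systemdRunning = err == nil && fi.IsDir()
	})
	return systemdRunning
}

func main() {
	first := isRunningSystemd()
	second := isRunningSystemd()
	fmt.Println(first == second, probeCount)
}
```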
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
runc considers libcontainer to be "unstable" (not for external use),
so we try not to use it. Commit ed47d6ba76
brought back the dependency on other parts of libcontainer, but looks to
be only depending on a single utility, which in itself was borrowed from
github.com/coreos/go-systemd to not introduce CGO code in the same package.
This patch copies the version from github.com/coreos/go-systemd (adding
proper attribution, although the function is pretty trivial).
runc is in process of moving the libcontainer/user package to an external
module, which means we can remove the dependency on libcontainer entirely
in the near future. There is one more use of `libcontainer` in our vendor
tree; it looks like CDI is depending on one utility (devices.DeviceFromPath);
a943033a8b/vendor/github.com/container-orchestrated-devices/container-device-interface/pkg/cdi/container-edits_unix.go (L38)
We should remove the dependency on that utility, and add a CI check to
prevent bringing it back.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The Go stdlib does not seem to have an efficient os.File.ReadFrom
routine for other platforms like it does on Linux with
copy_file_range. For Darwin at least we can use clonefile
in its place, otherwise if we have a sparse file we'd have
a fun surprise with the io.Copy approach..
We should see if there's other platforms that we can enhance here.
I've forgotten what's the right route on Windows.
Signed-off-by: Danny Canter <danny@dcantah.dev>
Update the garbage collector to support image expiration along with
support for image leasing. This allows making images collectible during
garbage collection and using a lease to prevent removal of an image.
Signed-off-by: Derek McGowan <derek@mcg.dev>
The metadata is small and useful for viewing all platforms
for an image and enabling push back to the same registry.
Signed-off-by: Derek McGowan <derek@mcg.dev>
When a blob does not exist locally, rather than erroring on info
lookup, inherit the parent distribution sources. Push is able
to succeed even if the blob does not exist locally when a cross
repository mount is done. This is a common operation pushing a
multi-platform image to the same registry but different namespace.
Signed-off-by: Derek McGowan <derek@mcg.dev>
The "failed to recover state" message didn't include the ID, making this
not as useful as it could be.
This additionally moves some of the other logs to include the id for
the sandbox/container as a field instead of part of a format string.
Signed-off-by: Danny Canter <danny@dcantah.dev>
The reference/docker package was a fork of github.com/distribution/distribution,
which could not easily be used as a direct dependency, as it brought many other
dependencies with it.
The "reference" package has now moved to a separate repository, which means
we can replace the local fork, and use the upstream implementation again.
The new module was extracted from the distribution repository at commit:
b9b19409cf
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This reverts commit 778ac302b2.
(slightly modified, due to changes that were merged after that).
The reverted commit had two elements;
- Make `G` an actual function to improve the documentation
- Prevent `G` from being overwritten externally
From the commit that's reverted:
> The `G` variable is exported, and not expected to be overwritten
> externally. Defining it as a function also documents it as a function
> on https://pkg.go.dev, instead of a variable; https://pkg.go.dev/github.com/containerd/containerd@v1.6.22/log#pkg-variables
While it's unclear if the ability to replace the implementation was
_intentional_, it's this part that some external consumers were (ab)using.
We should look into that part in a follow-up, and design for this, for
example by providing a utility to replace the logger, and properly document
that.
In the meantime, let's revert the change.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
hcsshim tags v0.10.* are deprecated, so use the new
v0.12.0-rc.* versioning for hcsshim tags on containerd/main
Signed-off-by: Kirtana Ashok <kiashok@microsoft.com>
- For remote snapshotters, the unpack phase serves as an important step for
preparing the remote snapshot. With the missing unpacker.Wait, the
snapshotter `Prepare` context is always canceled.
- This patch allows remote snapshotter based archives to be imported via
the transfer service or `ctr image import`
Signed-off-by: Edgar Lee <edgarhinshunlee@gmail.com>
Somewhat similar to how we supply the version of runc to grab for testing via
a file in script/, this change supplies the Windows shim version to build off
of via a file in the same directory. This seems like a decent home given it now
lives next to the script that pulls and builds the shim to include in our build
artifacts/locally.
The motivation behind this change is:
Cut down on unnecessary hcsshim vendorings if no library code for containerd
changed. It was somewhat clunky how the Windows builds work today. The Windows
shim is developed out of tree at github.com/microsoft/hcsshim. To let containerd know
what tag to build the shim off of we'd vendor hcsshim into containerd, and then
parse the version string from go.mod, fetch this tag, and then build the shim and
include it in our artifacts. As mentioned, often times the vendoring would bring in
no actual changes that would affect containerd's usage of hcsshim as a library, and
would just serve as a means to bump the version of the containerd shim we should build.
Now this process can be a one line change and we can avoid the possible headaches that come
with bumping go.mod (bumping other unrelated deps etc.)
Signed-off-by: Danny Canter <danny@dcantah.dev>
From the Go docs:
"For a nil slice, the number of iterations is 0." [1]
Both `info.RootFS` and `host.clientPairs` are slices. Therefore, an
additional nil check before the loop is unnecessary.
[1]: https://go.dev/ref/spec#For_range
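A short illustration (the `countLayers` helper is made up for this example):

```go
package main

import "fmt"

// countLayers ranges over the slice without a nil guard; ranging over a nil
// slice simply performs zero iterations.
func countLayers(layers []string) int {
	n := 0
	for range layers {
		n++
	}
	return n
}

func main() {
	var none []string // nil slice
	fmt.Println(countLayers(none))
	fmt.Println(countLayers([]string{"a", "b"}))
}
```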
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
This was the only option not configurable from the toml for the plugin.
This is useful if you want to restart containerd and try a different
blockfile/size for the snapshotter.
Signed-off-by: Danny Canter <danny@dcantah.dev>
Since a recent contributor edited the synced version of this in the website (containerd.io) repo, we should just update the main repo and let the auto-sync PR get these 2 files back in sync with the latest releases.
Signed-off-by: Phil Estes <estesp@amazon.com>
Distros usually like to install docs, so add a rule for that, so
dist maintainers don't need to care about the details.
Signed-off-by: Enrico Weigelt, metux IT consult <info@metux.net>
This is a partial revert of "cri/sbserver: Use platform instead of GOOS
for userns detection".
While what that commit did is 100% the right thing to do, when the
sandbox_mode is "shim" all controller.XXX() calls are RPCs and the
controller.Create() call initializes the controller. Therefore, things
like "getSandboxController()" don't work in the case of "shim"
sandbox_mode until after the controller.Create().
Due to this asymmetry and the lack of tests for shim mode, we didn't
catch it before.
This patch just reverts that commit so that the Create() and
getSandboxController() calls remain where they were, and just relies on
the config Linux section as a hack to detect if the pod sandbox will use
user namespaces or not.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Make the rather obscure systemd notification build-time optional by
setting 'no_systemd' tag and so skip dependencies on around 9kLoC
vendor code.
Signed-off-by: Enrico Weigelt, metux IT consult <info@metux.net>
cgroupv1HasHugetlb() and cgroupv2HasHugetlb() may return errors, but nobody
(there's just one call site anyways) ever cares. So drop the unnecessary code.
Signed-off-by: Enrico Weigelt, metux IT consult <info@metux.net>
The default version of MinGW and GCC on the GitHub-hosted Windows 2019
runners compile fine but lead to linker errors during runtime.
Signed-off-by: Nashwan Azhari <nazhari@cloudbasesolutions.com>
Tests in pkg/cri/[sb]server/container_create_linux_test.go depend on go:noinline
since Go 1.21.
e.g.,
> ```
> === FAIL: pkg/cri/sbserver TestGenerateSeccompSecurityProfileSpecOpts/should_set_default_seccomp_when_seccomp_is_runtime/default (0.00s)
> container_create_linux_test.go:1013:
> Error Trace: /home/runner/work/containerd/containerd/pkg/cri/sbserver/container_create_linux_test.go:1013
> Error: Not equal:
> expected: 0x263d880
> actual : 0x263cbc0
> Test: TestGenerateSeccompSecurityProfileSpecOpts/should_set_default_seccomp_when_seccomp_is_runtime/default
> ```
See comments in PR 8957.
Thanks to Wei Fu for analyzing this.
Co-authored-by: Wei Fu <fuweid89@gmail.com>
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
- Fill OSVersion field of ocispec.Platform for windows OS in
transfer service plugin init()
- Do not return error from transfer service ReceiveStream if
stream.Recv() returned context.Canceled error
Signed-off-by: Kirtana Ashok <kiashok@microsoft.com>
@samuelkarp's https://github.com/samuelkarp/runj is a de facto default
FreeBSD runtime.
This change creates a set of defaults for FreeBSD setting
`wtf.sbk.runj.v1` as the default runtime.
Signed-off-by: Artem Khramov <akhramov@pm.me>
Following the addition of annotations to the grpc/ttrpc API surface,
follow suit with adding annotations to the controller api surface.
Signed-off-by: Danny Canter <danny@dcantah.dev>
An oft employed scheme for a lot of our APIs is to include an
annotations field which is just a map of string to string. This
usually allows folks using the API to send over metadata or auxiliary
information without needing to get a new field added (especially where
the field might not make sense for it to be a standalone field). I think
having annotations for CreateSandbox make sense for this same use case.
Signed-off-by: Danny Canter <danny@dcantah.dev>
This update addresses an issue where the stat call on FreeBSD could
return -1 for regular files. This led to incorrect Devmajor and
Devminor values, which should be zero in such cases. Refer to the
discussion on this bug in the following PR:
https://github.com/containerd/containerd/pull/5991.
The code change now handles this scenario appropriately.
Signed-off-by: Artem Khramov <akhramov@pm.me>
Since moby/moby can't handle duplicate exit events well, it's hard
for containerd to retry shutdown if there is an error, like context
canceled.
In order to prevent regressions like #4769, I added a skipped
integration case as a TODO item, and we should rethink how to handle
the task/shim lifecycle.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
Distros tend to change this to specific locations (eg. on MVCC installs),
therefore introduce a generic environment variable that has been common practice
for 30+ years and thus is already well known and supported by distros.
Signed-off-by: Enrico Weigelt, metux IT consult <info@metux.net>
Required for distros that wanna use their local version and
can't have some (possibly failing) git commands being run here.
Signed-off-by: Enrico Weigelt, metux IT consult <info@metux.net>
There still was one place that's calling the `go` command directly
instead of using the $(GO) variable.
Fixes: 9ea25634bd
Signed-off-by: Enrico Weigelt, metux IT consult <info@metux.net>
We have been using Cirrus CI for running vagrant workloads
as the standard runners of GHA lack nested virtualization,
but it looks like GHA with the "larger" runners support nested
virtualization.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
In the sbserver we should not use the GOOS, as windows hosts can run
linux containers. On the sbserver we should use the platform param.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Includes a fix for CVE-2023-29409
go1.20.7 (released 2023-08-01) includes a security fix to the crypto/tls
package, as well as bug fixes to the assembler and the compiler. See the
Go 1.20.7 milestone on our issue tracker for details:
- https://github.com/golang/go/issues?q=milestone%3AGo1.20.7+label%3ACherryPickApproved
- full diff: https://github.com/golang/go/compare/go1.20.6...go1.20.7
go1.19.12 (released 2023-08-01) includes a security fix to the crypto/tls
package, as well as bug fixes to the assembler and the compiler. See the
Go 1.19.12 milestone on our issue tracker for details.
- https://github.com/golang/go/issues?q=milestone%3AGo1.19.12+label%3ACherryPickApproved
- full diff: https://github.com/golang/go/compare/go1.19.11...go1.19.12
From the mailing list announcement:
[security] Go 1.20.7 and Go 1.19.12 are released
Hello gophers,
We have just released Go versions 1.20.7 and 1.19.12, minor point releases.
These minor releases include 1 security fixes following the security policy:
- crypto/tls: restrict RSA keys in certificates to <= 8192 bits
Extremely large RSA keys in certificate chains can cause a client/server
to expend significant CPU time verifying signatures. Limit this by
restricting the size of RSA keys transmitted during handshakes to <=
8192 bits.
Based on a survey of publicly trusted RSA keys, there are currently only
three certificates in circulation with keys larger than this, and all
three appear to be test certificates that are not actively deployed. It
is possible there are larger keys in use in private PKIs, but we target
the web PKI, so causing breakage here in the interests of increasing the
default safety of users of crypto/tls seems reasonable.
Thanks to Mateusz Poliwczak for reporting this issue.
View the release notes for more information:
https://go.dev/doc/devel/release#go1.20.7
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Before this PR, if a stdin/stdout/stderr stream is nil,
and the corresponding FIFO is not an empty string,
a panic will occur when Read/Write of the nil stream is invoked in io.CopyBuffer.
Signed-off-by: Hsing-Yu (David) Chen <davidhsingyuchen@gmail.com>
Runc 1.1 throws a warning when using relative destination paths, and runc 1.2
is planning to throw an error (i.e. won't start the container).
Let's just make this an abs path in the only place it might not be: the
mounts created due to `VOLUME` directives in the Dockerfile.
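One simple way to force such a destination to be absolute is sketched here. The `absContainerPath` helper is an assumption for illustration and may differ from what this change does.

```go
package main

import (
	"fmt"
	"path"
)

// absContainerPath anchors a container-side destination at "/". The "path"
// package is used (rather than "path/filepath") because container paths use
// forward slashes regardless of the host OS. path.Join also cleans the
// result, so already-absolute inputs are returned in cleaned form.
func absContainerPath(dst string) string {
	return path.Join("/", dst)
}

func main() {
	fmt.Println(absContainerPath("data"))          // relative VOLUME path
	fmt.Println(absContainerPath("/var/lib/data")) // already absolute
}
```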
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
[`logrus.SetLevel()`][1], [`logrus.GetLevel()`][2] and [`logrus.SetFormatter()`][3]
are all convenience functions to configure logrus' standardlogger, which is the
logger to which we hold a reference in the Entry configured on [`log.L`][4].
This patch:
- swaps calls to `logrus.SetLevel`, `logrus.GetLevel` and `logrus.SetFormatter`
for their equivalents on `log.L`. This makes it clearer what `SetLevel` does,
and makes sure that we set the log-level of the logger / entry we define in
our package (even if that would be swapped with a different instance).
- replaces the use of `logrus.NewEntry` with directly constructing an `Entry`,
using the local `Entry` alias (anticipating we can swap that type in future).
[1]: dd1b4c2e81/exported.go (L34C1-L37)
[2]: dd1b4c2e81/exported.go (L39-L42)
[3]: dd1b4c2e81/exported.go (L23-L26)
[4]: dd1b4c2e81/exported.go (L9-L16)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Add a package doc to (try to) describe the purpose of this package, and
to describe the purpose (and expectations) of aliases provided by the
package.
> Package log provides types and functions related to logging, passing
> loggers through a context, and attaching context to the logger.
>
> # Transitional types
>
> This package contains various types that are aliases for types in [logrus].
> These aliases are intended for transitioning away from hard-coding logrus
> as logging implementation. Consumers of this package are encouraged to use
> the type-aliases from this package instead of directly using their logrus
> equivalent.
>
> The intent is to replace these aliases with locally defined types and
> interfaces once all consumers are no longer directly importing logrus
> types.
>
> IMPORTANT: due to the transitional purpose of this package, it is not
> guaranteed for the full logrus API to be provided in the future. As
> outlined, these aliases are provided as a step to transition away from
> a specific implementation which, as a result, exposes the full logrus API.
> While no decisions have been made on the ultimate design and interface
> provided by this package, we do not expect carrying "less common" features.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The `G` variable is exported, and not expected to be overwritten
externally. Defining it as a function also documents it as a function
on https://pkg.go.dev, instead of a variable; https://pkg.go.dev/github.com/containerd/containerd@v1.6.22/log#pkg-variables
Note that (while the godoc suggests otherwise) I made `GetLogger` an alias
for `G`, as `G` is the most commonly used function (not the other way round),
although I don't think there's a performance gain in doing so.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
While other log-levels are not currently used in containerd itself,
they can be returned by `GetLevel()`, and are accepted (no error) by
`SetLevel()`. We should either accept those values, or produce an
error (in `SetLevel()`), but given that there's other ways to set the
log-level, we should probably acknowledge that this package is a transitional
package, and still closely tied to logrus (for the time being).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Testify was only used for a basic assertion. Remove the dependency,
in preparation of (potentially) moving this package to a separate
module.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The rpc only reports one field, i.e. the cgroup driver, to kubelet.
Containerd determines the effective cgroup driver by looking at all
runtime handlers, starting from the default runtime handler (the rest in
alphabetical order), and returning the cgroup driver setting of the
first runtime handler that supports one. If no runtime handler supports
cgroup driver (i.e. has a config option for it) containerd falls back to
auto-detection, returning systemd if systemd is running and cgroupfs
otherwise.
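The selection order described above can be sketched roughly as follows. This is a minimal illustration, not the CRI server code: the `handler` type, `effectiveCgroupDriver`, and the `autoDetect` callback are hypothetical names, and an empty `cgroupDriver` stands for a handler without such a config option.

```go
package main

import (
	"fmt"
	"sort"
)

// handler is a hypothetical stand-in for a runtime handler's config.
// cgroupDriver is empty when the handler has no config option for it.
type handler struct {
	cgroupDriver string
}

// effectiveCgroupDriver checks the default handler first, then the remaining
// handlers in alphabetical order, and returns the first driver it finds.
// If no handler sets one, it falls back to autoDetect.
func effectiveCgroupDriver(handlers map[string]handler, defaultName string, autoDetect func() string) string {
	names := make([]string, 0, len(handlers))
	for name := range handlers {
		if name != defaultName {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	if _, ok := handlers[defaultName]; ok {
		names = append([]string{defaultName}, names...)
	}
	for _, name := range names {
		if d := handlers[name].cgroupDriver; d != "" {
			return d
		}
	}
	return autoDetect()
}

func main() {
	handlers := map[string]handler{
		"runc": {cgroupDriver: ""},
		"kata": {cgroupDriver: "systemd"},
		"crun": {cgroupDriver: "cgroupfs"},
	}
	// Assumed auto-detection result for the example.
	detect := func() string { return "systemd" }
	fmt.Println(effectiveCgroupDriver(handlers, "runc", detect))
}
```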
This patch implements the CRI server side of Kubernetes KEP-4033:
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4033-group-driver-detection-over-cri
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
Previously, we would return the first non-404 error from a host.
This is logical, however, it can result in confusing errors to the
user:
- e.g. we have an HTTP host, and an HTTPS host.
If the image does not exist, we return "http: server gave HTTP
response to HTTPS client". This is technically correct, however, the
user is easily confused - the most relevant error in this case is the
404 error.
- e.g. we have a broken HTTP host that returns 5XX errors, and a HTTP
host with authentication.
On the request for an image, we return the 5XX error directly.
However, we have a host later on which returned an authentication
error which is now hidden from the user.
Note: this *can* be resolved by changing the order of hosts passed in,
however this requires 1. knowing ahead of time which hosts are going to
return certain errors and 2. this is often not desirable, we'd prefer
to use HTTPS if it's available, and only then fallback to HTTP.
To resolve this, we assign each possible error during resolution a
"priority" that marks how far through the image resolution process a
host/path combo got. Then we return the error with the highest priority,
which is much more likely to be the most relevant error to the user.
The ranking of priority then is (from lowest to highest):
- Underlying transport errors (TLS, TCP, etc)
- 404 errors
- Other 4XX/5XX errors
- Manifest rejection (due to max size exceeded)
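A minimal sketch of the ranking, assuming each host attempt is recorded together with a priority. The `priority` constants, `rankedError`, and `mostRelevant` are hypothetical names that only mirror the list above; they are not the resolver's actual types.

```go
package main

import (
	"errors"
	"fmt"
)

// priority marks how far a host got through resolution; higher is more relevant.
type priority int

const (
	priorityTransport        priority = iota // TLS, TCP, etc.
	priorityNotFound                         // 404
	priorityHTTPOther                        // other 4XX/5XX
	priorityManifestRejected                 // manifest exceeded max size
)

// Example errors used by main.
var (
	errTransport = errors.New("http: server gave HTTP response to HTTPS client")
	errNotFound  = errors.New("not found")
)

type rankedError struct {
	prio priority
	err  error
}

// mostRelevant returns the error with the highest priority, keeping the
// earliest one on ties. It returns nil for an empty slice.
func mostRelevant(errs []rankedError) error {
	var best *rankedError
	for i := range errs {
		if best == nil || errs[i].prio > best.prio {
			best = &errs[i]
		}
	}
	if best == nil {
		return nil
	}
	return best.err
}

func main() {
	errs := []rankedError{
		{priorityTransport, errTransport},
		{priorityNotFound, errNotFound},
	}
	fmt.Println(mostRelevant(errs))
}
```

With an HTTPS host that fails at the transport level and an HTTP host that answers 404, the 404 is the error reported to the user.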
Signed-off-by: Justin Chadwell <me@jedevc.com>
- `release.yml` continues to use Ubuntu 20.04 for glibc compatibility
- cgroup v1 is no longer tested with Ubuntu, but still tested with Rocky 8
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
If kubelet passes a swap limit (by default, memory limit = swap limit),
it is configured for the container irrespective of whether the node supports swap.
Signed-off-by: Qasim Sarfraz <qasimsarfraz@microsoft.com>
"ro" was not parsed out of the string, so it was passed as part of data
to mount().
This would lead to mount() returning an invalid argument code.
Separate out the "ro" option, much like "userxattr", which will allow
the MS_RDONLY mount flag to get set.
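As a rough illustration of the idea (not the actual mount code), the option list can be split so that "ro" turns into a flag and everything else stays in the data string. `splitOptions` and `msRdonly` are hypothetical names; `msRdonly` stands in for the Linux MS_RDONLY flag, which a real implementation would take from golang.org/x/sys/unix.

```go
package main

import (
	"fmt"
	"strings"
)

// msRdonly stands in for the MS_RDONLY mount flag (value 1 on Linux).
const msRdonly uintptr = 0x1

// splitOptions separates the "ro" option from the remaining options. "ro"
// becomes a mount flag; everything else is joined back into the data string
// that would be handed to mount().
func splitOptions(options []string) (flags uintptr, data string) {
	var rest []string
	for _, o := range options {
		if o == "ro" {
			flags |= msRdonly
			continue
		}
		rest = append(rest, o)
	}
	return flags, strings.Join(rest, ",")
}

func main() {
	flags, data := splitOptions([]string{"lowerdir=/l1:/l2", "ro"})
	fmt.Printf("flags=%#x data=%q\n", flags, data)
}
```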
Signed-off-by: Ben Foster <bpfoster@gmail.com>
release notes: https://github.com/opencontainers/runc/releases/tag/v1.1.8
full diff: https://github.com/opencontainers/runc/compare/v1.1.7...v1.1.8
This is the eighth patch release of the 1.1.z release branch of runc.
The most notable change is the addition of RISC-V support, along with a
few bug fixes.
- Support riscv64.
- init: do not print environment variable value.
- libct: fix a race with systemd removal.
- tests/int: increase num retries for oom tests.
- man/runc: fixes.
- Fix tmpfs mode opts when dir already exists.
- docs/systemd: fix a broken link.
- ci/cirrus: enable some rootless tests on cs9.
- runc delete: call systemd's reset-failed.
- libct/cg/sd/v1: do not update non-frozen cgroup after frozen failed.
- CI: bump Fedora, Vagrant, bats.
- .codespellrc: update for 2.2.5.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
That commit neither helps without a working bind-mount implementation nor is needed when such implementation exists.
Testing shows that containerd can properly download and unpack image using bindfs mounts (see previous commit) even without Darwin-specific applier code.
Signed-off-by: Marat Radchenko <marat@slonopotamus.org>
These are not actually being pulled, just removing the deprecated k8s.gcr.io
from the code-base. While at it, also renamed / removed vars that shadowed
package-level definitions.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
When the pods are transitioning there are several
cases where containers might not be in valid state.
There were several cases where the stats were
failing hard but we should just continue on as
they are transient and will be picked up again
when kubelet queries for the stats again.
Signed-off-by: James Sturtevant <jstur@microsoft.com>
Signed-off-by: Mark Rossetti <marosset@microsoft.com>
go1.20.6 (released 2023-07-11) includes a security fix to the net/http
package, as well as bug fixes to the compiler, cgo, the cover tool, the
go command, the runtime, and the crypto/ecdsa, go/build, go/printer,
net/mail, and text/template packages. See the Go 1.20.6 milestone on
our issue tracker for details.
https://github.com/golang/go/issues?q=milestone%3AGo1.20.6+label%3ACherryPickApproved
Full diff: https://github.com/golang/go/compare/go1.20.5...go1.20.6
These minor releases include 1 security fix following the security policy:
- net/http: insufficient sanitization of Host header
The HTTP/1 client did not fully validate the contents of the Host header.
A maliciously crafted Host header could inject additional headers or
entire requests. The HTTP/1 client now refuses to send requests containing
an invalid Request.Host or Request.URL.Host value.
Thanks to Bartek Nowotarski for reporting this issue.
Includes security fixes for CVE-2023-29406 and Go issue https://go.dev/issue/60374
Signed-off-by: Danny Canter <danny@dcantah.dev>
This commit just updates the sbserver with the same fix we did on main:
9bf5aeca77 ("cri: Fix net.ipv4.ping_group_range with userns ")
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
This is a port of 31a6449734 ("Add capability for snapshotters to
declare support for UID remapping") to sbserver.
This patch remaps the rootfs in the platform-specific code if user namespaces
are in use, so the pod can read/write to the rootfs.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
This patch requests the OCI runtime to create a userns when the CRI
message includes such request.
This is an adaptation of a7adeb6976 ("cri: Support pods with user
namespaces") to sbserver, although the container_create.go parts were
already ported as part of 40be96efa9 ("Have separate spec builder for
each platform").
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
This commit just ports 36f520dc04 ("Let OCI runtime create netns when
userns is used") to sbserver.
The CNI network setup is done after OCI start, as it didn't seem simple
to get the sandbox PID we need for the netns otherwise.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Currently there is a big c&p of the helpers between these two folders
and a TODO in the platform agnostic file to organize them in the future,
when some other things settle.
So, let's just copy them for now.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Commit c085fac1e5 ("Move sandbox start behind controller") moved the
runtimeStart to only account for time _after_ the netns has been
created.
To match what we currently do in cri/server, let's move it to just after
we get the sandbox runtime.
This came up when porting userns to sbserver, as the CNI network setup
needs to be done at a later stage and runtimeStart was accounting for
the CNI network setup time only when userns is enabled.
To avoid that discrepancy, let's just move it earlier, that also matches
what we do in cri/server.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Since we merged support for userns in:
https://github.com/containerd/containerd/pull/7679
overlay has been doing a chown for the rootfs using WithRemapperLabels.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Beside the "in future the when" typo, we take the chance to reflect that
user namespaces are already merged.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
These two errors can occur in the following scenarios:
ECONNRESET: the target process reset connection between CRI and itself.
see: #111825 for detail
EPIPE: the target process did not read the received data, causing the
buffer in the kernel to be full, resulting in the occurrence of Zero Window,
then closing the connection (FIN, RESET)
see: #74551 for detail
In both cases, we should RESET the httpStream.
Signed-off-by: wangxiang <scottwangsxll@gmail.com>
Modify the loopback size in the blockfile snapshotter test setup.
Set the loopback size to 16MB when the page size is greater than 4096.
Signed-off-by: James Jenkins <James.Jenkins@ibm.com>
I saw Cirrus CI / Vagrant BOX:rockylinux/8@5.0.0 failing during setting
up Vagrant, which may be due to other scripts provisioning the machine;
Reading package lists...
apt-get install -y libvirt-daemon libvirt-daemon-system vagrant vagrant-libvirt
E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 2496 (apt-get)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
Configure dpkg to wait for locks to be released instead of failing. I used
60 seconds as the timeout, which is relatively long, but given that the Vagrant
checks are known to take some time to run, is probably fine.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
userns.RunningInUserNS() checks if the code calling that function is
running inside a user namespace. But we need to check if the container
we will create will use a user namespace, in that case we need to
disable the sysctl too (or we would need to take the userns mapping into
account to set the IDs).
This was added in PR:
https://github.com/containerd/containerd/pull/6170/
And the param documentation says it is not enabled when user namespaces
are in use:
https://github.com/containerd/containerd/pull/6170/files#diff-91d0a4c61f6d3523b5a19717d1b40b5fffd7e392d8fe22aed7c905fe195b8902R118
I'm not sure if the intention was to disable this if containerd is
running inside a userns (rootless, if that is even supported) or just
when the pod has user namespaces.
Out of an abundance of caution, I'm keeping the userns.RunningInUserNS()
so it is still not used if containerd runs inside a user namespace.
With this patch and "enable_unprivileged_icmp = true" in the config,
running containerd as root on the host, pods with user namespaces start
just fine. Without this patch they fail with:
... failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: w
/proc/sys/net/ipv4/ping_group_range: invalid argument: unknown
Thanks a lot to Andy on the k8s slack for reporting the issue. He also
mentions he hits this with k3s on a default installation (the param
is off by default on containerd, but k3s turns that on by default it
seems). He also debugged which part of the stack was setting that
sysctl, found the PR that added this code in containerd and a workaround
(to turn the bool off).
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
go1.20.5 (released 2023-06-06) includes four security fixes to the cmd/go and
runtime packages, as well as bug fixes to the compiler, the go command, the
runtime, and the crypto/rsa, net, and os packages. See the Go 1.20.5 milestone
on our issue tracker for details:
https://github.com/golang/go/issues?q=milestone%3AGo1.20.5+label%3ACherryPickApproved
full diff: https://github.com/golang/go/compare/go1.20.4...go1.20.5
These minor releases include 3 security fixes following the security policy:
- cmd/go: cgo code injection
The go command may generate unexpected code at build time when using cgo. This
may result in unexpected behavior when running a go program which uses cgo.
This may occur when running an untrusted module which contains directories with
newline characters in their names. Modules which are retrieved using the go command,
i.e. via "go get", are not affected (modules retrieved using GOPATH-mode, i.e.
GO111MODULE=off, may be affected).
Thanks to Juho Nurminen of Mattermost for reporting this issue.
This is CVE-2023-29402 and Go issue https://go.dev/issue/60167.
- runtime: unexpected behavior of setuid/setgid binaries
The Go runtime didn't act any differently when a binary had the setuid/setgid
bit set. On Unix platforms, if a setuid/setgid binary was executed with standard
I/O file descriptors closed, opening any files could result in unexpected
content being read/written with elevated privileges. Similarly if a setuid/setgid
program was terminated, either via panic or signal, it could leak the contents
of its registers.
Thanks to Vincent Dehors from Synacktiv for reporting this issue.
This is CVE-2023-29403 and Go issue https://go.dev/issue/60272.
- cmd/go: improper sanitization of LDFLAGS
The go command may execute arbitrary code at build time when using cgo. This may
occur when running "go get" on a malicious module, or when running any other
command which builds untrusted code. This can be triggered by linker flags,
specified via a "#cgo LDFLAGS" directive.
Thanks to Juho Nurminen of Mattermost for reporting this issue.
This is CVE-2023-29404 and CVE-2023-29405 and Go issues https://go.dev/issue/60305 and https://go.dev/issue/60306.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
In commit 4b35c3829d, example shim erroneously started to depend on runc, fix that back.
Also, build example shim on all supported platforms to prevent such situations in the future.
Signed-off-by: Marat Radchenko <marat@slonopotamus.org>
since libseccomp is required only for building runc and we are only
building containerd binaries in nightly, the libseccomp-dev dependency
is removed. Foreign arch repositories are now removed since
crossbuild-essential-* packages are {arm64, ppc64el,..} cross compiler
packages for amd64 and arch specific repositories are not required.
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
Every shim implementation needs to select a correct publisher topic when posting events, so move it out of Linux-only runc code to the place where other shims can also use it.
Otherwise, shims have to copy-paste this code. For example, see runj: 8158e558a3/containerd/shim.go (L144-L172)
Signed-off-by: Marat Radchenko <marat@slonopotamus.org>
The whiteout timestamps are no longer set to the source date epoch.
The source date epoch still applies to non-whiteout files.
Discussion happened in moby/buildkit PR 3560.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
This brings in a ton of great improvements. Most notable for the containerd
daemon are performance improvements for cgroups v1 and v2 when gathering stats,
as well as some fixes for enabling controllers and deleting v1 cgroups.
Signed-off-by: Danny Canter <danny@dcantah.dev>
Helpers to convert from a slice of platforms to our protobuf representation
and vice-versa appear a couple times. It seems sane to just expose this facility
in the platforms pkg.
Signed-off-by: Danny Canter <danny@dcantah.dev>
Helpers to convert from snapshot types to their protobuf structures and
vice-versa appear three times. It seems sane to just expose this facility
in the snapshots pkg. From/ToKind weren't used anywhere but doesn't hurt to
round out the types by exposing them.
Signed-off-by: Danny Canter <danny@dcantah.dev>
Helpers to convert from the OCI image specs [Descriptor] to its protobuf
structure for Descriptor and vice-versa appear three times. It seems sane
to just expose this facility in /oci.
Signed-off-by: Danny Canter <danny@dcantah.dev>
Helpers to convert from containerd's [Mount] to its protobuf structure for
[Mount] and vice-versa appear three times. It seems sane to just expose
this facility in /mount.
Signed-off-by: Danny Canter <danny@dcantah.dev>
All of the tests using this didn't need stdin/err (one of them not even
stdout), so we can just leave them "empty" and change to a withStdout
naming to make it more obvious.
Signed-off-by: Danny Canter <danny@dcantah.dev>
This introduces a ParseSourceDateEpoch function, which can be used
to parse "SOURCE_DATE_EPOCH" values for situations where those
values are not passed through an env-var (or the env-var has been
read through other means).
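For illustration only, parsing such a value could look like the sketch below. `parseEpoch` is a hypothetical helper, not the ParseSourceDateEpoch function added by this commit; it assumes the value is a decimal number of seconds since the Unix epoch, as SOURCE_DATE_EPOCH is conventionally defined.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// parseEpoch parses a decimal number of seconds since the Unix epoch into a
// UTC time. Empty or non-numeric input is reported as an error.
func parseEpoch(s string) (time.Time, error) {
	if s == "" {
		return time.Time{}, fmt.Errorf("SOURCE_DATE_EPOCH value is empty")
	}
	secs, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return time.Time{}, fmt.Errorf("invalid SOURCE_DATE_EPOCH value %q: %w", s, err)
	}
	return time.Unix(secs, 0).UTC(), nil
}

func main() {
	t, err := parseEpoch("1687520829")
	if err != nil {
		panic(err)
	}
	fmt.Println(t.Format(time.RFC3339))
}
```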
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
These tests were failing on my macOS; could be the precision issue (like on
Windows), or just because they're "too fast".
=== RUN TestSourceDateEpoch/WithoutSourceDateEpoch
epoch_test.go:51:
Error Trace: /Users/thajeztah/go/src/github.com/containerd/containerd/pkg/epoch/epoch_test.go:51
Error: Should be true
Test: TestSourceDateEpoch/WithoutSourceDateEpoch
Messages: now: 2023-06-23 11:47:09.93118 +0000 UTC, v: 2023-06-23 11:47:09.93118 +0000 UTC
This patch:
- updates the rightAfter utility to allow the timestamps to be "equal"
- updates the asserts to provide some details about the timestamps
- uses UTC for the value we're comparing to, to match the timestamps
that are generated.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
I think NullIO is fine on Windows now. We have it as an option in ctr
and it's used for the pod sandbox container in CRI. Let's see if CI agrees.
Signed-off-by: Danny Canter <danny@dcantah.dev>
There was a todo for the windows variant of dependency installation that
hinted at making an install-hcsshim.sh script, however Windows today doesn't
rely on a standalone OCI runtime binary that gets invoked by the shim. Rather,
container creation/management is all handled by the shim itself in-proc. Due to
this, `make` or `make binaries` basically fulfills that purpose as it
clones hcsshim and builds the shim along with containerd.
Signed-off-by: Danny Canter <danny@dcantah.dev>
* Use direct-io mode to reduce IO.
* Add testViewHook helper to recover the backing file since the ext4
might need writable permission to handle recovery. If the backing file
needs recovery and it's for View snapshot, the readonly mount will
cause error.
* Use 8 MiB as capacity to reduce the IO.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
Microsoft announced the removal of nondistributable layers from their
images today. This makes the convert test fail since it assumes the
first layer is nondistributable on Windows during the test.
Signed-off-by: Phil Estes <estesp@amazon.com>
As a follow up change to adding a SandboxMetrics rpc to the core
sandbox service, the controller needed a corresponding rpc for CRI
and others to eventually implement.
This leaves the CRI (non-shim mode) controller unimplemented just to
have a change with the API addition to start.
Signed-off-by: Danny Canter <danny@dcantah.dev>
To gather metrics/stats about a specific sandbox instance, it'd be nice to
have a dedicated rpc for this. Because the same "what kind of stats are going
to be returned" dilemma exists for sandboxes as well, I've re-used the metrics
type we have as the data field is just an `any`, leaving the metrics returned
entirely up to the shim author. For CRI usecases this will just be cgroup and
windows stats as that's all that's supported right now.
Signed-off-by: Danny Canter <danny@dcantah.dev>
eventSendMu is causing severe lock contention when multiple processes
start and exit concurrently. Replace it with a different scheme for
maintaining causality w.r.t. start and exit events for a process which
does not rely on big locks for synchronization.
Keep track of all processes for which a Task(Exec)Start event has been
published and have not yet exited in a map, keyed by their PID.
Processing exits then is as simple as looking up which process
corresponds to the PID. If there are no started processes known with
that PID, the PID must either belong to a process which was started by
s.Start() and before the s.Start() call has added the process to the map
of running processes, or a reparented process which we don't care about.
Handle the former case by having each s.Start() call subscribe to exit
events before starting the process. It checks if the PID has exited in
the time between it starting the process and publishing the TaskStart
event, handling the exit if it has. Exit events for reparented processes
received when no s.Start() calls are in flight are immediately
discarded, and events received during an s.Start() call are discarded
when the s.Start() call returns.
Co-authored-by: Laura Brehm <laurabrehm@hey.com>
Signed-off-by: Cory Snider <csnider@mirantis.com>
When a container is in the created or exited state, it will not have stats. A common case for this in k8s is the init containers for a pod. They will be present in the listed containers but will not have a running task and therefore no stats.
Signed-off-by: James Sturtevant <jstur@microsoft.com>
This allows standard OTLP env vars to be used for configuring tracing
exporters.
Note: This does mean that, as written now, if no env var is set the
trace exporter will try to connect to the default OTLP address
(`localhost:4318`).
I've left this alone for now, but we could detect the OTLP vars
ourselves and if not set don't configure the exporter.
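The detection mentioned above could be sketched as below. This is an assumption about how such a check might look, not code from this change: `otlpConfigured` is a hypothetical helper, and it only looks at the two standard endpoint variables, `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`.

```go
package main

import (
	"fmt"
	"os"
)

// otlpEnvVars lists standard OpenTelemetry variables that select an OTLP endpoint.
var otlpEnvVars = []string{
	"OTEL_EXPORTER_OTLP_ENDPOINT",
	"OTEL_EXPORTER_OTLP_TRACES_ENDPOINT",
}

// otlpConfigured reports whether any endpoint variable is set to a non-empty
// value. getenv is injected so the check can be exercised without touching
// the process environment.
func otlpConfigured(getenv func(string) string) bool {
	for _, name := range otlpEnvVars {
		if getenv(name) != "" {
			return true
		}
	}
	return false
}

func main() {
	if otlpConfigured(os.Getenv) {
		fmt.Println("configuring OTLP trace exporter")
	} else {
		fmt.Println("no OTLP endpoint set; skipping exporter")
	}
}
```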
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
The 10-containerd-net.conflist file generated from the conf_template
should be written atomically so that partial writes are not visible to
CNI plugins. Use the new consistentfile package to ensure this on
Unix-like platforms such as Linux, FreeBSD, and Darwin.
Fixes https://github.com/containerd/containerd/issues/8607
Signed-off-by: Samuel Karp <samuelkarp@google.com>
Certain files may need to be written atomically so that partial writes
are not visible to other processes. On Unix-like platforms such as
Linux, FreeBSD, and Darwin, this is accomplished by writing a temporary
file, syncing, and renaming over the destination file name. On Windows,
the same operations are performed, but Windows does not guarantee that a
rename operation is atomic.
Partial/inconsistent reads can occur due to:
1. A process attempting to read the file while containerd is writing it
(both in the case of a new file with a short/incomplete write or in
the case of an existing, updated file where new bytes may be written
at the beginning but old bytes may still be present after).
2. Concurrent goroutines in containerd leading to multiple active
writers of the same file.
The above mechanism explicitly protects against (1) as all writes are to
a file with a temporary name.
There is no explicit protection against multiple, concurrent goroutines
attempting to write the same file. However, atomically writing the file
should mean only one writer will "win" and a consistent file will be
visible.
Signed-off-by: Samuel Karp <samuelkarp@google.com>
The initial PR had a check for nil metrics but after some refactoring in the PR the test case that was supposed to cover HPC was missing a scenario where the metric was not nil but didn't contain any metrics. This fixes that case and adds a testcase to cover it.
Signed-off-by: James Sturtevant <jstur@microsoft.com>
Go deprecation comments must be formatted to have an empty comment line before
them. Fix the formatting to make sure linters and editors detect that these
are deprecated.
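For reference, the expected layout puts the "Deprecated:" paragraph after an empty comment line. The function names below are hypothetical and only illustrate the formatting.

```go
package main

import "fmt"

// NewName returns a greeting.
func NewName() string { return "hello" }

// OldName returns a greeting.
//
// Deprecated: use [NewName] instead.
func OldName() string { return NewName() }

func main() {
	fmt.Println(OldName())
}
```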
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This change adds support for CDI devices to the ctr --device flag.
If a fully-qualified CDI device name is specified, this is injected
into the OCI specification before creating the container.
Note that the CDI specifications and the devices that they represent
are local and mirror the behaviour of linux devices in the ctr command.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Several bits of code unmarshal image config JSON into an `ocispec.Image`, and then immediately create an `ocispec.Platform` out of it, but then discard the original image *and* miss several potential platform fields (most notably, `variant`).
Because `ocispec.Platform` is a strict subset of `ocispec.Image`, most of these can be updated to simply unmarshal the image config directly to `ocispec.Platform` instead, which allows these additional fields to be picked up appropriately.
We can use `tianon/raspbian` as a concrete reproducer to demonstrate.
Before:
```console
$ ctr content fetch docker.io/tianon/raspbian:bullseye-slim
...
$ ctr image ls
REF TYPE DIGEST SIZE PLATFORMS LABELS
docker.io/tianon/raspbian:bullseye-slim application/vnd.docker.distribution.manifest.v2+json sha256:66e96f8af40691b335acc54e5f69711584ef7f926597b339e7d12ab90cc394ce 28.6 MiB linux/arm/v7 -
```
(Note that the `PLATFORMS` column lists `linux/arm/v7` -- the image itself is actually `linux/arm/v6`, but one of these bits of code leads to only `linux/arm` being extracted from the image config, which `platforms.Normalize` then updates to an explicit `v7`.)
After:
```console
$ ctr image ls
REF TYPE DIGEST SIZE PLATFORMS LABELS
docker.io/tianon/raspbian:bullseye-slim application/vnd.docker.distribution.manifest.v2+json sha256:66e96f8af40691b335acc54e5f69711584ef7f926597b339e7d12ab90cc394ce 28.6 MiB linux/arm/v6 -
```
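The underlying idea can be sketched with a local struct. This is a simplified stand-in: the `platform` type below mirrors only three JSON fields of `ocispec.Platform`, and `platformFromConfig` and the sample config are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// platform mirrors the platform-related JSON fields of an OCI image config.
type platform struct {
	Architecture string `json:"architecture"`
	OS           string `json:"os"`
	Variant      string `json:"variant,omitempty"`
}

// platformFromConfig unmarshals the image config directly into platform.
// Fields that are not part of the platform are ignored by encoding/json,
// while "variant" is picked up without any extra copying step.
func platformFromConfig(config []byte) (platform, error) {
	var p platform
	err := json.Unmarshal(config, &p)
	return p, err
}

func main() {
	config := []byte(`{"architecture":"arm","os":"linux","variant":"v6","config":{"Cmd":["bash"]}}`)
	p, err := platformFromConfig(config)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s/%s/%s\n", p.OS, p.Architecture, p.Variant)
}
```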
Signed-off-by: Tianon Gravi <admwiggin@gmail.com>
Co-authored-by: Sebastiaan van Stijn <github@gone.nl>
This flag allows cpuset.mems to be specified when running a container. If
provided, the container will use only the defined memory nodes.
Signed-off-by: Peteris Rudzusiks <rye@stripe.com>
This flag allows cpuset.cpus to be specified when starting a container. If
provided, the container will use only the defined CPU cores.
Signed-off-by: Peteris Rudzusiks <rye@stripe.com>
If this command is used without "-Container:$false" and the "containerd" directory does not already exist all files will be merged into a single "containerd" file instead of a new directory.
Signed-off-by: chschumacher1994 <115921143+chschumacher1994@users.noreply.github.com>
This patch switches the Azure-based Windows workflows to using the
vanilla `2019-Datacenter` Azure SKU following the deprecation of the
old specialized `2019-Datacenter-with-Containers-smalldisk` SKU which
was previously used.
Signed-off-by: Nashwan Azhari <nazhari@cloudbasesolutions.com>
Document the protocol buffer setup script and make note of external
proto files that must be added for successful generation.
Signed-off-by: James Jenkins <James.Jenkins@ibm.com>
Windows systems are capable of running both Windows Containers and Linux
containers. For windows containers we need to sanitize the volume path
and skip non-C volumes from the copy existing contents code path. Linux
containers running on Windows and Linux must not have the path sanitized
in any way.
Supplying the targetOS of the container allows us to properly decide
when to activate that code path.
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
Images may be created with a VOLUME stanza pointed to drive letters that
are not C:. Currently, an image that has such VOLUMEs defined, will
cause containerd to error out when starting a container.
This change skips copying existing contents to volumes that are not C:,
as an image can only hold files that are destined for the C: drive of a
container.
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
If a mount destination is specified both in the default spec and in a
--mount option, remove the default mount before adding new mounts. This
allows overriding the default sysfs mount, for example.
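A minimal sketch of the override logic, assuming destinations are compared as exact strings. The `mount` type and `overrideMounts` are hypothetical, reduced stand-ins and not the ctr implementation.

```go
package main

import "fmt"

// mount is a reduced stand-in for an OCI spec mount.
type mount struct {
	Destination string
	Type        string
	Source      string
}

// overrideMounts drops every default mount whose destination is also used by
// a user-supplied mount, then appends the user-supplied mounts.
func overrideMounts(defaults, user []mount) []mount {
	overridden := make(map[string]struct{}, len(user))
	for _, m := range user {
		overridden[m.Destination] = struct{}{}
	}
	out := make([]mount, 0, len(defaults)+len(user))
	for _, m := range defaults {
		if _, ok := overridden[m.Destination]; !ok {
			out = append(out, m)
		}
	}
	return append(out, user...)
}

func main() {
	defaults := []mount{
		{Destination: "/proc", Type: "proc", Source: "proc"},
		{Destination: "/sys", Type: "sysfs", Source: "sysfs"},
	}
	user := []mount{{Destination: "/sys", Type: "bind", Source: "/sys"}}
	for _, m := range overrideMounts(defaults, user) {
		fmt.Println(m.Destination, m.Type)
	}
}
```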
Signed-off-by: Samuel Karp <samuelkarp@google.com>
This commit fixes a broken link. This commit also updates the description about
the image handler. It now mentions the
`github.com/containerd/containerd/pkg/snapshotters` package.
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
To further some ongoing work in containerd to make as much code as possible
able to be used on any platform (to handle runtimes that can virtualize/emulate
a variety of different OSes), this change makes stats able to be handled on
any of the supported stat types (just linux and windows). To accomplish this,
we use the platform the sandbox returns from its `Platform` rpc to decide
what format the containers in a given sandbox are returning metrics in, then
we can typecast/marshal accordingly.
Signed-off-by: Danny Canter <danny@dcantah.dev>
Add new test cases for volumes on both Linux and Windows. These new
volumes will be used to test that we don't accidentally mangle volume
paths on Linux and that non-C volume mounts work properly when defined
in an image on Windows.
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
The oci.WithUser option was being applied in container_create_linux.go
instead of the cross plat buildLinuxSpec method. There's been recent
work to try and make every spec option that can be applied on any platform
able to do so, and this falls under that. However, WithUser on linux platforms
relies on the containers SnapshotKey being filled out, which means the spec
option needs to be applied during container creation.
To make this a little more generic, I've created a new platformSpecOpts
method that handles any spec opts that rely on runtime state (rootfs mounted
for example) for some platforms, or just platform options that we still don't
have workarounds for to be able to specify them for other platforms
(apparmor, seccomp etc.) by internally calling the already existing
containerSpecOpts method.
Signed-off-by: Danny Canter <danny@dcantah.dev>
Follow-up to #8489. We don't need to call Close in the err and success
cases, we can just do it after Readdirnames returns.
Signed-off-by: Danny Canter <danny@dcantah.dev>
There were a couple of uses of Readdir/ReadDir here where the only thing the return
value was used for was the Name of the entry. This is exactly what Readdirnames
returns, so we can avoid the overhead of making/returning a bunch of interfaces
and calling lstat every time in the case of Readdir(-1).
https://cs.opensource.google/go/go/+/refs/tags/go1.20.4:src/os/dir_unix.go;l=114-137
Signed-off-by: Danny Canter <danny@dcantah.dev>
- we don't support go < 1.8. this restriction was added because plugin support
requires go 1.8 or up, but with such old versions being EOL, this check was
rather redundant
- add back arm64 support; in 6bd0710831, non-amd64
was disabled, pending golang/go#17138, which was tracking arm64 support, and
is now resolved. It's unclear if architectures other than amd64 and arm64 are
supported, so keeping it restricted to amd64 and arm64.
- enable plugin support on Windows; it was disabled in 0b44e24c07
but the code looks to take windows into account.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This makes it possible to check whether content didn't actually need to
be pushed to the remote registry and was cross-repo mounted or already
existed.
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
RunWithPrivileges() will lock a thread, change
privileges, and run the function passed in, within that thread. This
allows us to limit the scope in which we enable privileges and avoids
accidentally enabling privileges in threads that should never have them.
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
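The pattern described above (pin the goroutine to one OS thread, change state on that thread, run the function, then revert) can be sketched in portable Go. The helper `runScoped` and its `enable`/`revert` callbacks are hypothetical and stand in for the Windows-specific privilege calls; this is not the go-winio implementation.

```go
package main

import (
	"fmt"
	"runtime"
)

// runScoped is a hypothetical helper. It pins the calling goroutine to its
// OS thread, applies a thread-scoped change via enable, runs fn, and then
// reverts the change before the thread is released to other goroutines.
func runScoped(enable func() error, revert func(), fn func() error) error {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	if err := enable(); err != nil {
		return err
	}
	// Deferred calls run in reverse order: revert first, then unlock.
	defer revert()

	return fn()
}

func main() {
	var steps []string
	err := runScoped(
		func() error { steps = append(steps, "enable"); return nil },
		func() { steps = append(steps, "revert") },
		func() error { steps = append(steps, "fn"); return nil },
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(steps)
}
```

Because the change is reverted before the thread is unlocked, other goroutines scheduled onto that thread later should not inherit the elevated state.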
It seems that in certain situations, like having the containerd root
and state on a file system hosted on a mounted VHDX, we need
SeSecurityPrivilege when opening a file with winio.ACCESS_SYSTEM_SECURITY.
This happens in the base layer writer in hcsshim when adding a new file.
Enabling SeSecurityPrivilege allows the containerd root to be hosted on
a vhdx.
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
# See https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/configuring-pull-request-merges/managing-a-merge-queue#triggering-merge-group-checks-with-github-actions
merge_group:
push:
branches:
- main
- "release/**"
pull_request:
branches:
- main
- "release/**"
env:
# Go version we currently use to build containerd across all CI.
# Note: don't forget to update `Binaries` step, as it contains the matrix of all supported Go versions.
GO_VERSION: "1.20.4"
branches: ['main', 'release/**']
permissions: # added using https://github.com/step-security/secure-workflows
stale-issue-message: 'This issue is stale because it has been open 90 days with no activity. This issue will be closed in 7 days unless new comments are made or the stale label is removed.'
# Comment on the staled PRs
stale-pr-message: 'This PR is stale because it has been open 90 days with no activity. This PR will be closed in 7 days unless new comments are made or the stale label is removed.'
# Comment on the staled issues while closed
close-issue-message: 'This issue was closed because it has been stalled for 7 days with no activity.'
# Comment on the staled PRs while closed
close-pr-message: 'This PR was closed because it has been stalled for 7 days with no activity.'
# Enable dry-run when changing this file from a PR.
@ -32,8 +32,6 @@ including the Balena project listed below.
**_Rancher's Rio project_** - Rancher Labs [Rio](https://github.com/rancher/rio) project uses containerd as the runtime for a combined Kubernetes, Istio, and container "Cloud Native Container Distribution" platform.
**_Eliot_** - The [Eliot](https://github.com/ernoaapa/eliot) container project for IoT device container management uses containerd as the runtime.
**_Balena_** - Resin's [Balena](https://github.com/resin-os/balena) container engine, based on moby/moby but for edge, embedded, and IoT use cases, uses the containerd and runc stack in the same way that the Docker engine uses containerd.
**_LinuxKit_** - the Moby project's [LinuxKit](https://github.com/linuxkit/linuxkit) for building secure, minimal Linux OS images in a container-native model uses containerd as the core runtime for system and service containers.
@ -58,7 +56,9 @@ including the Balena project listed below.
**_Deckhouse_** - [Deckhouse Kubernetes Platform](https://deckhouse.io/) from Flant allows you to manage Kubernetes clusters anywhere in a fully automatic and uniform fashion. It uses containerd as the default CRI runtime.
**_[Actuated](https://actuated.dev)** - Actuated is a platform for running self-hosted CI in securely-isolated Firecracker VMs. Actuated uses containerd's image pulling facility to distribute and update the root filesystem for VMs for CI agents.
**_[Actuated](https://actuated.dev)_** - Actuated is a platform for running self-hosted CI in securely-isolated Firecracker VMs. Actuated uses containerd's image pulling facility to distribute and update the root filesystem for VMs for CI agents.
**_[Syself Autopilot](https://syself.com)_** - Syself Autopilot is a simplified Kubernetes platform based on Cluster API that can run on various providers. Syself Autopilot uses containerd as the default CRI runtime.
**_Other Projects_** - While the above list provides a cross-section of well known uses of containerd, the simplicity and clear API layer for containerd has inspired many smaller projects around providing simple container management platforms. Several examples of building higher layer functionality on top of the containerd base have come from various containerd community participants:
- Michael Crosby's [boss](https://github.com/crosbymichael/boss) project,
@ -25,7 +25,7 @@ A codespace will open in a web-based version of Visual Studio Code. The [dev con
To build the `containerd` daemon, and the `ctr` simple test client, the following build system dependencies are required:
* Go 1.19.x or above
* Go 1.22.x or above
* Protoc 3.x compiler and headers (download at the [Google protobuf releases page](https://github.com/protocolbuffers/protobuf/releases))
* Btrfs headers and libraries for your distribution. Note that building the btrfs driver can be disabled via the build tag `no_btrfs`, removing this dependency.
@ -43,12 +43,7 @@ You need `git` to checkout the source code:
For proper results, install the `protoc` release into `/usr/local` on your build system. For example, the following commands will download and install the 3.11.4 release for a 64-bit Linux host:
For proper results, install the `protoc` release into `/usr/local` on your build system. When generating source code from `.proto` files, containerd may rely on some external protocol buffer files. These external dependencies should be added to the `/usr/local/include` directory. To install the appropriate version of `protoc` and download any necessary external protocol buffer files on a Linux host, run the install script located at `script/setup/install-protobuf`.
To enable optional [Btrfs](https://en.wikipedia.org/wiki/Btrfs) snapshotter, you should have the headers from the Linux kernel 4.12 or later.
The dependency on the kernel headers only affects users building containerd from source.
@ -125,6 +120,9 @@ make generate
> * `no_btrfs`: A build tag disables building the Btrfs snapshot driver.
> * `no_devmapper`: A build tag disables building the device mapper snapshot driver.
> * `no_zfs`: A build tag disables building the ZFS snapshot driver.
> * platform
> * `no_systemd`: disables any systemd specific code
> * `no_dynamic_plugins`: A build tag disables dynamic plugins.
>
> For example, adding `BUILDTAGS=no_btrfs` to your environment before calling the **binaries**
> Makefile target will disable the btrfs driver within the containerd Go build.
@ -153,52 +151,33 @@ make STATIC=1
# Via Docker container
The following instructions assume you are at the parent directory of containerd source directory.
> [!NOTE]
> The following instructions assume you are at the **parent** directory of containerd source directory.
## Build containerd in a container
You can build `containerd` via a Linux-based Docker container.
You can build an image from this `Dockerfile`:
You can build `containerd` via a Linux-based Docker container using the [Docker official `golang` image](https://hub.docker.com/_/golang/)
```dockerfile
FROM golang
```
Let's suppose that you built an image called `containerd/build`. From the
containerd source root directory you can run the following command:
From the **parent** directory of `containerd`'s cloned repo you can run the following command:
```sh
-w /go/src/github.com/containerd/containerd containerd/build sh
-v ${PWD}/containerd:/src/containerd \
-w /src/containerd golang
```
This mounts `containerd` repository
This mounts the `containerd` repository inside the image at `/src/containerd` and, by default, runs a shell at that directory.
You are now ready to [build](#build-containerd):
```sh
make && make install
```
You are now ready to follow the [build instructions](#build-containerd):
## Build containerd and runc in a container
To have complete core container runtime, you will need both `containerd` and `runc`. It is possible to build both of these via Docker container.
You can use `git` to checkout `runc`:
You can clone `runc` in the same parent directory where you cloned `containerd` and you should clone [the latest stable version of `runc`](https://github.com/opencontainers/runc/releases), e.g. v1.1.13:
In our Docker container we will build `runc` build, which includes
@ -209,36 +188,66 @@ do not require external libraries at build time). Refer to [RUNC.md](docs/RUNC.m
in the docs directory for details about building runc, and to learn about
supported versions of `runc` as used by containerd.
Let's suppose you build an image called `containerd/build` from the above Dockerfile. You can run the following command:
Since we need [`libseccomp-dev`](https://packages.debian.org/stable/libseccomp-dev) installed as a dependency, we will need a custom Docker image derived from the official `golang` image. You can use the following `Dockerfile` to build your custom image:
This guide will help familiarize contributors to the `containerd/containerd` repository.
## Prerequisite
First read the containerd project's [general guidelines around contribution](https://github.com/containerd/project/blob/main/CONTRIBUTING.md)
which apply to all containerd projects.
## Getting started
See [`BUILDING.md`](https://github.com/containerd/containerd/blob/main/BUILDING.md) for instructions for setting up a development environment.
If you are also a new user to containerd, you can first check out the [_Getting started with containerd_](https://github.com/containerd/containerd/blob/main/docs/getting-started.md) guide.
## Setting up your local environment
At a minimum, the dev tools from `script/setup/install-dev-tools` should be installed.
Run `make install-deps` to install dependencies used for running and developing the CRI plugin.
Other install scripts under `script/setup` may need to be run depending on your environment and your preference for installing libraries and dependencies.
The versions used by `containerd/containerd` CI can be found in `script/setup` and referred to if installing manually.
```
$ script/setup/install-dev-tools
$ make install-deps
```
## Code style
- Go files adhere to standard Go formatting and styling
- Protobuf files use tabs for indentation
- Other files must not contain trailing whitespace and should end with a single new line character
Use the `check` command in the makefile to verify your code matches the expected style.
```
make check
```
## Updating protobuf files
Ensure protoc and dev tools have been installed, then run `make protos`
> **Note**
> When running `make protos`, the current working directory should be found under the `GOPATH` environment
> variable to ensure protoc can properly resolve the paths of protofiles in the project.
## Naming packages
Package names should be short and simple. Avoid using `_` and repeating words from parent directories.
### Where to put packages
Try to put a new package under the appropriate root directories. The root directory is reserved for
configuration and build files, no source files will be accepted in root since containerd v2.0.
- `api` - All protobuf service definitions and types used by services
- `bin` - Autogenerated during build, do not check in files here
- `client` - All Go files for the containerd client (formerly in `containerd/containerd` root in 1.x)
- `cmd` - All Go main packages and the packages used only for that main package
- `contrib` - Files, configurations, and packages related to external tools or libraries
- `core` - Core Go packages with interface definitions and built-in implementations
- `docs` - All containerd technical documentation using markdown
- `internal` - All utility packages used by containerd and not intended for direct import
- `man` - All containerd reference manuals used for the `man` command
- `pkg` - Non-core Go packages used by clients and other containerd packages
- `plugins` - All included containerd plugins which are registered via init
- `releases` - All release note files
- `script` - All scripts used for testing, development, and CI
- `test` - Test scripts used for external end to end testing of containerd, do not add new files here
- `vendor` - Autogenerated vendor files from `make vendor` command, do not manually edit files here
- `version` - Version package with the current containerd version
@ -256,7 +270,7 @@ bin/gen-manpages: cmd/gen-manpages FORCE
bin/containerd-shim-runc-v2: cmd/containerd-shim-runc-v2 FORCE # set !cgo and omit pie for a static shim build: https://github.com/golang/go/issues/17789#issuecomment-258542220

containerd is an industry-standard container runtime with an emphasis on simplicity, robustness, and portability. It is available as a daemon for Linux and Windows, which can manage the complete container lifecycle of its host system: image transfer and storage, container execution and supervision, low-level storage and network attachments, etc.
@ -17,16 +19,8 @@ containerd is designed to be embedded into a larger system, rather than being us
## Announcements
### Hello Kubernetes v1.24!
The containerd project would like to announce containerd [v1.6.4](https://github.com/containerd/containerd/releases/tag/v1.6.4). While other prior releases are supported, this latest release and the containerd [v1.5.11](https://github.com/containerd/containerd/releases/tag/v1.5.11) release are recommended for Kubernetes v1.24.
We felt it important to announce this, particularly in view of [the dockershim removal from this release of Kubernetes](https://kubernetes.io/blog/2022/05/03/dockershim-historical-context/).
It should be noted here that moving to CRI integrations has been in the plan for many years. `containerd` began as part of `Docker` and was donated to `CNCF`. `containerd` remains in use today by Docker/moby/buildkit etc., and has many other [adopters](https://github.com/containerd/containerd/blob/main/ADOPTERS.md). `containerd` has a namespace that isolates use of `containerd` from various clients/adopters. The Kubernetes namespace is appropriately named `k8s.io`. The CRI API and `containerd` CRI plugin project has, from the start, been an effort to reduce the impact surface for Kubernetes container runtime integration. If you can't tell, we are excited to see this come to fruition.
If you have any concerns or questions, we will be here to answer them in [issues, discussions, and/or on slack](#communication). Below you will find information/detail about our [CRI Integration](#cri) implementation.
For containerd users already on v1.6.0-v1.6.3, there are known issues addressed by [v1.6.4](https://github.com/containerd/containerd/releases/tag/v1.6.4). The issues are primarily related to [CNI setup](https://github.com/kubernetes/website/blob/dev-1.24/content/en/docs/tasks/administer-cluster/migrating-from-dockershim/troubleshooting-cni-plugin-related-errors.md)
### containerd v2.0 is now released!
See [`docs/containerd-2.0.md`](docs/containerd-2.0.md).
### Now Recruiting
@ -47,7 +41,7 @@ See our documentation on [containerd.io](https://containerd.io):
* [namespaces](docs/namespaces.md)
* [client options](docs/client-opts.md)
See how to build containerd from source at [BUILDING](BUILDING.md).
To get started contributing to containerd, see [CONTRIBUTING](CONTRIBUTING.md).
If you are interested in trying out containerd see our example at [Getting Started](docs/getting-started.md).
@ -98,164 +92,8 @@ For configuring registries, see [registry host configuration documentation](docs
## Features
### Client
containerd offers a full client package to help you integrate containerd into your platform.
Namespaces allow multiple consumers to use the same containerd without conflicting with each other. It has the benefit of sharing content while maintaining separation with containers and images.
In containerd, a container is a metadata object. Resources such as an OCI runtime specification, image, root filesystem, and other metadata can be attached to a container.
containerd fully supports the OCI runtime specification for running containers. We have built-in functions to help you generate runtime specifications based on images as well as custom parameters.
You can specify options when creating a container about how to modify the specification.
Taking a container object and turning it into a runnable process on a system is done by creating a new `Task` from the container. A task represents the runnable object within containerd.
```go
// the task is now running and has a pid that can be used to setup networking
// or other runtime settings outside of containerd
pid := task.Pid()
// start the redis-server process inside the container
err := task.Start(context)
// wait for the task to exit and get the exit status
status, err := task.Wait(context)
```
### Checkpoint and Restore
If you have [criu](https://criu.org/Main_Page) installed on your machine you can checkpoint and restore containers and their tasks. This allows you to clone and/or live migrate containers to other machines.
In addition to the built-in Snapshot plugins in containerd, additional external
plugins can be configured using GRPC. An external plugin is made available using
the configured name and appears as a plugin alongside the built-in ones.
To add an external snapshot plugin, add the plugin to containerd's config file
(by default at `/etc/containerd/config.toml`). The string following
`proxy_plugin.` will be used as the name of the snapshotter and the address
should refer to a socket with a GRPC listener serving containerd's Snapshot
GRPC API. Remember to restart containerd for any configuration changes to take
effect.
```
[proxy_plugins]
[proxy_plugins.customsnapshot]
type = "snapshot"
address = "/var/run/mysnapshotter.sock"
```
See [PLUGINS.md](/docs/PLUGINS.md) for how to create plugins
For a detailed overview of containerd's core concepts and the features it supports,
please refer to the [FEATURES.MD](./docs/features.md) document.
### Releases and API Stability
@ -299,9 +137,6 @@ loaded for the user's shell environment.
`cri` is a native plugin of containerd. Since containerd 1.1, the cri plugin is built into the release binaries and enabled by default.
> **Note:** As of containerd 1.5, the `cri` plugin is merged into the containerd/containerd repo. For example, the source code previously stored under [`containerd/cri/pkg`](https://github.com/containerd/cri/tree/release/1.4/pkg)
was moved to [`containerd/containerd/pkg/cri` package](https://github.com/containerd/containerd/tree/main/pkg/cri).
The `cri` plugin has reached GA status, representing that it is:
* Feature complete
* Works with Kubernetes 1.10 and above
@ -309,7 +144,7 @@ The `cri` plugin has reached GA status, representing that it is:
* Passes all [node e2e tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-node/e2e-node-tests.md).
* Passes all [e2e tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/e2e-tests.md).
See results on the containerd k8s [test dashboard](https://k8s-testgrid.appspot.com/sig-node-containerd)
See results on the containerd k8s [test dashboard](https://testgrid.k8s.io/containerd)
#### Validating Your `cri` Setup
A Kubernetes incubator project, [cri-tools](https://github.com/kubernetes-sigs/cri-tools), includes programs for exercising CRI implementations. More importantly, cri-tools includes the program `critest` which is used for running [CRI Validation Testing](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-node/cri-validation.md).
| [0.0](https://github.com/containerd/containerd/releases/tag/0.0.5) | End of Life | Dec 4, 2015 | - |
| [0.1](https://github.com/containerd/containerd/releases/tag/v0.1.0) | End of Life | Mar 21, 2016 | - |
| [0.2](https://github.com/containerd/containerd/tree/v0.2.x) | End of Life | Apr 21, 2016 | December 5, 2017 |
| [1.0](https://github.com/containerd/containerd/releases/tag/v1.0.3) | End of Life | December 5, 2017 | December 5, 2018 |
| [1.1](https://github.com/containerd/containerd/releases/tag/v1.1.8) | End of Life | April 23, 2018 | October 23, 2019 |
| [1.2](https://github.com/containerd/containerd/releases/tag/v1.2.13) | End of Life | October 24, 2018 | October 15, 2020 |
| [1.3](https://github.com/containerd/containerd/releases/tag/v1.3.10) | End of Life | September 26, 2019 | March 4, 2021 |
| [1.4](https://github.com/containerd/containerd/releases/tag/v1.4.13) | End of Life | August 17, 2020 | March 3, 2022 |
| [1.5](https://github.com/containerd/containerd/releases/tag/v1.5.18) | End of Life | May 3, 2021 | February 28, 2023 |
| [1.6](https://github.com/containerd/containerd/releases/tag/v1.6.36) | LTS | February 15, 2022 | next LTS + 6 months |
| [1.7](https://github.com/containerd/containerd/releases/tag/v1.7.23) | Active | March 10, 2023 | active(May 5, 2025), extended(EOL of 1.6) |
| [2.0](https://github.com/containerd/containerd/releases/tag/v2.0.0) | Active | November 5, 2024 | max(November 5, 2025 or release of 2.1 + 6 months) |
| [2.1](https://github.com/containerd/containerd/milestone/48) | Next | TBD | TBD |
> **_NOTE_** containerd v1.7 will reach end of life at the same time as v1.6 LTS. Due to
> [Minimal Version Selection](https://go.dev/ref/mod#minimal-version-selection) used
> by Go modules, 1.7 must be supported until EOL of all 1.x releases. Once 1.7 is in
> extended support, it will continue to accept security patches in addition to client
> changes relevant for package importers using the 1.6 LTS daemon.
### Kubernetes Support
@ -130,25 +139,25 @@ for the list of actively tested versions. Kubernetes only supports n-3 minor
release versions and containerd will ensure there is always a supported version
of containerd for every supported version of Kubernetes.
| Kubernetes Version | containerd Version | CRI Version |
** Note: containerd v1.6.*, and v1.7.* support CRI v1 and v1alpha2 through EOL as those releases continue to support older versions of k8s, cloud providers, and other clients using CRI v1alpha2. CRI v1alpha2 is deprecated in v1.7 and will be removed in containerd v2.0.
### Backporting
@ -193,6 +202,11 @@ process:
```console
$ git cherry-pick -xsS <commit>
```
If all of the work from a particular PR/set of PRs is wanted,
cherry-pick the individual commits instead of the merge commit.
Take #8624 for example, 82ec62b is favored over 9e834e7.
(Optional) If other commits exist in the main branch which are related
to the cherry-picked commit (e.g., fixes to the main PR), it is recommended
to also cherry-pick those commits into this same `my-backport-branch`.
| Legacy CRI implementation of podsandbox support | containerd v2.0 | containerd v2.0 ✅ | |
| Go-Plugin library (`*.so`) as containerd runtime plugin | containerd v2.0 | containerd v2.1 | Use external plugins (proxy or binary) |
- Pulling Schema 1 images has been disabled in containerd v2.0, but it still can be enabled by setting an environment variable `CONTAINERD_ENABLE_DEPRECATED_PULL_SCHEMA_1_IMAGE=1`
until containerd v2.1. `ctr` users have to specify `--local` too (e.g., `ctr images pull --local`).
### Deprecated config properties
The deprecated properties in [`config.toml`](./docs/cri/config.md) are shown in the following table:
@ -387,15 +457,23 @@ The deprecated properties in [`config.toml`](./docs/cri/config.md) are shown in
| Property Group | Property | Deprecation release | Target release for removal | Recommendation |