The interface that combines both content.InfoProvider and
content.Provider was duplicated in multiple places - create one directly
in `content` package and use it instead.
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
It's to ensure the data integrity during unexpected power failure.
Background:
Since release 1.3, in Linux system, containerD unpacks and writes files into
overlayfs snapshot directly. It doesn’t involve any mount-umount operations
so that the performance of pulling image has been improved.
As we know, the umount syscall for overlayfs will force kernel to flush
all the dirty pages into disk. Without umount syscall, the files’ data relies
on kernel’s writeback threads or filesystem's commit setting (for
instance, ext4 filesystem).
The files in committed snapshot can be loss after unexpected power failure.
However, the snapshot has been committed and the metadata also has been
fsynced. There is data inconsistency between snapshot metadata and files
in that snapshot.
We, containerd, received several issues about data loss after unexpected
power failure.
* https://github.com/containerd/containerd/issues/5854
* https://github.com/containerd/containerd/issues/3369#issuecomment-1787334907
Solution:
* Option 1: SyncFs after unpack
Linux platform provides [syncfs][syncfs] syscall to synchronize just the
filesystem containing a given file.
* Option 2: Fsync directories recursively and fsync on regular file
The fsync doesn't support symlink/block device/char device files. We
need to use fsync the parent directory to ensure that entry is
persisted.
However, based on [xfstest-dev][xfstest-dev], there is no case to ensure
fsync-on-parent can persist the special file's metadata, for example,
uid/gid, access mode.
Checkout [generic/690][generic/690]: Syncing parent dir can persist
symlink. But for f2fs, it needs special mount option. And it doesn't say
that uid/gid can be persisted. All the details are behind the
implemetation.
> NOTE: All the related test cases has `_flakey_drop_and_remount` in
[xfstest-dev].
Based on discussion about [Documenting the crash-recovery guarantees of Linux file systems][kernel-crash-recovery-data-integrity],
we can't rely on Fsync-on-parent.
* Option 1 is winner
This patch is using option 1.
There is test result based on [test-tool][test-tool].
All the networking traffic created by pull is local.
* Image: docker.io/library/golang:1.19.4 (992 MiB)
* Current: 5.446738579s
* WIOS=21081, WBytes=1329741824, RIOS=79, RBytes=1197056
* Option 1: 6.239686088s
* WIOS=34804, WBytes=1454845952, RIOS=79, RBytes=1197056
* Option 2: 1m30.510934813s
* WIOS=42143, WBytes=1471397888, RIOS=82, RBytes=1209344
* Image: docker.io/tensorflow/tensorflow:latest (1.78 GiB, ~32590 Inodes)
* Current: 8.852718042s
* WIOS=39417, WBytes=2412818432, RIOS=2673, RBytes=335987712
* Option 1: 9.683387174s
* WIOS=42767, WBytes=2431750144, RIOS=89, RBytes=1238016
* Option 2: 1m54.302103719s
* WIOS=54403, WBytes=2460528640, RIOS=1709, RBytes=208237568
The Option 1 will increase `wios`. So, the `image_pull_with_sync_fs` is
option in CRI plugin.
[syncfs]: <https://man7.org/linux/man-pages/man2/syncfs.2.html>
[xfstest-dev]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git>
[generic/690]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/tests/generic/690?h=v2023.11.19>
[kernel-crash-recovery-data-integrity]: <https://lore.kernel.org/linux-fsdevel/1552418820-18102-1-git-send-email-jaya@cs.utexas.edu/>
[test-tool]: <a17fb2010d/contrib/syncfs/containerd/main_test.go (L51)>
Signed-off-by: Wei Fu <fuweid89@gmail.com>
For backward compatibility, we should get runtimeInfo from sandbox in
db, or get it from the sandbox container in db.
Note that this is a temporary solution and we will remove the Container field in
Sandbox in cri cache, and replace it with a SandboxInsantance of type
containerd.Sandbox interface.
Signed-off-by: Abel Feng <fshb1988@gmail.com>
This commit fixes the dialer function to make sure that
"npipe://" prefix is trimmed, just like the way it is done
in the Unix counterpart, `./dialer_unix.go:50`
This will also unblock some downstream work going on in
buildkit; setting up integration tests to run on Windows.
Signed-off-by: Anthony Nandaa <profnandaa@gmail.com>
Close connection if no more data. It's to fix false alert filed by image
pull progress.
```
dst = OpenWriter (--> Content Store)
src = Fetch
Open (--> Registry)
Mark it as active request
Copy(dst, src) (--> Keep updating total received bytes)
^
| (Active Request > 0, but total received bytes won't be updated)
v
defer src.Close()
content.Commit(dst)
```
Before migrating to transfer service, CRI plugin doesn't limit global
concurrent downloads for ImagePulls. Each ImagePull requests have 3 concurrent
goroutines to download blob and 1 goroutine to unpack blob. Like ext4
filesystem [1][1], the fsync from content.Commit may sync unrelated dirty pages
into disk. The host is running under IO pressure, and then the content.Commit
will take long time and block other goroutines. If httpreadseeker
doesn't close the connection after io.EOF, this connection will be
considered as active. The pull progress reporter reports there is no
bytes transfered and cancels the ImagePull.
The original 1-minute timeout[2][2] is from kubelet settting. Since CRI-plugin
can't limit the total concurrent downloads, this patch is to update 1-minute
to 5-minutes to prevent from unexpected cancel.
[1]: https://lwn.net/Articles/842385/
[2]: https://github.com/kubernetes/kubernetes/blob/release-1.23/pkg/kubelet/config/flags.go#L45-L48
Signed-off-by: Wei Fu <fuweid89@gmail.com>
The new active request is filed and there is no bytes read yet when the
progress reporter just wakes up. If the timeout / 2 is less than the
minPullProgressReportInternal, it's easy to file false alert.
We should remove the minPullProgressReportInternal limit.
Fixes: #8024
Signed-off-by: Wei Fu <fuweid89@gmail.com>
Adds debug message per layer unpacking and adds duration field to
the existing image unpacking debug message.
Signed-off-by: Austin Vazquez <macedonv@amazon.com>
To break the cyclic dependency of cri plugin and podsandbox plugin,
we define a new plugin type of SandboxesServicePlugin and when cri init
it's own client, it will add the all the controllers by get them from
the SandboxesServicePlugin.
when podsandbox controller init it's client, it will not Require the
SandboxesServicePlugin.
Signed-off-by: Abel Feng <fshb1988@gmail.com>
This is effectively a revert of 2ac9968401, which
switched from os/exec to the golang.org/x/sys/execabs package to mitigate
security issues (mainly on Windows) with lookups resolving to binaries in the
current directory.
from the go1.19 release notes https://go.dev/doc/go1.19#os-exec-path
> ## PATH lookups
>
> Command and LookPath no longer allow results from a PATH search to be found
> relative to the current directory. This removes a common source of security
> problems but may also break existing programs that depend on using, say,
> exec.Command("prog") to run a binary named prog (or, on Windows, prog.exe) in
> the current directory. See the os/exec package documentation for information
> about how best to update such programs.
>
> On Windows, Command and LookPath now respect the NoDefaultCurrentDirectoryInExePath
> environment variable, making it possible to disable the default implicit search
> of “.” in PATH lookups on Windows systems.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The PR https://github.com/containerd/containerd/pull/8198 fixed this for CRI but missed clearing the commandline in the forked SB server. This simply adds that back in
Signed-off-by: James Sturtevant <jsturtevant@gmail.com>
This package enumerates the known deprecations in the current version of
containerd. New deprecations should be added here, and old ones
removed.
Signed-off-by: Samuel Karp <samuelkarp@google.com>
Avoid calling out to the client to get a sandbox controller and instead
setup the list of controllers on initialization. This fixes a test
failure which does not set the client.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Enhance cri/server/image/imagefs_info.go:ImageFsInfo() to support
snapshotter per runtime. Now `ImageFsInfoResponse.ImageFilesystems` may
contain multiple entries.
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
When call sandbox controller to create sandbox, we change the param from
sandbox id to total sandbox object to git all information to controller,
so that sandbox controller do not rely on the sandbox store anymore,
this is more decouple for the sandbox controller plugin inside
containerd, and it is neccesary for remote sandbox controller plugins as
it is not able to get sandbox from the sandbox store anymore.
Signed-off-by: Abel Feng <fshb1988@gmail.com>
make containerd extensible to support more sandbox controllers
registered into containerd by config.
we change the default sandbox controller plugin's name from "local" to "shim".
to make sure we can get the controller by the plugin name it registered into
containerd.
Signed-off-by: Abel Feng <fshb1988@gmail.com>
Before snapshotter per runtime, CRI only supports a global snapshotter.
So a snapshot can be uniquely identified by `snapshot_key`. With snapshotter
per runtime enabled, there may be multiple snapshotters used by CRI. So only
(snapshotter_id, snapshot_key) can uniquely identify a snapshot.
Also extends CRI/store/snapshot/Store to support multiple snapshotters.
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Remove containerd specific parts of the plugin package to prepare its
move out of the main repository. Separate the plugin registration
singleton into a separate package.
Separating out the plugin package and registration makes it easier to
implement external plugins without creating a dependency loop.
Signed-off-by: Derek McGowan <derek@mcg.dev>
The plugins packages defines the plugins used by containerd.
Move all the types and properties to this package.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Pass the passed in context into some nested function calls, wrap
errors instead of %+v, and change some tests to strictly just test
for an error and not an exact error.
Signed-off-by: Danny Canter <danny@dcantah.dev>
`NewCRIService()` may easily fail and its error has to be ignored
unless the CRI plugin is in the `required_plugins` list.
Now this has to be called before `RegisterReadiness()`, as
PR 9153 "Require plugins to succeed after registering readiness"
was merged on 2023-09-29.
Fix issue 9163: `[Regression in main (2023-09-29)]: containerd-rootless.sh doesn't start up`
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
When a shim process is unexpectedly killed in a way that was not initiated through containerd - containerd reports the pod as not ready but the containers as running. This results in kubelet repeatedly sending container kill requests that fail since containerd cannot connect to the shim.
Changes:
- In the container exit handler, treat `err: Unavailable` as if the container has already exited out
- When attempting to get a connection to the shim, if the controller isn't available assume that the shim has been killed (needs to be done since we have a separate exit handler that cleans up the reference to the shim controller - before kubelet has the chance to call StopPodSandbox)
Signed-off-by: Aditya Ramani <a_ramani@apple.com>
runc, as mandated by the runtime-spec, ignores unknown fields in the
config.json. This is unfortunate for cases where we _must_ enable that
feature or fail.
For example, if we want to start a container with user namespaces and
volumes, using the uidMappings/gidMappings field is needed so the
UID/GIDs in the volume don't end up with garbage. However, if we don't
fail when runc will ignore these fields (because they are unknown to
runc), we will just start a container without using the mappings and the
UID/GIDs the container will persist to volumes the hostUID/GID, that can
change if the container is re-scheduled by Kubernetes.
This will end up in volumes having "garbage" and unmapped UIDs that the
container can no longer change. So, let's avoid this entirely by just
checking that runc supports idmap mounts if the container we are about
to create needs them.
Please note that the "runc features" subcommand is only run when we are
using idmap mounts. If idmap mounts are not used, the subcommand is not
run and therefore this should not affect containers that don't use idmap
mounts in any way.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
These utilities resulted in the platforms package to have the containerd
API as dependency. As this package is used in many parts of the code, as
well as external consumers, we should try to keep it light on dependencies,
with the potential to make it a standalone module.
These utilities were added in f3b7436b61,
which has not yet been included in a release, so skipping deprecation
and aliases for these.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
When the kubelet sends the uid/gid mappings for a mount, just pass them
down to the OCI runtime.
OCI runtimes support this since runc 1.2 and crun 1.8.1.
And whenever we add mounts (container mounts or image spec volumes) and
userns are requested by the kubelet, we use those mappings in the mounts
so the mounts are idmapped correctly. If no userns is used, we don't
send any mappings which just keeps the current behavior.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
It says: The prefix path **must be absolute, have all symlinks resolved, and cleaned**. But those requirements are violated in lots of places.
What happens when it is given a non-canonicalized path is that `mountinfo.GetMounts` will not find mounts.
The trivial case is:
```
$ mkdir a && ln -s a b && mkdir b/c b/d && mount --bind b/c b/d && cat /proc/mounts | grep -- '[ab]/d'
/dev/sdd3 /home/user/a/d ext4 rw,noatime,discard 0 0
```
We asked to bind-mount b/c to b/d, but ended up with mount in a/d.
So, mount table always contains canonicalized mount points, and it is an error to look for non-canonicalized paths in it.
Signed-off-by: Marat Radchenko <marat@slonopotamus.org>
The ip.To16() function returns non-nil if `ip` is any kind
of IP address, including IPv4. To look for IPv6 specifically,
use ip.To4() == nil.
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
This brings over the enhancement from a506630e57.
We don't expect the systemd state to change while containerd is running,
so we can use a `sync.Once` for this, to prevent stat'ing each time.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
runc considers libcontainer to be "unstable" (not for external use),
so we try not to use it. Commit ed47d6ba76
brought back the dependency on other parts of libcontainer, but looks to
be only depending on a single utility, which in itself was borrowed from
github.com/coreos/go-systemd to not introduce CGO code in the same package.
This patch copies the version from github.com/coreos/go-systemd (adding
proper attribution, although the function is pretty trivial).
runc is in process of moving the libcontainer/user package to an external
module, which means we can remove the dependency on libcontainer entirely
in the near future. There is one more use of `libcontainer` in our vendor
tree; it looks like CDI is depending on one utility (devices.DeviceFromPath);
a943033a8b/vendor/github.com/container-orchestrated-devices/container-device-interface/pkg/cdi/container-edits_unix.go (L38)
We should remove the dependency on that utility, and add a CI check to
prevent bringing it back.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The failed to recover state message didn't include the ID making this
not as useful as it could be..
This additionally moves some of the other logs to include the id for
the sandbox/container as a field instead of part of a format string.
Signed-off-by: Danny Canter <danny@dcantah.dev>
The reference/docker package was a fork of github.com/distribution/distribution,
which could not easily be used as a direct dependency, as it brought many other
dependencies with it.
The "reference' package has now moved to a separate repository, which means
we can replace the local fork, and use the upstream implementation again.
The new module was extracted from the distribution repository at commit:
b9b19409cf
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
- For remote snapshotters, the unpack phase serves as an important step for
preparing the remote snapshot. With the missing unpacker.Wait, the
snapshotter `Prepare` context is always canceled.
- This patch allows remote snapshotter based archives to be imported via
the transfer service or `ctr image import`
Signed-off-by: Edgar Lee <edgarhinshunlee@gmail.com>
This is a partial revert of "cri/sbserver: Use platform instead of GOOS
for userns detection".
While what that commit did is 100% the right thing to do, when the
sandbox_mode is "shim" all controller.XXX() calls are RPCs and the
controller.Create() call initializes the controller. Therefore, things
like "getSandboxController()" don't work in the case of "shim"
sandbox_mode until after the controller.Create().
Due to this asymmetry and the lack of tests for shim mode, we didn't
catch it before.
This patch just reverts that commit so that the Create() and
getSandboxController() calls remain where they were, and just relies on
the config Linux section as a hack to detect if the pod sandbox will use
user namespaces or not.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
cgroupv1HasHugetlb() and cgroupv2HasHugetlb() may return errors, but nobody
(there's just one call site anyways) ever cares. So drop the unnecessary code.
Signed-off-by: Enrico Weigelt, metux IT consult <info@metux.net>
- Fill OSVersion field of ocispec.Platform for windows OS in
transfer service plugin init()
- Do not return error from transfer service ReceiveStream if
stream.Recv() returned context.Canceled error
Signed-off-by: Kirtana Ashok <kiashok@microsoft.com>
In the sbserver we should not use the GOOS, as windows hosts can run
linux containers. On the sbserver we should use the platform param.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Runc 1.1 throws a warning when using rel destination paths, and runc 1.2
is planning to thow an error (i.e. won't start the container).
Let's just make this an abs path in the only place it might not be: the
mounts created due to `VOLUME` directives in the Dockerfile.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
The rpc only reports one field, i.e. the cgroup driver, to kubelet.
Containerd determines the effective cgroup driver by looking at all
runtime handlers, starting from the default runtime handler (the rest in
alphabetical order), and returning the cgroup driver setting of the
first runtime handler that supports one. If no runtime handler supports
cgroup driver (i.e. has a config option for it) containerd falls back to
auto-detection, returning systemd if systemd is running and cgroupfs
otherwise.
This patch implements the CRI server side of Kubernetes KEP-4033:
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4033-group-driver-detection-over-cri
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
If kubelet passes the swap limit (default memory limit = swap limit ),
it is configured for container irrespective if the node supports swap.
Signed-off-by: Qasim Sarfraz <qasimsarfraz@microsoft.com>
These are not actually being pulled, just removing the deprecated k8s.gcr.io
from the code-base. While at it, also renamed / removed vars that shadowed
with package-level definitions
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
When the pods are transitioning there are several
cases where containers might not be in valid state.
There were several cases where the stats where
failing hard but we should just continue on as
they are transient and will be picked up again
when kubelet queries for the stats again.
Signed-off-by: James Sturtevant <jstur@microsoft.com>
Signed-off-by: Mark Rossetti <marosset@microsoft.com>
This commit just updates the sbserver with the same fix we did on main:
9bf5aeca77 ("cri: Fix net.ipv4.ping_group_range with userns ")
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
This is a port of 31a6449734 ("Add capability for snapshotters to
declare support for UID remapping") to sbserver.
This patch remaps the rootfs in the platform-specific if user namespaces
are in use, so the pod can read/write to the rootfs.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
This patch requests the OCI runtime to create a userns when the CRI
message includes such request.
This is an adaptation of a7adeb6976 ("cri: Support pods with user
namespaces") to sbserver, although the container_create.go parts were
already ported as part of 40be96efa9 ("Have separate spec builder for
each platform"),
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
This commit just ports 36f520dc04 ("Let OCI runtime create netns when
userns is used") to sbserver.
The CNI network setup is done after OCI start, as it didn't seem simple
to get the sandbox PID we need for the netns otherwise.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Currently there is a big c&p of the helpers between these two folders
and a TODO in the platform agnostic file to organize them in the future,
when some other things settle.
So, let's just copy them for now.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Commit c085fac1e5 ("Move sandbox start behind controller") moved the
runtimeStart to only account for time _after_ the netns has been
created.
To match what we currently do in cri/server, let's move it to just after
the get the sandbox runtime.
This come up when porting userns to sbserver, as the CNI network setup
needs to be done at a later stage and runtimeStart was accounting for
the CNI network setup time only when userns is enabled.
To avoid that discrepancy, let's just move it earlier, that also matches
what we do in cri/server.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Beside the "in future the when" typo, we take the chance to reflect that
user namespaces are already merged.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
These two errors can occur in the following scenarios:
ECONNRESET: the target process reset connection between CRI and itself.
see: #111825 for detail
EPIPE: the target process did not read the received data, causing the
buffer in the kernel to be full, resulting in the occurrence of Zero Window,
then closing the connection (FIN, RESET)
see: #74551 for detail
In both cases, we should RESET the httpStream.
Signed-off-by: wangxiang <scottwangsxll@gmail.com>
userns.RunningInUserNS() checks if the code calling that function is
running inside a user namespace. But we need to check if the container
we will create will use a user namespace, in that case we need to
disable the sysctl too (or we would need to take the userns mapping into
account to set the IDs).
This was added in PR:
https://github.com/containerd/containerd/pull/6170/
And the param documentation says it is not enabled when user namespaces
are in use:
https://github.com/containerd/containerd/pull/6170/files#diff-91d0a4c61f6d3523b5a19717d1b40b5fffd7e392d8fe22aed7c905fe195b8902R118
I'm not sure if the intention was to disable this if containerd is
running inside a userns (rootless, if that is even supported) or just
when the pod has user namespaces.
Out of an abundance of caution, I'm keeping the userns.RunningInUserNS()
so it is still not used if containerd runs inside a user namespace.
With this patch and "enable_unprivileged_icmp = true" in the config,
running containerd as root on the host, pods with user namespaces start
just fine. Without this patch they fail with:
... failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: w
/proc/sys/net/ipv4/ping_group_range: invalid argument: unknown
Thanks a lot to Andy on the k8s slack for reporting the issue. He also
mentions he hits this with k3s on a default installation (the param
is off by default on containerd, but k3s turns that on by default it
seems). He also debugged which part of the stack was setting that
sysctl, found the PR that added this code in containerd and a workaround
(to turn the bool off).
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Helpers to convert from a slice of platforms to our protobuf representation
and vice-versa appear a couple times. It seems sane to just expose this facility
in the platforms pkg.
Signed-off-by: Danny Canter <danny@dcantah.dev>
This introduces a ParseSourceDateEpoch function, which can be used
to parse "SOURCE_DATE_EPOCH" values for situations where those
values are not passed through an env-var (or the env-var has been
read through other means).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
These tests were failing on my macOS; could be the precision issue (like on
Windows), or just because they're "too fast".
=== RUN TestSourceDateEpoch/WithoutSourceDateEpoch
epoch_test.go:51:
Error Trace: /Users/thajeztah/go/src/github.com/containerd/containerd/pkg/epoch/epoch_test.go:51
Error: Should be true
Test: TestSourceDateEpoch/WithoutSourceDateEpoch
Messages: now: 2023-06-23 11:47:09.93118 +0000 UTC, v: 2023-06-23 11:47:09.93118 +0000 UTC
This patch:
- updates the rightAfter utility to allow the timestamps to be "equal"
- updates the asserts to provide some details about the timestamps
- uses UTC for the value we're comparing to, to match the timestamps
that are generated.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
As a follow up change to adding a SandboxMetrics rpc to the core
sandbox service, the controller needed a corresponding rpc for CRI
and others to eventually implement.
This leaves the CRI (non-shim mode) controller unimplemented just to
have a change with the API addition to start.
Signed-off-by: Danny Canter <danny@dcantah.dev>
When a container is just created, exited state the container will not have stats. A common case for this in k8s is the init containers for a pod. The will be present in the listed containers but will not have a running task and there for no stats.
Signed-off-by: James Sturtevant <jstur@microsoft.com>