It's to ensure the data integrity during unexpected power failure.
Background:
Since release 1.3, in Linux system, containerD unpacks and writes files into
overlayfs snapshot directly. It doesn’t involve any mount-umount operations
so that the performance of pulling image has been improved.
As we know, the umount syscall for overlayfs will force kernel to flush
all the dirty pages into disk. Without umount syscall, the files’ data relies
on kernel’s writeback threads or filesystem's commit setting (for
instance, ext4 filesystem).
The files in committed snapshot can be loss after unexpected power failure.
However, the snapshot has been committed and the metadata also has been
fsynced. There is data inconsistency between snapshot metadata and files
in that snapshot.
We, containerd, received several issues about data loss after unexpected
power failure.
* https://github.com/containerd/containerd/issues/5854
* https://github.com/containerd/containerd/issues/3369#issuecomment-1787334907
Solution:
* Option 1: SyncFs after unpack
Linux platform provides [syncfs][syncfs] syscall to synchronize just the
filesystem containing a given file.
* Option 2: Fsync directories recursively and fsync on regular file
The fsync doesn't support symlink/block device/char device files. We
need to use fsync the parent directory to ensure that entry is
persisted.
However, based on [xfstest-dev][xfstest-dev], there is no case to ensure
fsync-on-parent can persist the special file's metadata, for example,
uid/gid, access mode.
Checkout [generic/690][generic/690]: Syncing parent dir can persist
symlink. But for f2fs, it needs special mount option. And it doesn't say
that uid/gid can be persisted. All the details are behind the
implemetation.
> NOTE: All the related test cases has `_flakey_drop_and_remount` in
[xfstest-dev].
Based on discussion about [Documenting the crash-recovery guarantees of Linux file systems][kernel-crash-recovery-data-integrity],
we can't rely on Fsync-on-parent.
* Option 1 is winner
This patch is using option 1.
There is test result based on [test-tool][test-tool].
All the networking traffic created by pull is local.
* Image: docker.io/library/golang:1.19.4 (992 MiB)
* Current: 5.446738579s
* WIOS=21081, WBytes=1329741824, RIOS=79, RBytes=1197056
* Option 1: 6.239686088s
* WIOS=34804, WBytes=1454845952, RIOS=79, RBytes=1197056
* Option 2: 1m30.510934813s
* WIOS=42143, WBytes=1471397888, RIOS=82, RBytes=1209344
* Image: docker.io/tensorflow/tensorflow:latest (1.78 GiB, ~32590 Inodes)
* Current: 8.852718042s
* WIOS=39417, WBytes=2412818432, RIOS=2673, RBytes=335987712
* Option 1: 9.683387174s
* WIOS=42767, WBytes=2431750144, RIOS=89, RBytes=1238016
* Option 2: 1m54.302103719s
* WIOS=54403, WBytes=2460528640, RIOS=1709, RBytes=208237568
The Option 1 will increase `wios`. So, the `image_pull_with_sync_fs` is
option in CRI plugin.
[syncfs]: <https://man7.org/linux/man-pages/man2/syncfs.2.html>
[xfstest-dev]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git>
[generic/690]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/tests/generic/690?h=v2023.11.19>
[kernel-crash-recovery-data-integrity]: <https://lore.kernel.org/linux-fsdevel/1552418820-18102-1-git-send-email-jaya@cs.utexas.edu/>
[test-tool]: <a17fb2010d/contrib/syncfs/containerd/main_test.go (L51)>
Signed-off-by: Wei Fu <fuweid89@gmail.com>
For backward compatibility, we should get runtimeInfo from sandbox in
db, or get it from the sandbox container in db.
Note that this is a temporary solution and we will remove the Container field in
Sandbox in cri cache, and replace it with a SandboxInsantance of type
containerd.Sandbox interface.
Signed-off-by: Abel Feng <fshb1988@gmail.com>
Close connection if no more data. It's to fix false alert filed by image
pull progress.
```
dst = OpenWriter (--> Content Store)
src = Fetch
Open (--> Registry)
Mark it as active request
Copy(dst, src) (--> Keep updating total received bytes)
^
| (Active Request > 0, but total received bytes won't be updated)
v
defer src.Close()
content.Commit(dst)
```
Before migrating to transfer service, CRI plugin doesn't limit global
concurrent downloads for ImagePulls. Each ImagePull requests have 3 concurrent
goroutines to download blob and 1 goroutine to unpack blob. Like ext4
filesystem [1][1], the fsync from content.Commit may sync unrelated dirty pages
into disk. The host is running under IO pressure, and then the content.Commit
will take long time and block other goroutines. If httpreadseeker
doesn't close the connection after io.EOF, this connection will be
considered as active. The pull progress reporter reports there is no
bytes transfered and cancels the ImagePull.
The original 1-minute timeout[2][2] is from kubelet settting. Since CRI-plugin
can't limit the total concurrent downloads, this patch is to update 1-minute
to 5-minutes to prevent from unexpected cancel.
[1]: https://lwn.net/Articles/842385/
[2]: https://github.com/kubernetes/kubernetes/blob/release-1.23/pkg/kubelet/config/flags.go#L45-L48
Signed-off-by: Wei Fu <fuweid89@gmail.com>
The new active request is filed and there is no bytes read yet when the
progress reporter just wakes up. If the timeout / 2 is less than the
minPullProgressReportInternal, it's easy to file false alert.
We should remove the minPullProgressReportInternal limit.
Fixes: #8024
Signed-off-by: Wei Fu <fuweid89@gmail.com>
To break the cyclic dependency of cri plugin and podsandbox plugin,
we define a new plugin type of SandboxesServicePlugin and when cri init
it's own client, it will add the all the controllers by get them from
the SandboxesServicePlugin.
when podsandbox controller init it's client, it will not Require the
SandboxesServicePlugin.
Signed-off-by: Abel Feng <fshb1988@gmail.com>
The PR https://github.com/containerd/containerd/pull/8198 fixed this for CRI but missed clearing the commandline in the forked SB server. This simply adds that back in
Signed-off-by: James Sturtevant <jsturtevant@gmail.com>
Avoid calling out to the client to get a sandbox controller and instead
setup the list of controllers on initialization. This fixes a test
failure which does not set the client.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Enhance cri/server/image/imagefs_info.go:ImageFsInfo() to support
snapshotter per runtime. Now `ImageFsInfoResponse.ImageFilesystems` may
contain multiple entries.
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
When call sandbox controller to create sandbox, we change the param from
sandbox id to total sandbox object to git all information to controller,
so that sandbox controller do not rely on the sandbox store anymore,
this is more decouple for the sandbox controller plugin inside
containerd, and it is neccesary for remote sandbox controller plugins as
it is not able to get sandbox from the sandbox store anymore.
Signed-off-by: Abel Feng <fshb1988@gmail.com>
make containerd extensible to support more sandbox controllers
registered into containerd by config.
we change the default sandbox controller plugin's name from "local" to "shim".
to make sure we can get the controller by the plugin name it registered into
containerd.
Signed-off-by: Abel Feng <fshb1988@gmail.com>
Before snapshotter per runtime, CRI only supports a global snapshotter.
So a snapshot can be uniquely identified by `snapshot_key`. With snapshotter
per runtime enabled, there may be multiple snapshotters used by CRI. So only
(snapshotter_id, snapshot_key) can uniquely identify a snapshot.
Also extends CRI/store/snapshot/Store to support multiple snapshotters.
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
The plugins packages defines the plugins used by containerd.
Move all the types and properties to this package.
Signed-off-by: Derek McGowan <derek@mcg.dev>