Commit Graph

10 Commits

Author SHA1 Message Date
Wei Fu
23278c81fb *: introduce image_pull_with_sync_fs in CRI
It's to ensure the data integrity during unexpected power failure.

Background:

Since release 1.3, in Linux system, containerD unpacks and writes files into
overlayfs snapshot directly. It doesn’t involve any mount-umount operations
so that the performance of pulling image has been improved.

As we know, the umount syscall for overlayfs will force kernel to flush
all the dirty pages into disk. Without umount syscall, the files’ data relies
on kernel’s writeback threads or filesystem's commit setting (for
instance, ext4 filesystem).

The files in committed snapshot can be loss after unexpected power failure.
However, the snapshot has been committed and the metadata also has been
fsynced. There is data inconsistency between snapshot metadata and files
in that snapshot.

We, containerd, received several issues about data loss after unexpected
power failure.

* https://github.com/containerd/containerd/issues/5854
* https://github.com/containerd/containerd/issues/3369#issuecomment-1787334907

Solution:

* Option 1: SyncFs after unpack

Linux platform provides [syncfs][syncfs] syscall to synchronize just the
filesystem containing a given file.

* Option 2: Fsync directories recursively and fsync on regular file

The fsync doesn't support symlink/block device/char device files. We
need to use fsync the parent directory to ensure that entry is
persisted.

However, based on [xfstest-dev][xfstest-dev], there is no case to ensure
fsync-on-parent can persist the special file's metadata, for example,
uid/gid, access mode.

Checkout [generic/690][generic/690]: Syncing parent dir can persist
symlink. But for f2fs, it needs special mount option. And it doesn't say
that uid/gid can be persisted. All the details are behind the
implemetation.

> NOTE: All the related test cases has `_flakey_drop_and_remount` in
[xfstest-dev].

Based on discussion about [Documenting the crash-recovery guarantees of Linux file systems][kernel-crash-recovery-data-integrity],
we can't rely on Fsync-on-parent.

* Option 1 is winner

This patch is using option 1.

There is test result based on [test-tool][test-tool].
All the networking traffic created by pull is local.

  * Image: docker.io/library/golang:1.19.4 (992 MiB)
    * Current: 5.446738579s
      * WIOS=21081, WBytes=1329741824, RIOS=79, RBytes=1197056
    * Option 1: 6.239686088s
      * WIOS=34804, WBytes=1454845952, RIOS=79, RBytes=1197056
    * Option 2: 1m30.510934813s
      * WIOS=42143, WBytes=1471397888, RIOS=82, RBytes=1209344

  * Image: docker.io/tensorflow/tensorflow:latest (1.78 GiB, ~32590 Inodes)
    * Current: 8.852718042s
      * WIOS=39417, WBytes=2412818432, RIOS=2673, RBytes=335987712
    * Option 1: 9.683387174s
      * WIOS=42767, WBytes=2431750144, RIOS=89, RBytes=1238016
    * Option 2: 1m54.302103719s
      * WIOS=54403, WBytes=2460528640, RIOS=1709, RBytes=208237568

The Option 1 will increase `wios`. So, the `image_pull_with_sync_fs` is
option in CRI plugin.

[syncfs]: <https://man7.org/linux/man-pages/man2/syncfs.2.html>
[xfstest-dev]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git>
[generic/690]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/tests/generic/690?h=v2023.11.19>
[kernel-crash-recovery-data-integrity]: <https://lore.kernel.org/linux-fsdevel/1552418820-18102-1-git-send-email-jaya@cs.utexas.edu/>
[test-tool]: <a17fb2010d/contrib/syncfs/containerd/main_test.go (L51)>

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-12-12 10:18:39 +08:00
Derek McGowan
5fdf55e493
Update go module to github.com/containerd/containerd/v2
Signed-off-by: Derek McGowan <derek@mcg.dev>
2023-10-29 20:52:21 -07:00
Akihiro Suda
86fc1ccab4
Remove aufs snapshotter (deprecated since v1.5)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2023-03-14 14:37:13 +09:00
haoyun
bbe46b8c43 feat: replace github.com/pkg/errors to errors
Signed-off-by: haoyun <yun.hao@daocloud.io>
Co-authored-by: zounengren <zouyee1989@gmail.com>
2022-01-07 10:27:03 +08:00
Derek McGowan
6f027e38a8
Remove redundant build tags
Remove build tags which are already implied by the name of the file.
Ensures build tags are used consistently

Signed-off-by: Derek McGowan <derek@mcg.dev>
2021-08-05 22:27:46 -07:00
Sebastiaan van Stijn
708299ca40
Move RunningInUserNS() to its own package
This allows using the utility without bringing whole of "sys" with it.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-03-23 11:29:53 +01:00
Sebastiaan van Stijn
6a9b94927f
Remove remaining uses of libcontainer/system package
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-05-04 18:56:57 +02:00
Akihiro Suda
c224edc5c6 apply: use naive applier when running in UserNS
`OverlayConvertWhiteout` calls `mknod c 0 0` which is not allowed when
running in a user namespace, even in Ubuntu kernel.

Although there is an alternative hacky way to create whiteouts without
calling mknod as Moby `overlay2` actually does(see #3762), let's use
naive applier when running in UserNS and call it a day.

Close #3762

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-24 13:33:09 +09:00
Derek McGowan
bcc4a146e4
Support applying with parent directories
Signed-off-by: Derek McGowan <derek@mcgstyle.net>
2019-08-16 15:49:47 -07:00
Lantao Liu
81386df917
Add direct unpack support for overlay and aufs
Signed-off-by: Lantao Liu <lantaol@google.com>
2019-08-15 12:09:32 -07:00