containerd/pkg
Wei Fu 23278c81fb *: introduce image_pull_with_sync_fs in CRI
This ensures data integrity across an unexpected power failure.

Background:

Since release 1.3, on Linux, containerd unpacks and writes files directly
into the overlayfs snapshot. This avoids any mount/umount operations, which
improves image pull performance.

As we know, the umount syscall for overlayfs forces the kernel to flush
all dirty pages to disk. Without the umount syscall, persisting the files'
data relies on the kernel's writeback threads or the filesystem's commit
settings (for instance, on ext4).

Files in a committed snapshot can be lost after an unexpected power failure,
even though the snapshot has been committed and its metadata has been
fsynced. This leaves the snapshot metadata inconsistent with the files
in that snapshot.

We, containerd, have received several reports of data loss after unexpected
power failure.

* https://github.com/containerd/containerd/issues/5854
* https://github.com/containerd/containerd/issues/3369#issuecomment-1787334907

Solution:

* Option 1: SyncFs after unpack

Linux provides the [syncfs][syncfs] syscall to synchronize just the
filesystem containing a given file.

* Option 2: Fsync directories recursively and fsync on regular file

fsync doesn't support symlinks or block/char device files. We would
need to fsync the parent directory to ensure that the entry is
persisted.

However, based on [xfstest-dev][xfstest-dev], there is no test case that
ensures fsync-on-parent persists the special file's metadata, for example,
uid/gid and access mode.

Check out [generic/690][generic/690]: syncing the parent dir can persist a
symlink. But f2fs needs a special mount option for that, and the test doesn't
say that uid/gid can be persisted. All the details are hidden in the
implementation.

> NOTE: All the related test cases have `_flakey_drop_and_remount` in
[xfstest-dev].

Based on the discussion about [Documenting the crash-recovery guarantees of Linux file systems][kernel-crash-recovery-data-integrity],
we can't rely on fsync-on-parent.

* Option 1 is the winner

This patch uses Option 1.

The following test results were collected with [test-tool][test-tool].
All network traffic generated by the pull is local.

  * Image: docker.io/library/golang:1.19.4 (992 MiB)
    * Current: 5.446738579s
      * WIOS=21081, WBytes=1329741824, RIOS=79, RBytes=1197056
    * Option 1: 6.239686088s
      * WIOS=34804, WBytes=1454845952, RIOS=79, RBytes=1197056
    * Option 2: 1m30.510934813s
      * WIOS=42143, WBytes=1471397888, RIOS=82, RBytes=1209344

  * Image: docker.io/tensorflow/tensorflow:latest (1.78 GiB, ~32590 Inodes)
    * Current: 8.852718042s
      * WIOS=39417, WBytes=2412818432, RIOS=2673, RBytes=335987712
    * Option 1: 9.683387174s
      * WIOS=42767, WBytes=2431750144, RIOS=89, RBytes=1238016
    * Option 2: 1m54.302103719s
      * WIOS=54403, WBytes=2460528640, RIOS=1709, RBytes=208237568

Option 1 increases `wios`, so `image_pull_with_sync_fs` is provided as an
option in the CRI plugin.

[syncfs]: <https://man7.org/linux/man-pages/man2/syncfs.2.html>
[xfstest-dev]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git>
[generic/690]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/tests/generic/690?h=v2023.11.19>
[kernel-crash-recovery-data-integrity]: <https://lore.kernel.org/linux-fsdevel/1552418820-18102-1-git-send-email-jaya@cs.utexas.edu/>
[test-tool]: <a17fb2010d/contrib/syncfs/containerd/main_test.go (L51)>

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-12-12 10:18:39 +08:00
..
apparmor pkg/apparmor: clarify Godoc 2023-02-10 10:23:59 -07:00
atomicfile atomicfile: new package for atomic file writes 2023-06-02 16:56:33 -07:00
blockio Move to use github.com/containerd/log 2023-09-22 07:53:23 -07:00
cap lint: remove //nolint:dupword that are no longer needed 2023-02-16 03:50:23 +09:00
cleanup Add cleanup package for context management during cleanup 2023-01-03 12:30:26 -08:00
cri *: introduce image_pull_with_sync_fs in CRI 2023-12-12 10:18:39 +08:00
deprecation cri: add deprecation warning for configs 2023-11-02 11:17:32 -07:00
dialer fix(pkg/dialer): minor fix on dialer function for windows 2023-11-22 04:25:11 -08:00
display Update go module to github.com/containerd/containerd/v2 2023-10-29 20:52:21 -07:00
epoch Refactor: Removing inherently flaky and unused SourceDateEpochOrNow function. 2023-09-17 08:34:26 -07:00
failpoint Fix some typos 2023-05-16 10:12:50 +08:00
hasher digest: use github.com/minio/sha256-simd 2022-12-08 18:50:00 +09:00
imageverifier Update go module to github.com/containerd/containerd/v2 2023-10-29 20:52:21 -07:00
ioutil Run gofmt 1.19 2022-08-04 18:18:33 -07:00
kmutex Update go module to github.com/containerd/containerd/v2 2023-10-29 20:52:21 -07:00
netns Update go module to github.com/containerd/containerd/v2 2023-10-29 20:52:21 -07:00
nri Switch to github.com/containerd/plugin 2023-11-01 23:01:42 -07:00
oom Update go module to github.com/containerd/containerd/v2 2023-10-29 20:52:21 -07:00
os Update go module to github.com/containerd/containerd/v2 2023-10-29 20:52:21 -07:00
progress update golangci-lint to v1.49.0 2022-10-12 14:41:01 +02:00
randutil Stop using math/rand.Read and rand.Seed (deprecated in Go 1.20) 2023-02-16 03:50:23 +09:00
rdt Move to use github.com/containerd/log 2023-09-22 07:53:23 -07:00
registrar feat: replace github.com/pkg/errors to errors 2022-01-07 10:27:03 +08:00
runtimeoptions/v1 Update go module to github.com/containerd/containerd/v2 2023-10-29 20:52:21 -07:00
schedcore add runc shim support for sched core 2021-10-08 16:18:09 +00:00
seccomp chore: use go fix to cleanup old +build buildtag 2022-12-29 14:25:14 +08:00
seed Stop using math/rand.Read and rand.Seed (deprecated in Go 1.20) 2023-02-16 03:50:23 +09:00
seutil seutil: Fix setting the "container_kvm_t" label 2021-12-14 00:09:17 +01:00
shutdown Expose Done and Err in Shutdown service 2022-11-16 22:03:44 -08:00
snapshotters Update go module to github.com/containerd/containerd/v2 2023-10-29 20:52:21 -07:00
stdio Add logging binary support when terminal is true 2020-08-25 17:28:29 -07:00
streaming go.mod: github.com/containerd/typeurl/v2 v2.1.0 2023-02-11 23:39:52 +09:00
systemd pkg/systemd: use sync.Once for systemd detection 2023-09-01 12:14:56 +02:00
testutil Update go module to github.com/containerd/containerd/v2 2023-10-29 20:52:21 -07:00
timeout feat: use rwmutex instead 2021-11-16 11:06:40 +08:00
tomlext tomlext.Duration add MarshalText method 2023-11-22 19:28:46 +08:00
transfer Move client to subpackage 2023-11-01 10:37:00 -07:00
truncindex error strings should not be capitalized 2023-02-15 14:30:36 +08:00
ttrpcutil Update go module to github.com/containerd/containerd/v2 2023-10-29 20:52:21 -07:00
unpack Enhance container image unpack client logs 2023-11-15 17:30:53 +00:00
userns chore: use go fix to cleanup old +build buildtag 2022-12-29 14:25:14 +08:00