It's to ensure the data integrity during unexpected power failure.
Background:
Since release 1.3, in Linux system, containerD unpacks and writes files into
overlayfs snapshot directly. It doesn’t involve any mount-umount operations
so that the performance of pulling image has been improved.
As we know, the umount syscall for overlayfs will force kernel to flush
all the dirty pages into disk. Without umount syscall, the files’ data relies
on kernel’s writeback threads or filesystem's commit setting (for
instance, ext4 filesystem).
The files in committed snapshot can be loss after unexpected power failure.
However, the snapshot has been committed and the metadata also has been
fsynced. There is data inconsistency between snapshot metadata and files
in that snapshot.
We, containerd, received several issues about data loss after unexpected
power failure.
* https://github.com/containerd/containerd/issues/5854
* https://github.com/containerd/containerd/issues/3369#issuecomment-1787334907
Solution:
* Option 1: SyncFs after unpack
Linux platform provides [syncfs][syncfs] syscall to synchronize just the
filesystem containing a given file.
* Option 2: Fsync directories recursively and fsync on regular file
The fsync doesn't support symlink/block device/char device files. We
need to use fsync the parent directory to ensure that entry is
persisted.
However, based on [xfstest-dev][xfstest-dev], there is no case to ensure
fsync-on-parent can persist the special file's metadata, for example,
uid/gid, access mode.
Checkout [generic/690][generic/690]: Syncing parent dir can persist
symlink. But for f2fs, it needs special mount option. And it doesn't say
that uid/gid can be persisted. All the details are behind the
implemetation.
> NOTE: All the related test cases has `_flakey_drop_and_remount` in
[xfstest-dev].
Based on discussion about [Documenting the crash-recovery guarantees of Linux file systems][kernel-crash-recovery-data-integrity],
we can't rely on Fsync-on-parent.
* Option 1 is winner
This patch is using option 1.
There is test result based on [test-tool][test-tool].
All the networking traffic created by pull is local.
* Image: docker.io/library/golang:1.19.4 (992 MiB)
* Current: 5.446738579s
* WIOS=21081, WBytes=1329741824, RIOS=79, RBytes=1197056
* Option 1: 6.239686088s
* WIOS=34804, WBytes=1454845952, RIOS=79, RBytes=1197056
* Option 2: 1m30.510934813s
* WIOS=42143, WBytes=1471397888, RIOS=82, RBytes=1209344
* Image: docker.io/tensorflow/tensorflow:latest (1.78 GiB, ~32590 Inodes)
* Current: 8.852718042s
* WIOS=39417, WBytes=2412818432, RIOS=2673, RBytes=335987712
* Option 1: 9.683387174s
* WIOS=42767, WBytes=2431750144, RIOS=89, RBytes=1238016
* Option 2: 1m54.302103719s
* WIOS=54403, WBytes=2460528640, RIOS=1709, RBytes=208237568
The Option 1 will increase `wios`. So, the `image_pull_with_sync_fs` is
option in CRI plugin.
[syncfs]: <https://man7.org/linux/man-pages/man2/syncfs.2.html>
[xfstest-dev]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git>
[generic/690]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/tests/generic/690?h=v2023.11.19>
[kernel-crash-recovery-data-integrity]: <https://lore.kernel.org/linux-fsdevel/1552418820-18102-1-git-send-email-jaya@cs.utexas.edu/>
[test-tool]: <a17fb2010d/contrib/syncfs/containerd/main_test.go (L51)>
Signed-off-by: Wei Fu <fuweid89@gmail.com>
That commit neither helps without a working bind-mount implementation nor is needed when such implementation exists.
Testing shows that containerd can properly download and unpack image using bindfs mounts (see previous commit) even without Darwin-specific applier code.
Signed-off-by: Marat Radchenko <marat@slonopotamus.org>
`rc.r.Read()` may return a negative `int` on an error
when the reader is set to a custom content store implementation
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
When unpacking a TAR archive, containerd preserves file's owner:
https://github.com/containerd/containerd/blob/main/archive/tar.go#L384
In some cases this behavior is not desired. In current implementation we
avoid `Lchown` on Windows. Another case when this should be skipped is
when using native snapshotter on darwin and running as non-root user.
This PR extracts a generic option - `WithNoSameOwner` (same as
`tar --no-same-owner`) to skip `Lchown` when its not required.
Signed-off-by: Maksym Pavlenko <pavlenko.maksym@gmail.com>
The io/ioutil package has been deprecated as of Go 1.16, see
https://golang.org/doc/go1.16#ioutil. This commit replaces the existing
io/ioutil functions with their new definitions in io and os packages.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
Remove build tags which are already implied by the name of the file.
Ensures build tags are used consistently
Signed-off-by: Derek McGowan <derek@mcg.dev>
`OverlayConvertWhiteout` calls `mknod c 0 0` which is not allowed when
running in a user namespace, even in Ubuntu kernel.
Although there is an alternative hacky way to create whiteouts without
calling mknod as Moby `overlay2` actually does(see #3762), let's use
naive applier when running in UserNS and call it a day.
Close#3762
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Extend the Applier interface's Apply method with an optional
options parameter.
For the container image encryption we intend to use the options
parameter to pass image decryption parameters ('dcparameters'),
which are primarily (privatte) keys, in form of a JSON document
under the map key '_dcparameters', and pass them to the Applier's
Apply() method. This helps us to access decryption keys and start
the pipeline with the layer decryption before the layer data are
unzipped and untarred.
Signed-off-by: Stefan Berger <stefanb@linux.ibm.com>
Signed-off-by: Harshal Patil <harshal.patil@in.ibm.com>
megacheck, gosimple and unused has been deprecated and subsumed by
staticcheck. And staticcheck also has been upgraded. we need to update
code for the linter issue.
close: #2945
Signed-off-by: Wei Fu <fuweid89@gmail.com>
This change allows implementations to resolve the location of the actual data
using OCI descriptor fields such as MediaType.
No OCI descriptor field is written to the store.
No change on gRPC API.
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>