It's to ensure the data integrity during unexpected power failure.
Background:
Since release 1.3, in Linux system, containerD unpacks and writes files into
overlayfs snapshot directly. It doesn’t involve any mount-umount operations
so that the performance of pulling image has been improved.
As we know, the umount syscall for overlayfs will force kernel to flush
all the dirty pages into disk. Without umount syscall, the files’ data relies
on kernel’s writeback threads or filesystem's commit setting (for
instance, ext4 filesystem).
The files in committed snapshot can be loss after unexpected power failure.
However, the snapshot has been committed and the metadata also has been
fsynced. There is data inconsistency between snapshot metadata and files
in that snapshot.
We, containerd, received several issues about data loss after unexpected
power failure.
* https://github.com/containerd/containerd/issues/5854
* https://github.com/containerd/containerd/issues/3369#issuecomment-1787334907
Solution:
* Option 1: SyncFs after unpack
Linux platform provides [syncfs][syncfs] syscall to synchronize just the
filesystem containing a given file.
* Option 2: Fsync directories recursively and fsync on regular file
The fsync doesn't support symlink/block device/char device files. We
need to use fsync the parent directory to ensure that entry is
persisted.
However, based on [xfstest-dev][xfstest-dev], there is no case to ensure
fsync-on-parent can persist the special file's metadata, for example,
uid/gid, access mode.
Checkout [generic/690][generic/690]: Syncing parent dir can persist
symlink. But for f2fs, it needs special mount option. And it doesn't say
that uid/gid can be persisted. All the details are behind the
implemetation.
> NOTE: All the related test cases has `_flakey_drop_and_remount` in
[xfstest-dev].
Based on discussion about [Documenting the crash-recovery guarantees of Linux file systems][kernel-crash-recovery-data-integrity],
we can't rely on Fsync-on-parent.
* Option 1 is winner
This patch is using option 1.
There is test result based on [test-tool][test-tool].
All the networking traffic created by pull is local.
* Image: docker.io/library/golang:1.19.4 (992 MiB)
* Current: 5.446738579s
* WIOS=21081, WBytes=1329741824, RIOS=79, RBytes=1197056
* Option 1: 6.239686088s
* WIOS=34804, WBytes=1454845952, RIOS=79, RBytes=1197056
* Option 2: 1m30.510934813s
* WIOS=42143, WBytes=1471397888, RIOS=82, RBytes=1209344
* Image: docker.io/tensorflow/tensorflow:latest (1.78 GiB, ~32590 Inodes)
* Current: 8.852718042s
* WIOS=39417, WBytes=2412818432, RIOS=2673, RBytes=335987712
* Option 1: 9.683387174s
* WIOS=42767, WBytes=2431750144, RIOS=89, RBytes=1238016
* Option 2: 1m54.302103719s
* WIOS=54403, WBytes=2460528640, RIOS=1709, RBytes=208237568
The Option 1 will increase `wios`. So, the `image_pull_with_sync_fs` is
option in CRI plugin.
[syncfs]: <https://man7.org/linux/man-pages/man2/syncfs.2.html>
[xfstest-dev]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git>
[generic/690]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/tests/generic/690?h=v2023.11.19>
[kernel-crash-recovery-data-integrity]: <https://lore.kernel.org/linux-fsdevel/1552418820-18102-1-git-send-email-jaya@cs.utexas.edu/>
[test-tool]: <a17fb2010d/contrib/syncfs/containerd/main_test.go (L51)>
Signed-off-by: Wei Fu <fuweid89@gmail.com>
Remove containerd specific parts of the plugin package to prepare its
move out of the main repository. Separate the plugin registration
singleton into a separate package.
Separating out the plugin package and registration makes it easier to
implement external plugins without creating a dependency loop.
Signed-off-by: Derek McGowan <derek@mcg.dev>
The plugins packages defines the plugins used by containerd.
Move all the types and properties to this package.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Helpers to convert from the OCI image specs [Descriptor] to its protobuf
structure for Descriptor and vice-versa appear three times. It seems sane
to just expose this facility in /oci.
Signed-off-by: Danny Canter <danny@dcantah.dev>
Helpers to convert from containerd's [Mount] to its protobuf structure for
[Mount] and vice-versa appear three times. It seems sane to just expose
this facility in /mount.
Signed-off-by: Danny Canter <danny@dcantah.dev>
- Add Target to mount.Mount.
- Add UnmountMounts to unmount a list of mounts in reverse order.
- Add UnmountRecursive to unmount deepest mount first for a given target, using
moby/sys/mountinfo.
Signed-off-by: Edgar Lee <edgarhinshunlee@gmail.com>
This makes diff archives to be reproducible.
The value is expected to be passed from CLI applications via the $SOUCE_DATE_EPOCH env var.
See https://reproducible-builds.org/docs/source-date-epoch/
for the $SOURCE_DATE_EPOCH specification.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Previouslty "Size" was reserved by protoc-gen-gogoctrd and user-generated
"Size" was automatically renamed to "Size_" to avoid conflicts.
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
This commit hides types.Any from the diff package's interface. Clients
(incl. imgcrypt) shouldn't aware about gogo/protobuf.
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
gogoproto.customtype is used to have go-digest.Digest instead of string.
While it is convinient, protoc-gen-go doesn't support the extension
and that blocks #6564.
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
Remove build tags which are already implied by the name of the file.
Ensures build tags are used consistently
Signed-off-by: Derek McGowan <derek@mcg.dev>
Extend the Applier interface's Apply method with an optional
options parameter.
For the container image encryption we intend to use the options
parameter to pass image decryption parameters ('dcparameters'),
which are primarily (privatte) keys, in form of a JSON document
under the map key '_dcparameters', and pass them to the Applier's
Apply() method. This helps us to access decryption keys and start
the pipeline with the layer decryption before the layer data are
unzipped and untarred.
Signed-off-by: Stefan Berger <stefanb@linux.ibm.com>
Signed-off-by: Harshal Patil <harshal.patil@in.ibm.com>
Implements the Windows lcow differ/snapshotter responsible for managing
the creation and lifetime of lcow containers on Windows.
Signed-off-by: Justin Terry (VM) <juterry@microsoft.com>
Since Go 1.7, context is a standard package, superceding the
"x/net/context". Since Go 1.9, the latter only provides a few type
aliases from the former. Therefore, it makes sense to switch to the
standard package.
This commit was generated by the following script (with a couple of
minor fixups to remove extra changes done by goimports):
#!/bin/bash
if [ $# -ge 1 ]; then
FILES=$*
else
FILES=$(git ls-files \*.go | grep -vF ".pb.go" | grep -v
^vendor/)
fi
for f in $FILES; do
printf .
sed -i -e 's|"golang.org/x/net/context"$|"context"|' $f
goimports -w $f
awk ' /^$/ {e=1; next;}
/[[:space:]]"context"$/ {e=0;}
{if (e) {print ""; e=0}; print;}' < $f > $f.new && \
mv $f.new $f
goimports -w $f
done
echo
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This implements the Windows snapshotter and diff Apply function.
This allows for Windows layers to be created, and layers to be pulled
from the hub.
Signed-off-by: Darren Stahl <darst@microsoft.com>
Add differ options and package with interface.
Update optional values on diff interface to use options.
Signed-off-by: Derek McGowan <derek@mcgstyle.net>
Matching support is now implemented in the platforms package. The
`Parse` function now returns a matcher object that can be used to
match OCI platform specifications. We define this as an interface to
allow the creation of helpers oriented around platform selection.
Signed-off-by: Stephen J Day <stephen.day@docker.com>
Updates the differ service to support calling and configuring
multiple differs. The differs are configured as an ordered list
of differs which will each be attempting until a supported differ
is called.
Additionally a not supported error type was added to allow differs
to be selective of whether the differ arguments are supported by
the differ. This error type corresponds to the GRPC unimplemented error.
Signed-off-by: Derek McGowan <derek@mcgstyle.net>
To simplify use of types, we have consolidate the packages for the mount
and descriptor protobuf types into a single Go package. We also drop the
versioning from the type packages, as these types will remain the same
between versions.
Signed-off-by: Stephen J Day <stephen.day@docker.com>
This moves both the Mount type and mountinfo into a single mount
package.
This also opens up the root of the repo to hold the containerd client
implementation.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Remove rootfs service in place of snapshot service. Adds
diff service for extracting and creating diffs. Diff
creation is not yet implemented. This service allows
pulling or creating images without needing root access to
mount. Additionally in the future this will allow containerd
to ensure extractions happen safely in a chroot if needed.
Signed-off-by: Derek McGowan <derek@mcgstyle.net>