It's to ensure the data integrity during unexpected power failure.
Background:
Since release 1.3, in Linux system, containerD unpacks and writes files into
overlayfs snapshot directly. It doesn’t involve any mount-umount operations
so that the performance of pulling image has been improved.
As we know, the umount syscall for overlayfs will force kernel to flush
all the dirty pages into disk. Without umount syscall, the files’ data relies
on kernel’s writeback threads or filesystem's commit setting (for
instance, ext4 filesystem).
The files in committed snapshot can be loss after unexpected power failure.
However, the snapshot has been committed and the metadata also has been
fsynced. There is data inconsistency between snapshot metadata and files
in that snapshot.
We, containerd, received several issues about data loss after unexpected
power failure.
* https://github.com/containerd/containerd/issues/5854
* https://github.com/containerd/containerd/issues/3369#issuecomment-1787334907
Solution:
* Option 1: SyncFs after unpack
Linux platform provides [syncfs][syncfs] syscall to synchronize just the
filesystem containing a given file.
* Option 2: Fsync directories recursively and fsync on regular file
The fsync doesn't support symlink/block device/char device files. We
need to use fsync the parent directory to ensure that entry is
persisted.
However, based on [xfstest-dev][xfstest-dev], there is no case to ensure
fsync-on-parent can persist the special file's metadata, for example,
uid/gid, access mode.
Checkout [generic/690][generic/690]: Syncing parent dir can persist
symlink. But for f2fs, it needs special mount option. And it doesn't say
that uid/gid can be persisted. All the details are behind the
implemetation.
> NOTE: All the related test cases has `_flakey_drop_and_remount` in
[xfstest-dev].
Based on discussion about [Documenting the crash-recovery guarantees of Linux file systems][kernel-crash-recovery-data-integrity],
we can't rely on Fsync-on-parent.
* Option 1 is winner
This patch is using option 1.
There is test result based on [test-tool][test-tool].
All the networking traffic created by pull is local.
* Image: docker.io/library/golang:1.19.4 (992 MiB)
* Current: 5.446738579s
* WIOS=21081, WBytes=1329741824, RIOS=79, RBytes=1197056
* Option 1: 6.239686088s
* WIOS=34804, WBytes=1454845952, RIOS=79, RBytes=1197056
* Option 2: 1m30.510934813s
* WIOS=42143, WBytes=1471397888, RIOS=82, RBytes=1209344
* Image: docker.io/tensorflow/tensorflow:latest (1.78 GiB, ~32590 Inodes)
* Current: 8.852718042s
* WIOS=39417, WBytes=2412818432, RIOS=2673, RBytes=335987712
* Option 1: 9.683387174s
* WIOS=42767, WBytes=2431750144, RIOS=89, RBytes=1238016
* Option 2: 1m54.302103719s
* WIOS=54403, WBytes=2460528640, RIOS=1709, RBytes=208237568
The Option 1 will increase `wios`. So, the `image_pull_with_sync_fs` is
option in CRI plugin.
[syncfs]: <https://man7.org/linux/man-pages/man2/syncfs.2.html>
[xfstest-dev]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git>
[generic/690]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/tests/generic/690?h=v2023.11.19>
[kernel-crash-recovery-data-integrity]: <https://lore.kernel.org/linux-fsdevel/1552418820-18102-1-git-send-email-jaya@cs.utexas.edu/>
[test-tool]: <a17fb2010d/contrib/syncfs/containerd/main_test.go (L51)>
Signed-off-by: Wei Fu <fuweid89@gmail.com>
This is effectively a revert of 2ac9968401, which
switched from os/exec to the golang.org/x/sys/execabs package to mitigate
security issues (mainly on Windows) with lookups resolving to binaries in the
current directory.
from the go1.19 release notes https://go.dev/doc/go1.19#os-exec-path
> ## PATH lookups
>
> Command and LookPath no longer allow results from a PATH search to be found
> relative to the current directory. This removes a common source of security
> problems but may also break existing programs that depend on using, say,
> exec.Command("prog") to run a binary named prog (or, on Windows, prog.exe) in
> the current directory. See the os/exec package documentation for information
> about how best to update such programs.
>
> On Windows, Command and LookPath now respect the NoDefaultCurrentDirectoryInExePath
> environment variable, making it possible to disable the default implicit search
> of “.” in PATH lookups on Windows systems.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Remove containerd specific parts of the plugin package to prepare its
move out of the main repository. Separate the plugin registration
singleton into a separate package.
Separating out the plugin package and registration makes it easier to
implement external plugins without creating a dependency loop.
Signed-off-by: Derek McGowan <derek@mcg.dev>
The plugins packages defines the plugins used by containerd.
Move all the types and properties to this package.
Signed-off-by: Derek McGowan <derek@mcg.dev>
That commit neither helps without a working bind-mount implementation nor is needed when such implementation exists.
Testing shows that containerd can properly download and unpack image using bindfs mounts (see previous commit) even without Darwin-specific applier code.
Signed-off-by: Marat Radchenko <marat@slonopotamus.org>
The whiteout timestamps are no longer set to the source date epoch.
The source date epoch still applies to non-whiteout files.
Discussion happened in moby/buildkit PR 3560.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Helpers to convert from the OCI image specs [Descriptor] to its protobuf
structure for Descriptor and vice-versa appear three times. It seems sane
to just expose this facility in /oci.
Signed-off-by: Danny Canter <danny@dcantah.dev>
Helpers to convert from containerd's [Mount] to its protobuf structure for
[Mount] and vice-versa appear three times. It seems sane to just expose
this facility in /mount.
Signed-off-by: Danny Canter <danny@dcantah.dev>
The WCOW layer support does not support creating sandboxes with no
parent. Instead, parentless scratch layers must be layed out as a
directory containing only a directory named 'Files', and all data stored
inside 'Files'. At commit-time, this will be converted in-place into a
read-only layer suitable for use as a parent layer.
The WCOW layer support also does not deal with making read-only layers,
i.e. layers that are prepared to be parent layers, visible in a
read-only manner. A bind-mount or junction point cannot be made
read-only, so a view must instead be a small sandbox layer that we can
mount via WCOW, and discard later, to protect the layer against
accidental or deliberate modification.
Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
This is necessary so we can mount snapshots more than once with overlayfs,
otherwise mounts enter an unknown state.
related: https://github.com/moby/buildkit/pull/1100
Signed-off-by: Laura Brehm <laurabrehm@hey.com>
Co-authored-by: Zou Nengren <zouyee1989@gmail.com>
From golangci-lint:
> SA1019: rand.Read has been deprecated since Go 1.20 because it
>shouldn't be used: For almost all use cases, crypto/rand.Read is more
>appropriate. (staticcheck)
> SA1019: rand.Seed has been deprecated since Go 1.20 and an alternative
>has been available since Go 1.0: Programs that call Seed and then expect
>a specific sequence of results from the global random source (using
>functions such as Int) can be broken when a dependency changes how
>much it consumes from the global random source. To avoid such breakages,
>programs that need a specific result sequence should use
>NewRand(NewSource(seed)) to obtain a random generator that other
>packages cannot access. (staticcheck)
See also:
- https://pkg.go.dev/math/rand@go1.20#Read
- https://pkg.go.dev/math/rand@go1.20#Seed
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
`rc.r.Read()` may return a negative `int` on an error
when the reader is set to a custom content store implementation
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
This makes diff archives to be reproducible.
The value is expected to be passed from CLI applications via the $SOUCE_DATE_EPOCH env var.
See https://reproducible-builds.org/docs/source-date-epoch/
for the $SOURCE_DATE_EPOCH specification.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
When unpacking a TAR archive, containerd preserves file's owner:
https://github.com/containerd/containerd/blob/main/archive/tar.go#L384
In some cases this behavior is not desired. In current implementation we
avoid `Lchown` on Windows. Another case when this should be skipped is
when using native snapshotter on darwin and running as non-root user.
This PR extracts a generic option - `WithNoSameOwner` (same as
`tar --no-same-owner`) to skip `Lchown` when its not required.
Signed-off-by: Maksym Pavlenko <pavlenko.maksym@gmail.com>
On Windows the two differs we register by default are the "windows" and
"windows-lcow" differs. The diff service checks if Apply returns
ErrNotImplemented and will move on to the next differ in the line.
The Windows differ makes use of this to fallback to LCOW if it's
determined the mount type passed is incorrect, but the LCOW differ
does not return ErrNotImplemented for the same scenario. This puts
a strict ordering requirement on the default differ entries in the config,
namely that ["windows", "windows-lcow"] will work, as windows will correctly
fall back to the lcow differ, but ["windows-lcow", "windows"] won't as
the diff services Apply will just return the error directly.
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
Add exported `Wait(ctx context.Context) error` interface that waits on
the underlying command (or context cancellation) and returns the error.
This fixes a race condition between `.wait()` and `.Err error`:
https://github.com/containerd/containerd/issues/6914
Signed-off-by: Hamza El-Saawy <hamzaelsaawy@microsoft.com>
This commit hides types.Any from the diff package's interface. Clients
(incl. imgcrypt) shouldn't aware about gogo/protobuf.
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
The io/ioutil package has been deprecated as of Go 1.16, see
https://golang.org/doc/go1.16#ioutil. This commit replaces the existing
io/ioutil functions with their new definitions in io and os packages.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
Go 1.15.7 contained a security fix for CVE-2021-3115, which allowed arbitrary
code to be executed at build time when using cgo on Windows. This issue also
affects Unix users who have “.” listed explicitly in their PATH and are running
“go get” outside of a module or with module mode disabled.
This issue is not limited to the go command itself, and can also affect binaries
that use `os.Command`, `os.LookPath`, etc.
From the related blogpost (ttps://blog.golang.org/path-security):
> Are your own programs affected?
>
> If you use exec.LookPath or exec.Command in your own programs, you only need to
> be concerned if you (or your users) run your program in a directory with untrusted
> contents. If so, then a subprocess could be started using an executable from dot
> instead of from a system directory. (Again, using an executable from dot happens
> always on Windows and only with uncommon PATH settings on Unix.)
>
> If you are concerned, then we’ve published the more restricted variant of os/exec
> as golang.org/x/sys/execabs. You can use it in your program by simply replacing
This patch replaces all uses of `os/exec` with `golang.org/x/sys/execabs`. While
some uses of `os/exec` should not be problematic (e.g. part of tests), it is
probably good to be consistent, in case code gets moved around.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Remove build tags which are already implied by the name of the file.
Ensures build tags are used consistently
Signed-off-by: Derek McGowan <derek@mcg.dev>