It's to ensure the data integrity during unexpected power failure.
Background:
Since release 1.3, in Linux system, containerD unpacks and writes files into
overlayfs snapshot directly. It doesn’t involve any mount-umount operations
so that the performance of pulling image has been improved.
As we know, the umount syscall for overlayfs will force kernel to flush
all the dirty pages into disk. Without umount syscall, the files’ data relies
on kernel’s writeback threads or filesystem's commit setting (for
instance, ext4 filesystem).
The files in committed snapshot can be loss after unexpected power failure.
However, the snapshot has been committed and the metadata also has been
fsynced. There is data inconsistency between snapshot metadata and files
in that snapshot.
We, containerd, received several issues about data loss after unexpected
power failure.
* https://github.com/containerd/containerd/issues/5854
* https://github.com/containerd/containerd/issues/3369#issuecomment-1787334907
Solution:
* Option 1: SyncFs after unpack
Linux platform provides [syncfs][syncfs] syscall to synchronize just the
filesystem containing a given file.
* Option 2: Fsync directories recursively and fsync on regular file
The fsync doesn't support symlink/block device/char device files. We
need to use fsync the parent directory to ensure that entry is
persisted.
However, based on [xfstest-dev][xfstest-dev], there is no case to ensure
fsync-on-parent can persist the special file's metadata, for example,
uid/gid, access mode.
Checkout [generic/690][generic/690]: Syncing parent dir can persist
symlink. But for f2fs, it needs special mount option. And it doesn't say
that uid/gid can be persisted. All the details are behind the
implemetation.
> NOTE: All the related test cases has `_flakey_drop_and_remount` in
[xfstest-dev].
Based on discussion about [Documenting the crash-recovery guarantees of Linux file systems][kernel-crash-recovery-data-integrity],
we can't rely on Fsync-on-parent.
* Option 1 is winner
This patch is using option 1.
There is test result based on [test-tool][test-tool].
All the networking traffic created by pull is local.
* Image: docker.io/library/golang:1.19.4 (992 MiB)
* Current: 5.446738579s
* WIOS=21081, WBytes=1329741824, RIOS=79, RBytes=1197056
* Option 1: 6.239686088s
* WIOS=34804, WBytes=1454845952, RIOS=79, RBytes=1197056
* Option 2: 1m30.510934813s
* WIOS=42143, WBytes=1471397888, RIOS=82, RBytes=1209344
* Image: docker.io/tensorflow/tensorflow:latest (1.78 GiB, ~32590 Inodes)
* Current: 8.852718042s
* WIOS=39417, WBytes=2412818432, RIOS=2673, RBytes=335987712
* Option 1: 9.683387174s
* WIOS=42767, WBytes=2431750144, RIOS=89, RBytes=1238016
* Option 2: 1m54.302103719s
* WIOS=54403, WBytes=2460528640, RIOS=1709, RBytes=208237568
The Option 1 will increase `wios`. So, the `image_pull_with_sync_fs` is
option in CRI plugin.
[syncfs]: <https://man7.org/linux/man-pages/man2/syncfs.2.html>
[xfstest-dev]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git>
[generic/690]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/tests/generic/690?h=v2023.11.19>
[kernel-crash-recovery-data-integrity]: <https://lore.kernel.org/linux-fsdevel/1552418820-18102-1-git-send-email-jaya@cs.utexas.edu/>
[test-tool]: <a17fb2010d/contrib/syncfs/containerd/main_test.go (L51)>
Signed-off-by: Wei Fu <fuweid89@gmail.com>
There are many Kubernetes clusters running on ARM64. Enable ARM64 runner
is to commit to support ARM64 platform officially.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
All components need to use a consistent `semconv` version or OTel
will emit errors about "cannot merge resource due to conflicting Schema URL".
Switch to the appropriate semconv version, which requires dropping
usage of `httpconv`. Instead, the upstream HTTP client hooks are
used directly. (The lower-level functions are no longer exported by
OTel.)
Signed-off-by: Milas Bowman <milas.bowman@docker.com>
go1.21.5 (released 2023-12-05) includes security fixes to the go command,
and the net/http and path/filepath packages, as well as bug fixes to the
compiler, the go command, the runtime, and the crypto/rand, net, os, and
syscall packages. See the Go 1.21.5 milestone on our issue tracker for
details:
- https://github.com/golang/go/issues?q=milestone%3AGo1.21.5+label%3ACherryPickApproved
- full diff: https://github.com/golang/go/compare/go1.21.4...go1.21.5
from the security mailing:
[security] Go 1.21.5 and Go 1.20.12 are released
Hello gophers,
We have just released Go versions 1.21.5 and 1.20.12, minor point releases.
These minor releases include 3 security fixes following the security policy:
- net/http: limit chunked data overhead
A malicious HTTP sender can use chunk extensions to cause a receiver
reading from a request or response body to read many more bytes from
the network than are in the body.
A malicious HTTP client can further exploit this to cause a server to
automatically read a large amount of data (up to about 1GiB) when a
handler fails to read the entire body of a request.
Chunk extensions are a little-used HTTP feature which permit including
additional metadata in a request or response body sent using the chunked
encoding. The net/http chunked encoding reader discards this metadata.
A sender can exploit this by inserting a large metadata segment with
each byte transferred. The chunk reader now produces an error if the
ratio of real body to encoded bytes grows too small.
Thanks to Bartek Nowotarski for reporting this issue.
This is CVE-2023-39326 and Go issue https://go.dev/issue/64433.
- cmd/go: go get may unexpectedly fallback to insecure git
Using go get to fetch a module with the ".git" suffix may unexpectedly
fallback to the insecure "git://" protocol if the module is unavailable
via the secure "https://" and "git+ssh://" protocols, even if GOINSECURE
is not set for said module. This only affects users who are not using
the module proxy and are fetching modules directly (i.e. GOPROXY=off).
Thanks to David Leadbeater for reporting this issue.
This is CVE-2023-45285 and Go issue https://go.dev/issue/63845.
- path/filepath: retain trailing \ when cleaning paths like \\?\c:\
Go 1.20.11 and Go 1.21.4 inadvertently changed the definition of the
volume name in Windows paths starting with \\?\, resulting in
filepath.Clean(\\?\c:\) returning \\?\c: rather than \\?\c:\ (among
other effects). The previous behavior has been restored.
This is an update to CVE-2023-45283 and Go issue https://go.dev/issue/64028.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
go1.21.4 (released 2023-11-07) includes security fixes to the path/filepath
package, as well as bug fixes to the linker, the runtime, the compiler, and
the go/types, net/http, and runtime/cgo packages. See the Go 1.21.4 milestone
on our issue tracker for details:
- https://github.com/golang/go/issues?q=milestone%3AGo1.21.4+label%3ACherryPickApproved
- full diff: https://github.com/golang/go/compare/go1.21.3...go1.21.4
from the security mailing:
[security] Go 1.21.4 and Go 1.20.11 are released
Hello gophers,
We have just released Go versions 1.21.4 and 1.20.11, minor point releases.
These minor releases include 2 security fixes following the security policy:
- path/filepath: recognize `\??\` as a Root Local Device path prefix.
On Windows, a path beginning with `\??\` is a Root Local Device path equivalent
to a path beginning with `\\?\`. Paths with a `\??\` prefix may be used to
access arbitrary locations on the system. For example, the path `\??\c:\x`
is equivalent to the more common path c:\x.
The filepath package did not recognize paths with a `\??\` prefix as special.
Clean could convert a rooted path such as `\a\..\??\b` into
the root local device path `\??\b`. It will now convert this
path into `.\??\b`.
`IsAbs` did not report paths beginning with `\??\` as absolute.
It now does so.
VolumeName now reports the `\??\` prefix as a volume name.
`Join(`\`, `??`, `b`)` could convert a seemingly innocent
sequence of path elements into the root local device path
`\??\b`. It will now convert this to `\.\??\b`.
This is CVE-2023-45283 and https://go.dev/issue/63713.
- path/filepath: recognize device names with trailing spaces and superscripts
The `IsLocal` function did not correctly detect reserved names in some cases:
- reserved names followed by spaces, such as "COM1 ".
- "COM" or "LPT" followed by a superscript 1, 2, or 3.
`IsLocal` now correctly reports these names as non-local.
This is CVE-2023-45284 and https://go.dev/issue/63713.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
If a snapshot removal fails (during garbage collection), the entire garbage collection operation is
cancelled. This is problematic because once cleanup of any snapshot fails no other snapshots will be cleaned
and the disk usage will just keep increasing.
Solution is to return snapshot removal errors wrapped as "ErrFailedPrecondition" errors. The garbage
collectors continues cleanup if the error is of this type.
Signed-off-by: Amit Barve <ambarve@microsoft.com>
The runtime-spec just merged this PR:
https://github.com/opencontainers/runtime-spec/pull/1224
This means that it is now possible to request idmap mounts by specifying
"idmap" or "ridmap" in the mount options, without any mappings.
Let's add a check to see if they are requested in that way too.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
For backward compatibility, we should get runtimeInfo from sandbox in
db, or get it from the sandbox container in db.
Note that this is a temporary solution and we will remove the Container field in
Sandbox in cri cache, and replace it with a SandboxInsantance of type
containerd.Sandbox interface.
Signed-off-by: Abel Feng <fshb1988@gmail.com>
This is mostly to workaround an issue with gRPC based shims after containerd
restart. If a shim dies while containerd is also down/restarting, on reboot
grpc.DialContext with our current set of DialOptions will make us wait for
100 seconds per shim even if the socket no longer exists or has no listener.
Signed-off-by: Danny Canter <danny@dcantah.dev>