Commit Graph

16 Commits

Author SHA1 Message Date
Derek McGowan
8f0eb26311
Move tracing to pkg/tracing
Signed-off-by: Derek McGowan <derek@mcg.dev>
2024-01-17 09:56:25 -08:00
Derek McGowan
44a836c9b5
Move errdefs to pkg/errdefs
Signed-off-by: Derek McGowan <derek@mcg.dev>
2024-01-17 09:54:45 -08:00
Derek McGowan
0dabf6f154
Move remotes to core/remotes
Signed-off-by: Derek McGowan <derek@mcg.dev>
2024-01-17 09:52:21 -08:00
Derek McGowan
57ea8aef3d
Move images to core/images
Signed-off-by: Derek McGowan <derek@mcg.dev>
2024-01-17 09:51:26 -08:00
Derek McGowan
913edcd489
Move diff to core/diff
Signed-off-by: Derek McGowan <derek@mcg.dev>
2024-01-17 09:51:17 -08:00
Derek McGowan
ad4c9f8a9d
Update CRI runtime platform and pinned image configuration
Updates the CRI image service to own image related configuration and
separate it from the runtime configuration.

Signed-off-by: Derek McGowan <derek@mcg.dev>
2024-01-11 09:55:09 -08:00
Derek McGowan
02a9a456e1
Split image config from CRI plugin
Signed-off-by: Derek McGowan <derek@mcg.dev>
2024-01-11 09:55:09 -08:00
Derek McGowan
d23ac1122e
Split CRI image service from GRPC handler
Prepares the CRI image service for splitting CRI into multiple plugins.
Also prepares for config migration which will spread across multiple
different plugins.

Signed-off-by: Derek McGowan <derek@mcg.dev>
2024-01-11 09:55:08 -08:00
Wei Fu
23278c81fb *: introduce image_pull_with_sync_fs in CRI
It's to ensure the data integrity during unexpected power failure.

Background:

Since release 1.3, in Linux system, containerD unpacks and writes files into
overlayfs snapshot directly. It doesn’t involve any mount-umount operations
so that the performance of pulling image has been improved.

As we know, the umount syscall for overlayfs will force kernel to flush
all the dirty pages into disk. Without umount syscall, the files’ data relies
on kernel’s writeback threads or filesystem's commit setting (for
instance, ext4 filesystem).

The files in committed snapshot can be loss after unexpected power failure.
However, the snapshot has been committed and the metadata also has been
fsynced. There is data inconsistency between snapshot metadata and files
in that snapshot.

We, containerd, received several issues about data loss after unexpected
power failure.

* https://github.com/containerd/containerd/issues/5854
* https://github.com/containerd/containerd/issues/3369#issuecomment-1787334907

Solution:

* Option 1: SyncFs after unpack

Linux platform provides [syncfs][syncfs] syscall to synchronize just the
filesystem containing a given file.

* Option 2: Fsync directories recursively and fsync on regular file

The fsync doesn't support symlink/block device/char device files. We
need to use fsync the parent directory to ensure that entry is
persisted.

However, based on [xfstest-dev][xfstest-dev], there is no case to ensure
fsync-on-parent can persist the special file's metadata, for example,
uid/gid, access mode.

Checkout [generic/690][generic/690]: Syncing parent dir can persist
symlink. But for f2fs, it needs special mount option. And it doesn't say
that uid/gid can be persisted. All the details are behind the
implemetation.

> NOTE: All the related test cases has `_flakey_drop_and_remount` in
[xfstest-dev].

Based on discussion about [Documenting the crash-recovery guarantees of Linux file systems][kernel-crash-recovery-data-integrity],
we can't rely on Fsync-on-parent.

* Option 1 is winner

This patch is using option 1.

There is test result based on [test-tool][test-tool].
All the networking traffic created by pull is local.

  * Image: docker.io/library/golang:1.19.4 (992 MiB)
    * Current: 5.446738579s
      * WIOS=21081, WBytes=1329741824, RIOS=79, RBytes=1197056
    * Option 1: 6.239686088s
      * WIOS=34804, WBytes=1454845952, RIOS=79, RBytes=1197056
    * Option 2: 1m30.510934813s
      * WIOS=42143, WBytes=1471397888, RIOS=82, RBytes=1209344

  * Image: docker.io/tensorflow/tensorflow:latest (1.78 GiB, ~32590 Inodes)
    * Current: 8.852718042s
      * WIOS=39417, WBytes=2412818432, RIOS=2673, RBytes=335987712
    * Option 1: 9.683387174s
      * WIOS=42767, WBytes=2431750144, RIOS=89, RBytes=1238016
    * Option 2: 1m54.302103719s
      * WIOS=54403, WBytes=2460528640, RIOS=1709, RBytes=208237568

The Option 1 will increase `wios`. So, the `image_pull_with_sync_fs` is
option in CRI plugin.

[syncfs]: <https://man7.org/linux/man-pages/man2/syncfs.2.html>
[xfstest-dev]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git>
[generic/690]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/tests/generic/690?h=v2023.11.19>
[kernel-crash-recovery-data-integrity]: <https://lore.kernel.org/linux-fsdevel/1552418820-18102-1-git-send-email-jaya@cs.utexas.edu/>
[test-tool]: <a17fb2010d/contrib/syncfs/containerd/main_test.go (L51)>

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-12-12 10:18:39 +08:00
Wei Fu
80dd779deb remotes/docker: close connection if no more data
Close connection if no more data. It's to fix false alert filed by image
pull progress.

```
dst = OpenWriter (--> Content Store)

src = Fetch
        Open (--> Registry)
        Mark it as active request

Copy(dst, src) (--> Keep updating total received bytes)

   ^
   |  (Active Request > 0, but total received bytes won't be updated)
   v

defer src.Close()
content.Commit(dst)
```

Before migrating to transfer service, CRI plugin doesn't limit global
concurrent downloads for ImagePulls. Each ImagePull requests have 3 concurrent
goroutines to download blob and 1 goroutine to unpack blob. Like ext4
filesystem [1][1], the fsync from content.Commit may sync unrelated dirty pages
into disk. The host is running under IO pressure, and then the content.Commit
will take long time and block other goroutines. If httpreadseeker
doesn't close the connection after io.EOF, this connection will be
considered as active. The pull progress reporter reports there is no
bytes transfered and cancels the ImagePull.

The original 1-minute timeout[2][2] is from kubelet settting. Since CRI-plugin
can't limit the total concurrent downloads, this patch is to update 1-minute
to 5-minutes to prevent from unexpected cancel.

[1]: https://lwn.net/Articles/842385/
[2]: https://github.com/kubernetes/kubernetes/blob/release-1.23/pkg/kubelet/config/flags.go#L45-L48

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-11-18 10:23:05 +08:00
Wei Fu
2e9686c054 fix: deflake TestCRIImagePullTimeout/HoldingContentOpenWriter
The new active request is filed and there is no bytes read yet when the
progress reporter just wakes up. If the timeout / 2 is less than the
minPullProgressReportInternal, it's easy to file false alert.

We should remove the minPullProgressReportInternal limit.

Fixes: #8024

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-11-18 10:23:05 +08:00
Abel Feng
32bf805e57 sandbox: add a sandboxService interface to criService
so that we can add a fakeSandboxService to the criService in tests.

Signed-off-by: Abel Feng <fshb1988@gmail.com>
2023-11-15 09:25:58 +08:00
Derek McGowan
261e01c2ac
Move client to subpackage
Signed-off-by: Derek McGowan <derek@mcg.dev>
2023-11-01 10:37:00 -07:00
Derek McGowan
5fdf55e493
Update go module to github.com/containerd/containerd/v2
Signed-off-by: Derek McGowan <derek@mcg.dev>
2023-10-29 20:52:21 -07:00
Derek McGowan
638b474c81
Temporarily remove imgcrypt in CRI to fix circular dependency
Signed-off-by: Derek McGowan <derek@mcg.dev>
2023-10-27 15:36:47 -07:00
Maksym Pavlenko
1b31993240 Rename sbserver to server
Signed-off-by: Maksym Pavlenko <pavlenko.maksym@gmail.com>
2023-10-12 15:46:57 -07:00