
This change ensures data integrity after an unexpected power failure.
Background:
Since release 1.3, containerd on Linux unpacks and writes files directly
into the overlayfs snapshot. This involves no mount/umount operations, which
improved image pull performance.
The umount syscall for overlayfs forces the kernel to flush all dirty pages
to disk. Without the umount syscall, persisting the file data relies on the
kernel's writeback threads or the filesystem's commit setting (for
instance, on ext4).
Files in a committed snapshot can therefore be lost after an unexpected power
failure, even though the snapshot has been committed and its metadata has
been fsynced. The result is an inconsistency between the snapshot metadata
and the files in that snapshot.
The containerd project has received several issue reports about data loss
after an unexpected power failure:
* https://github.com/containerd/containerd/issues/5854
* https://github.com/containerd/containerd/issues/3369#issuecomment-1787334907
Solution:
* Option 1: SyncFs after unpack
Linux provides the [syncfs][syncfs] syscall, which synchronizes just the
filesystem containing a given file.
* Option 2: Fsync directories recursively and fsync each regular file
fsync doesn't support symlinks, block devices, or char devices. For those,
we need to fsync the parent directory to ensure that the entry is
persisted.
However, [xfstest-dev][xfstest-dev] has no test case that ensures
fsync-on-parent persists a special file's metadata, for example, uid/gid
and access mode.
See [generic/690][generic/690]: syncing the parent directory can persist a
symlink, but f2fs needs a special mount option for that, and the test
doesn't say that uid/gid are persisted. These details depend on each
filesystem's implementation.
> NOTE: All the related test cases in [xfstest-dev] contain
> `_flakey_drop_and_remount`.
Based on the discussion in [Documenting the crash-recovery guarantees of Linux file systems][kernel-crash-recovery-data-integrity],
we can't rely on fsync-on-parent.
* Option 1 is the winner
This patch uses option 1.
The following test results were collected with [test-tool][test-tool].
All network traffic generated by the pull is local.
* Image: docker.io/library/golang:1.19.4 (992 MiB)
  * Current: 5.446738579s
    * WIOS=21081, WBytes=1329741824, RIOS=79, RBytes=1197056
  * Option 1: 6.239686088s
    * WIOS=34804, WBytes=1454845952, RIOS=79, RBytes=1197056
  * Option 2: 1m30.510934813s
    * WIOS=42143, WBytes=1471397888, RIOS=82, RBytes=1209344
* Image: docker.io/tensorflow/tensorflow:latest (1.78 GiB, ~32590 Inodes)
  * Current: 8.852718042s
    * WIOS=39417, WBytes=2412818432, RIOS=2673, RBytes=335987712
  * Option 1: 9.683387174s
    * WIOS=42767, WBytes=2431750144, RIOS=89, RBytes=1238016
  * Option 2: 1m54.302103719s
    * WIOS=54403, WBytes=2460528640, RIOS=1709, RBytes=208237568
Option 1 increases `wios`, so `image_pull_with_sync_fs` is provided as an
option in the CRI plugin.
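To put the numbers above in relative terms, the overhead can be computed directly from the reported durations and write IO counts. The helpers `overheadPercent` and `mustDuration` below are hypothetical names used only for this illustration; the input values are copied from the results above.

```go
package main

import (
	"fmt"
	"time"
)

// overheadPercent returns how much larger got is than base, in percent.
func overheadPercent(base, got float64) float64 {
	return (got - base) / base * 100
}

// mustDuration parses a Go duration string such as "1m30.5s" and panics on
// malformed input, which is acceptable for fixed literals like the ones below.
func mustDuration(s string) time.Duration {
	d, err := time.ParseDuration(s)
	if err != nil {
		panic(err)
	}
	return d
}

func main() {
	// golang:1.19.4 results from the list above.
	current := mustDuration("5.446738579s").Seconds()
	option1 := mustDuration("6.239686088s").Seconds()
	option2 := mustDuration("1m30.510934813s").Seconds()

	fmt.Printf("option 1 pull time overhead: %.1f%%\n", overheadPercent(current, option1))
	fmt.Printf("option 2 pull time overhead: %.1f%%\n", overheadPercent(current, option2))
	fmt.Printf("option 1 write IO overhead:  %.1f%%\n", overheadPercent(21081, 34804))
}
```

For these two images, option 1 lengthens the pull by roughly 15% and 9%, while option 2 makes it more than ten times slower.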
[syncfs]: <https://man7.org/linux/man-pages/man2/syncfs.2.html>
[xfstest-dev]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git>
[generic/690]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/tests/generic/690?h=v2023.11.19>
[kernel-crash-recovery-data-integrity]: <https://lore.kernel.org/linux-fsdevel/1552418820-18102-1-git-send-email-jaya@cs.utexas.edu/>
[test-tool]: <a17fb2010d/contrib/syncfs/containerd/main_test.go (L51)>
Signed-off-by: Wei Fu <fuweid89@gmail.com>
149 lines
4.9 KiB
Go
/*
   Copyright The containerd Authors.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
*/

package diff

import (
	"context"
	"io"
	"time"

	"github.com/containerd/containerd/v2/mount"
	"github.com/containerd/typeurl/v2"
	ocispec "github.com/opencontainers/image-spec/specs-go/v1"
)

// Config is used to hold parameters needed for a diff operation
type Config struct {
	// MediaType is the type of diff to generate
	// Default depends on the differ,
	// i.e. application/vnd.oci.image.layer.v1.tar+gzip
	MediaType string

	// Reference is the content upload reference
	// Default will use a random reference string
	Reference string

	// Labels are the labels to apply to the generated content
	Labels map[string]string

	// Compressor is a function to compress the diff stream
	// instead of the default gzip compressor. Differ passes
	// the MediaType of the target diff content to the compressor.
	// When using this config, MediaType must be specified as well.
	Compressor func(dest io.Writer, mediaType string) (io.WriteCloser, error)

	// SourceDateEpoch specifies the SOURCE_DATE_EPOCH without touching the env vars.
	SourceDateEpoch *time.Time
}

// Opt is used to configure a diff operation
type Opt func(*Config) error

// Comparer allows creation of filesystem diffs between mounts
type Comparer interface {
	// Compare computes the difference between two mounts and returns a
	// descriptor for the computed diff. The options can provide
	// a ref which can be used to track the content creation of the diff.
	// The media type which is used to determine the format of the created
	// content can also be provided as an option.
	Compare(ctx context.Context, lower, upper []mount.Mount, opts ...Opt) (ocispec.Descriptor, error)
}

// ApplyConfig is used to hold parameters needed for an apply operation
type ApplyConfig struct {
	// ProcessorPayloads specifies the payload sent to various processors
	ProcessorPayloads map[string]typeurl.Any
	// SyncFs indicates whether to synchronize the underlying filesystem
	// containing the files
	SyncFs bool
}

// ApplyOpt is used to configure an Apply operation
type ApplyOpt func(context.Context, ocispec.Descriptor, *ApplyConfig) error

// Applier allows applying diffs between mounts
type Applier interface {
	// Apply applies the content referred to by the given descriptor to
	// the provided mount. The method of applying is based on the
	// implementation and content descriptor. For example, in the common
	// case the descriptor is a file system difference in tar format,
	// that tar would be applied on top of the mounts.
	Apply(ctx context.Context, desc ocispec.Descriptor, mount []mount.Mount, opts ...ApplyOpt) (ocispec.Descriptor, error)
}

// WithCompressor sets the function to be used to compress the diff stream.
func WithCompressor(f func(dest io.Writer, mediaType string) (io.WriteCloser, error)) Opt {
	return func(c *Config) error {
		c.Compressor = f
		return nil
	}
}

// WithMediaType sets the media type to use for creating the diff. If it is
// not specified, the differ will choose a default.
func WithMediaType(m string) Opt {
	return func(c *Config) error {
		c.MediaType = m
		return nil
	}
}

// WithReference is used to set the content upload reference used by
// the diff operation. This allows the caller to track the upload through
// the content store.
func WithReference(ref string) Opt {
	return func(c *Config) error {
		c.Reference = ref
		return nil
	}
}

// WithLabels is used to set content labels on the created diff content.
func WithLabels(labels map[string]string) Opt {
	return func(c *Config) error {
		c.Labels = labels
		return nil
	}
}

// WithPayloads sets the apply processor payloads to the config
func WithPayloads(payloads map[string]typeurl.Any) ApplyOpt {
	return func(_ context.Context, _ ocispec.Descriptor, c *ApplyConfig) error {
		c.ProcessorPayloads = payloads
		return nil
	}
}

// WithSyncFs sets the sync flag in the config.
func WithSyncFs(sync bool) ApplyOpt {
	return func(_ context.Context, _ ocispec.Descriptor, c *ApplyConfig) error {
		c.SyncFs = sync
		return nil
	}
}

// WithSourceDateEpoch specifies the timestamp used to provide control for reproducibility.
// See also https://reproducible-builds.org/docs/source-date-epoch/ .
//
// Since containerd v2.0, the whiteout timestamps are set to zero (1970-01-01),
// not to the source date epoch.
func WithSourceDateEpoch(tm *time.Time) Opt {
	return func(c *Config) error {
		c.SourceDateEpoch = tm
		return nil
	}
}