containerd/services/diff/local.go
Wei Fu 23278c81fb *: introduce image_pull_with_sync_fs in CRI
It's to ensure the data integrity during unexpected power failure.

Background:

Since release 1.3, in Linux system, containerD unpacks and writes files into
overlayfs snapshot directly. It doesn’t involve any mount-umount operations
so that the performance of pulling image has been improved.

As we know, the umount syscall for overlayfs will force kernel to flush
all the dirty pages into disk. Without umount syscall, the files’ data relies
on kernel’s writeback threads or filesystem's commit setting (for
instance, ext4 filesystem).

The files in committed snapshot can be loss after unexpected power failure.
However, the snapshot has been committed and the metadata also has been
fsynced. There is data inconsistency between snapshot metadata and files
in that snapshot.

We, containerd, received several issues about data loss after unexpected
power failure.

* https://github.com/containerd/containerd/issues/5854
* https://github.com/containerd/containerd/issues/3369#issuecomment-1787334907

Solution:

* Option 1: SyncFs after unpack

Linux platform provides [syncfs][syncfs] syscall to synchronize just the
filesystem containing a given file.

* Option 2: Fsync directories recursively and fsync on regular file

The fsync doesn't support symlink/block device/char device files. We
need to use fsync the parent directory to ensure that entry is
persisted.

However, based on [xfstest-dev][xfstest-dev], there is no case to ensure
fsync-on-parent can persist the special file's metadata, for example,
uid/gid, access mode.

Checkout [generic/690][generic/690]: Syncing parent dir can persist
symlink. But for f2fs, it needs special mount option. And it doesn't say
that uid/gid can be persisted. All the details are behind the
implemetation.

> NOTE: All the related test cases has `_flakey_drop_and_remount` in
[xfstest-dev].

Based on discussion about [Documenting the crash-recovery guarantees of Linux file systems][kernel-crash-recovery-data-integrity],
we can't rely on Fsync-on-parent.

* Option 1 is winner

This patch is using option 1.

There is test result based on [test-tool][test-tool].
All the networking traffic created by pull is local.

  * Image: docker.io/library/golang:1.19.4 (992 MiB)
    * Current: 5.446738579s
      * WIOS=21081, WBytes=1329741824, RIOS=79, RBytes=1197056
    * Option 1: 6.239686088s
      * WIOS=34804, WBytes=1454845952, RIOS=79, RBytes=1197056
    * Option 2: 1m30.510934813s
      * WIOS=42143, WBytes=1471397888, RIOS=82, RBytes=1209344

  * Image: docker.io/tensorflow/tensorflow:latest (1.78 GiB, ~32590 Inodes)
    * Current: 8.852718042s
      * WIOS=39417, WBytes=2412818432, RIOS=2673, RBytes=335987712
    * Option 1: 9.683387174s
      * WIOS=42767, WBytes=2431750144, RIOS=89, RBytes=1238016
    * Option 2: 1m54.302103719s
      * WIOS=54403, WBytes=2460528640, RIOS=1709, RBytes=208237568

The Option 1 will increase `wios`. So, the `image_pull_with_sync_fs` is
option in CRI plugin.

[syncfs]: <https://man7.org/linux/man-pages/man2/syncfs.2.html>
[xfstest-dev]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git>
[generic/690]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/tests/generic/690?h=v2023.11.19>
[kernel-crash-recovery-data-integrity]: <https://lore.kernel.org/linux-fsdevel/1552418820-18102-1-git-send-email-jaya@cs.utexas.edu/>
[test-tool]: <a17fb2010d/contrib/syncfs/containerd/main_test.go (L51)>

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-12-12 10:18:39 +08:00

165 lines
4.2 KiB
Go

/*
Copyright The containerd Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package diff
import (
"context"
"fmt"
diffapi "github.com/containerd/containerd/v2/api/services/diff/v1"
"github.com/containerd/containerd/v2/diff"
"github.com/containerd/containerd/v2/errdefs"
"github.com/containerd/containerd/v2/mount"
"github.com/containerd/containerd/v2/oci"
"github.com/containerd/containerd/v2/plugins"
"github.com/containerd/containerd/v2/services"
"github.com/containerd/plugin"
"github.com/containerd/plugin/registry"
"github.com/containerd/typeurl/v2"
ocispec "github.com/opencontainers/image-spec/specs-go/v1"
"google.golang.org/grpc"
)
type config struct {
// Order is the order of preference in which to try diff algorithms, the
// first differ which is supported is used.
// Note when multiple differs may be supported, this order will be
// respected for which is chosen. Each differ should return the same
// correct output, allowing any ordering to be used to prefer
// more optimimal implementations.
Order []string `toml:"default"`
}
type differ interface {
diff.Comparer
diff.Applier
}
func init() {
registry.Register(&plugin.Registration{
Type: plugins.ServicePlugin,
ID: services.DiffService,
Requires: []plugin.Type{
plugins.DiffPlugin,
},
Config: defaultDifferConfig,
InitFn: func(ic *plugin.InitContext) (interface{}, error) {
differs, err := ic.GetByType(plugins.DiffPlugin)
if err != nil {
return nil, err
}
orderedNames := ic.Config.(*config).Order
ordered := make([]differ, len(orderedNames))
for i, n := range orderedNames {
d, ok := differs[n]
if !ok {
return nil, fmt.Errorf("needed differ not loaded: %s", n)
}
ordered[i], ok = d.(differ)
if !ok {
return nil, fmt.Errorf("differ does not implement Comparer and Applier interface: %s", n)
}
}
return &local{
differs: ordered,
}, nil
},
})
}
type local struct {
differs []differ
}
var _ diffapi.DiffClient = &local{}
func (l *local) Apply(ctx context.Context, er *diffapi.ApplyRequest, _ ...grpc.CallOption) (*diffapi.ApplyResponse, error) {
var (
ocidesc ocispec.Descriptor
err error
desc = oci.DescriptorFromProto(er.Diff)
mounts = mount.FromProto(er.Mounts)
)
var opts []diff.ApplyOpt
if er.Payloads != nil {
payloads := make(map[string]typeurl.Any)
for k, v := range er.Payloads {
payloads[k] = v
}
opts = append(opts, diff.WithPayloads(payloads))
}
opts = append(opts, diff.WithSyncFs(er.SyncFs))
for _, differ := range l.differs {
ocidesc, err = differ.Apply(ctx, desc, mounts, opts...)
if !errdefs.IsNotImplemented(err) {
break
}
}
if err != nil {
return nil, errdefs.ToGRPC(err)
}
return &diffapi.ApplyResponse{
Applied: oci.DescriptorToProto(ocidesc),
}, nil
}
func (l *local) Diff(ctx context.Context, dr *diffapi.DiffRequest, _ ...grpc.CallOption) (*diffapi.DiffResponse, error) {
var (
ocidesc ocispec.Descriptor
err error
aMounts = mount.FromProto(dr.Left)
bMounts = mount.FromProto(dr.Right)
)
var opts []diff.Opt
if dr.MediaType != "" {
opts = append(opts, diff.WithMediaType(dr.MediaType))
}
if dr.Ref != "" {
opts = append(opts, diff.WithReference(dr.Ref))
}
if dr.Labels != nil {
opts = append(opts, diff.WithLabels(dr.Labels))
}
if dr.SourceDateEpoch != nil {
tm := dr.SourceDateEpoch.AsTime()
opts = append(opts, diff.WithSourceDateEpoch(&tm))
}
for _, d := range l.differs {
ocidesc, err = d.Compare(ctx, aMounts, bMounts, opts...)
if !errdefs.IsNotImplemented(err) {
break
}
}
if err != nil {
return nil, errdefs.ToGRPC(err)
}
return &diffapi.DiffResponse{
Diff: oci.DescriptorToProto(ocidesc),
}, nil
}