oci: use readonly mount to read user/group info
In linux kernel, the umount writable-mountpoint will try to do sync-fs
to make sure that the dirty pages to the underlying filesystems. The many
number of umount actions in the same time maybe introduce performance
issue in IOPS limited disk.
When CRI-plugin creates container, it will temp-mount rootfs to read
that UID/GID info for entrypoint. Basically, the rootfs is writable
snapshotter and then after read, umount will invoke sync-fs action.
For example, using overlayfs on ext4 and use bcc-tools to monitor
ext4_sync_fs call.
```
// uname -a
Linux chaofan 5.13.0-27-generic #29~20.04.1-Ubuntu SMP Fri Jan 14 00:32:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
// open terminal 1
kubectl run --image=nginx --image-pull-policy=IfNotPresent nginx-pod
// open terminal 2
/usr/share/bcc/tools/stackcount ext4_sync_fs -i 1 -v -P
ext4_sync_fs
sync_filesystem
ovl_sync_fs
__sync_filesystem
sync_filesystem
generic_shutdown_super
kill_anon_super
deactivate_locked_super
deactivate_super
cleanup_mnt
__cleanup_mnt
task_work_run
exit_to_user_mode_prepare
syscall_exit_to_user_mode
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall.Syscall.abi0
github.com/containerd/containerd/mount.unmount
github.com/containerd/containerd/mount.UnmountAll
github.com/containerd/containerd/mount.WithTempMount.func2
github.com/containerd/containerd/mount.WithTempMount
github.com/containerd/containerd/oci.WithUserID.func1
github.com/containerd/containerd/oci.WithUser.func1
github.com/containerd/containerd/oci.ApplyOpts
github.com/containerd/containerd.WithSpec.func1
github.com/containerd/containerd.(*Client).NewContainer
github.com/containerd/containerd/pkg/cri/server.(*criService).CreateContainer
github.com/containerd/containerd/pkg/cri/server.(*instrumentedService).CreateContainer
k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_CreateContainer_Handler.func1
github.com/containerd/containerd/services/server.unaryNamespaceInterceptor
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1
github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1
k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_CreateContainer_Handler
google.golang.org/grpc.(*Server).processUnaryRPC
google.golang.org/grpc.(*Server).handleStream
google.golang.org/grpc.(*Server).serveStreams.func1.2
runtime.goexit.abi0
containerd [34771]
1
```
If there are comming several create requestes, umount actions might
bring high IO pressure on the /var/lib/containerd's underlying disk.
After checkout the kernel code[1], the kernel will not call
__sync_filesystem if the mount is readonly. Based on this, containerd
should use readonly mount to get UID/GID information.
Reference:
* [1] https://elixir.bootlin.com/linux/v5.13/source/fs/sync.c#L61
Closes: #4604
Signed-off-by: Wei Fu <fuweid89@gmail.com>
This commit is contained in:
@@ -602,6 +602,8 @@ func WithUser(userstr string) SpecOpts {
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
mounts = tryReadonlyMounts(mounts)
|
||||
return mount.WithTempMount(ctx, mounts, f)
|
||||
default:
|
||||
return fmt.Errorf("invalid USER value %s", userstr)
|
||||
@@ -655,6 +657,8 @@ func WithUserID(uid uint32) SpecOpts {
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
mounts = tryReadonlyMounts(mounts)
|
||||
return mount.WithTempMount(ctx, mounts, func(root string) error {
|
||||
user, err := UserFromPath(root, func(u user.User) bool {
|
||||
return u.Uid == int(uid)
|
||||
@@ -706,6 +710,8 @@ func WithUsername(username string) SpecOpts {
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
mounts = tryReadonlyMounts(mounts)
|
||||
return mount.WithTempMount(ctx, mounts, func(root string) error {
|
||||
user, err := UserFromPath(root, func(u user.User) bool {
|
||||
return u.Name == username
|
||||
@@ -790,6 +796,8 @@ func WithAdditionalGIDs(userstr string) SpecOpts {
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
mounts = tryReadonlyMounts(mounts)
|
||||
return mount.WithTempMount(ctx, mounts, setAdditionalGids)
|
||||
}
|
||||
}
|
||||
@@ -1278,3 +1286,21 @@ func WithDevShmSize(kb int64) SpecOpts {
|
||||
return ErrNoShmMount
|
||||
}
|
||||
}
|
||||
|
||||
// tryReadonlyMounts is used by the options which are trying to get user/group
|
||||
// information from container's rootfs. Since the option does read operation
|
||||
// only, this helper will append ReadOnly mount option to prevent linux kernel
|
||||
// from syncing whole filesystem in umount syscall.
|
||||
//
|
||||
// TODO(fuweid):
|
||||
//
|
||||
// Currently, it only works for overlayfs. I think we can apply it to other
|
||||
// kinds of filesystem. Maybe we can return `ro` option by `snapshotter.Mount`
|
||||
// API, when the caller passes that experimental annotation
|
||||
// `containerd.io/snapshot/readonly.mount` something like that.
|
||||
func tryReadonlyMounts(mounts []mount.Mount) []mount.Mount {
|
||||
if len(mounts) == 1 && mounts[0].Type == "overlay" {
|
||||
mounts[0].Options = append(mounts[0].Options, "ro")
|
||||
}
|
||||
return mounts
|
||||
}
|
||||
|
||||
Reference in New Issue
Block a user