The Linux kernel never sets the Inheritable capability flag to
anything other than empty. Non-empty values are always exclusively
set by userspace code.
[The kernel stopped defaulting this set of capability values to the
full set in 2000 after a privilege escalation with Capabilities
affecting Sendmail and others.]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Commit fb0688362c implemented the Normalize()
function, but marked these fields as deprecated.
It's unclear what the motivation was for this, as the fields are part of the OCI
Image spec. On Windows, the OSVersion field specifically is important when matching
images (as kernel versions may not be compatible).
This patch updates platforms.Normalize() to preserve the OSVersion and OSFeatures
fields.
As a follow-up, we should look at defining an appropriate string-representation
for these fields (possibly as part of the OCI Spec), and update platforms.Parse()
accordingly.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Allow the following syscalls by default:
- `landlock_add_rule`
- `landlock_create_ruleset`
- `landlock_restrict_self`
See https://landlock.io/
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Unlike the other shims, containerd-shim did not have a -v (version) flag:
./bin/containerd-shim-runc-v1 -v
./bin/containerd-shim-runc-v1:
Version: v1.6.0-rc.1
Revision: ad771115b82a70cfd8018d72ae489c707e63de16.m
Go version: go1.17.2
./bin/containerd-shim -v
flag provided but not defined: -v
Usage of ./bin/containerd-shim:
This patch adds a `-v` flag to be consistent with the other shims. The code was
slightly refactored to match the implementation in the other shims, taking the
same approach as 77d53d2d23/runtime/v2/shim/shim.go (L240-L256)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
POSIX guidelines describes; https://www.gnu.org/prep/standards/html_node/_002d_002dversion.html#g_t_002d_002dversion
> The program’s name should be a constant string; don’t compute it from argv[0].
> The idea is to state the standard or canonical name for the program, not its
> file name.
We don't have a const for this, but let's make a start and just remove the path info.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
I noticed that path information showed up in the version output:
./bin/containerd-shim-runc-v1 -v
./bin/containerd-shim-runc-v1:
Version: v1.6.0-rc.1
Revision: ad771115b82a70cfd8018d72ae489c707e63de16.m
Go version: go1.17.2
POSIX guidelines describes; https://www.gnu.org/prep/standards/html_node/_002d_002dversion.html#g_t_002d_002dversion
> The program’s name should be a constant string; don’t compute it from argv[0].
> The idea is to state the standard or canonical name for the program, not its
> file name.
Unfortunately, this code is used by multiple binaries, so we can't fully remove
the use of os.Args[0], but let's make a start and just remove the path info.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
In linux kernel, the umount writable-mountpoint will try to do sync-fs
to make sure that the dirty pages to the underlying filesystems. The many
number of umount actions in the same time maybe introduce performance
issue in IOPS limited disk.
When CRI-plugin creates container, it will temp-mount rootfs to read
that UID/GID info for entrypoint. Basically, the rootfs is writable
snapshotter and then after read, umount will invoke sync-fs action.
For example, using overlayfs on ext4 and use bcc-tools to monitor
ext4_sync_fs call.
```
// uname -a
Linux chaofan 5.13.0-27-generic #29~20.04.1-Ubuntu SMP Fri Jan 14 00:32:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
// open terminal 1
kubectl run --image=nginx --image-pull-policy=IfNotPresent nginx-pod
// open terminal 2
/usr/share/bcc/tools/stackcount ext4_sync_fs -i 1 -v -P
ext4_sync_fs
sync_filesystem
ovl_sync_fs
__sync_filesystem
sync_filesystem
generic_shutdown_super
kill_anon_super
deactivate_locked_super
deactivate_super
cleanup_mnt
__cleanup_mnt
task_work_run
exit_to_user_mode_prepare
syscall_exit_to_user_mode
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall.Syscall.abi0
github.com/containerd/containerd/mount.unmount
github.com/containerd/containerd/mount.UnmountAll
github.com/containerd/containerd/mount.WithTempMount.func2
github.com/containerd/containerd/mount.WithTempMount
github.com/containerd/containerd/oci.WithUserID.func1
github.com/containerd/containerd/oci.WithUser.func1
github.com/containerd/containerd/oci.ApplyOpts
github.com/containerd/containerd.WithSpec.func1
github.com/containerd/containerd.(*Client).NewContainer
github.com/containerd/containerd/pkg/cri/server.(*criService).CreateContainer
github.com/containerd/containerd/pkg/cri/server.(*instrumentedService).CreateContainer
k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_CreateContainer_Handler.func1
github.com/containerd/containerd/services/server.unaryNamespaceInterceptor
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1
github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1
k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_CreateContainer_Handler
google.golang.org/grpc.(*Server).processUnaryRPC
google.golang.org/grpc.(*Server).handleStream
google.golang.org/grpc.(*Server).serveStreams.func1.2
runtime.goexit.abi0
containerd [34771]
1
```
If there are comming several create requestes, umount actions might
bring high IO pressure on the /var/lib/containerd's underlying disk.
After checkout the kernel code[1], the kernel will not call
__sync_filesystem if the mount is readonly. Based on this, containerd
should use readonly mount to get UID/GID information.
Reference:
* [1] https://elixir.bootlin.com/linux/v5.13/source/fs/sync.c#L61Closes: #4604
Signed-off-by: Wei Fu <fuweid89@gmail.com>
This change addresses the following issues:
* Fix fetching the public IP of the windows instance.
* Fix generation of repolist.toml.
* Resource cleanup is now run even if tests fail.
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
As reported, running import twice without using the compress import
option means that the content store will have existing layers during the
second import and the existing code hardcodes existing layer media type
to compressed. This fixes the issue by actually reading the header bytes
from the store and setting the media type appropriately.
Signed-off-by: Phil Estes <estesp@amazon.com>
This causes sigint/sigterm to trigger a shutdown of the shim.
It is needed because otherwise the v2 shim hangs system shutdown.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
ShimV2 has shim.Delete command to cleanup task's temporary resource,
like bundle folder. Since the shim server exits and no persistent store
is for task's exit code, the result of shim.Delete is always 137 exit
code, like the task has been killed.
And the result of shim.Delete can be used as task event only when the
shim server is killed somehow after container is running. Therefore,
dockerd, which watches task exit event to update status of container,
can report correct status.
Back to the issue #6429, the container is not running because the
entrypoint is not found. Based on this design, we should not send
137 exitcode event to subscriber.
This commit is aimed to remove shim instance first and then the
`cleanupAfterDeadShim` should not send event.
Similar Issue: #4769Fix#6429
Signed-off-by: Wei Fu <fuweid89@gmail.com>
If containerd-shim-runc-v2 process dead abnormally, such as received
kill 9 signal, panic or other unkown reasons, the containerd-shim-runc-v2
server can not reap runc container and forward init process exit event.
This will lead the container leaked in dockerd. When shim dead, containerd
will clean dead shim, here read init process pid and forward exit event
with pid at the same time.
Signed-off-by: Jeff Zvier <zvier20@gmail.com>
The build number used to determine whether we need to pull the Windows
Server 2022 image for the integration tests was previously hardcoded as there
wasn't an hcsshim release with the build number. Now that there is and it's
vendored in, this change just swaps to it.
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
Correctly matches optional variants for amd64
arch. These should be used for standardized values
v1-v4 from https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels.
V1 remains the default and is cleared by default.
Pulling a higher variant will match the highest
available platform lower or equal to the provided one
when platformVector is used.
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
This tag brings in some bug fixes related to waiting for containers to terminate and
trying to kill an already terminated process, as well as tty support (exec -it) for
Windows Host Process Containers.
Signed-off-by: Daniel Canter <dcanter@microsoft.com>