For some runtimes, the container is not ready for exec until the
initial container task has been started (as opposed to just having the task created).
More specifically, running containerd-stress with --exec would break
with Kata Container shim, since the sandbox is not created until a
start is issued. By starting the container's primary task before adding
exec's, we can avoid:
```
error="cannot enter container exec-container-1, with err Sandbox not running, impossible to enter the container: unknown"
```
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
With the release of hcsshim v0.9.2, this test should pass without
issues on Windows.
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
We were not properly ignoring errors from
gorestrl.rdt.ContainerClassFromAnnotations() causing the config option
to be ineffective, in practice.
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
With the cgroupv2 configuration employed by Kubernetes, the pod cgroup (slice)
and container cgroup (scope) will both have the same memory limit applied. In
that situation, the kernel will consider an OOM event to be triggered by the
parent cgroup (slice), and increment 'oom' there. The child cgroup (scope) only
sees an oom_kill increment. Since we monitor child cgroups for oom events,
check the OOMKill field so that we don't miss events.
This is not visible when running containers through docker or ctr, because they
set the limits differently (only container level). An alternative would be to
not configure limits at the pod level - that way the container limit will be
hit and the OOM will be correctly generated. An interesting consequence is that
when spawning a pod with multiple containers, the oom events also work
correctly, because:
a) if one of the containers has no limit, the pod has no limit so OOM events in
another container report correctly.
b) if all of the containers have limits then the pod limit will be a sum of
container events, so a container will be able to hit its limit first.
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
When the cgroup is removed, EventChan is closed (this was pulled in by
8d69c041c5). This results in a nil error
being received. Don't log an error in that case but instead return.
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
The Linux kernel never sets the Inheritable capability flag to
anything other than empty. Non-empty values are always exclusively
set by userspace code.
[The kernel stopped defaulting this set of capability values to the
full set in 2000 after a privilege escalation with Capabilities
affecting Sendmail and others.]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Commit fb0688362c implemented the Normalize()
function, but marked these fields as deprecated.
It's unclear what the motivation was for this, as the fields are part of the OCI
Image spec. On Windows, the OSVersion field specifically is important when matching
images (as kernel versions may not be compatible).
This patch updates platforms.Normalize() to preserve the OSVersion and OSFeatures
fields.
As a follow-up, we should look at defining an appropriate string-representation
for these fields (possibly as part of the OCI Spec), and update platforms.Parse()
accordingly.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Allow the following syscalls by default:
- `landlock_add_rule`
- `landlock_create_ruleset`
- `landlock_restrict_self`
See https://landlock.io/
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Unlike the other shims, containerd-shim did not have a -v (version) flag:
./bin/containerd-shim-runc-v1 -v
./bin/containerd-shim-runc-v1:
Version: v1.6.0-rc.1
Revision: ad771115b82a70cfd8018d72ae489c707e63de16.m
Go version: go1.17.2
./bin/containerd-shim -v
flag provided but not defined: -v
Usage of ./bin/containerd-shim:
This patch adds a `-v` flag to be consistent with the other shims. The code was
slightly refactored to match the implementation in the other shims, taking the
same approach as 77d53d2d23/runtime/v2/shim/shim.go (L240-L256)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
POSIX guidelines describes; https://www.gnu.org/prep/standards/html_node/_002d_002dversion.html#g_t_002d_002dversion
> The program’s name should be a constant string; don’t compute it from argv[0].
> The idea is to state the standard or canonical name for the program, not its
> file name.
We don't have a const for this, but let's make a start and just remove the path info.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
I noticed that path information showed up in the version output:
./bin/containerd-shim-runc-v1 -v
./bin/containerd-shim-runc-v1:
Version: v1.6.0-rc.1
Revision: ad771115b82a70cfd8018d72ae489c707e63de16.m
Go version: go1.17.2
POSIX guidelines describes; https://www.gnu.org/prep/standards/html_node/_002d_002dversion.html#g_t_002d_002dversion
> The program’s name should be a constant string; don’t compute it from argv[0].
> The idea is to state the standard or canonical name for the program, not its
> file name.
Unfortunately, this code is used by multiple binaries, so we can't fully remove
the use of os.Args[0], but let's make a start and just remove the path info.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
In linux kernel, the umount writable-mountpoint will try to do sync-fs
to make sure that the dirty pages to the underlying filesystems. The many
number of umount actions in the same time maybe introduce performance
issue in IOPS limited disk.
When CRI-plugin creates container, it will temp-mount rootfs to read
that UID/GID info for entrypoint. Basically, the rootfs is writable
snapshotter and then after read, umount will invoke sync-fs action.
For example, using overlayfs on ext4 and use bcc-tools to monitor
ext4_sync_fs call.
```
// uname -a
Linux chaofan 5.13.0-27-generic #29~20.04.1-Ubuntu SMP Fri Jan 14 00:32:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
// open terminal 1
kubectl run --image=nginx --image-pull-policy=IfNotPresent nginx-pod
// open terminal 2
/usr/share/bcc/tools/stackcount ext4_sync_fs -i 1 -v -P
ext4_sync_fs
sync_filesystem
ovl_sync_fs
__sync_filesystem
sync_filesystem
generic_shutdown_super
kill_anon_super
deactivate_locked_super
deactivate_super
cleanup_mnt
__cleanup_mnt
task_work_run
exit_to_user_mode_prepare
syscall_exit_to_user_mode
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall.Syscall.abi0
github.com/containerd/containerd/mount.unmount
github.com/containerd/containerd/mount.UnmountAll
github.com/containerd/containerd/mount.WithTempMount.func2
github.com/containerd/containerd/mount.WithTempMount
github.com/containerd/containerd/oci.WithUserID.func1
github.com/containerd/containerd/oci.WithUser.func1
github.com/containerd/containerd/oci.ApplyOpts
github.com/containerd/containerd.WithSpec.func1
github.com/containerd/containerd.(*Client).NewContainer
github.com/containerd/containerd/pkg/cri/server.(*criService).CreateContainer
github.com/containerd/containerd/pkg/cri/server.(*instrumentedService).CreateContainer
k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_CreateContainer_Handler.func1
github.com/containerd/containerd/services/server.unaryNamespaceInterceptor
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1
github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1
k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_CreateContainer_Handler
google.golang.org/grpc.(*Server).processUnaryRPC
google.golang.org/grpc.(*Server).handleStream
google.golang.org/grpc.(*Server).serveStreams.func1.2
runtime.goexit.abi0
containerd [34771]
1
```
If there are comming several create requestes, umount actions might
bring high IO pressure on the /var/lib/containerd's underlying disk.
After checkout the kernel code[1], the kernel will not call
__sync_filesystem if the mount is readonly. Based on this, containerd
should use readonly mount to get UID/GID information.
Reference:
* [1] https://elixir.bootlin.com/linux/v5.13/source/fs/sync.c#L61Closes: #4604
Signed-off-by: Wei Fu <fuweid89@gmail.com>
This change addresses the following issues:
* Fix fetching the public IP of the windows instance.
* Fix generation of repolist.toml.
* Resource cleanup is now run even if tests fail.
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
As reported, running import twice without using the compress import
option means that the content store will have existing layers during the
second import and the existing code hardcodes existing layer media type
to compressed. This fixes the issue by actually reading the header bytes
from the store and setting the media type appropriately.
Signed-off-by: Phil Estes <estesp@amazon.com>
This causes sigint/sigterm to trigger a shutdown of the shim.
It is needed because otherwise the v2 shim hangs system shutdown.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
ShimV2 has shim.Delete command to cleanup task's temporary resource,
like bundle folder. Since the shim server exits and no persistent store
is for task's exit code, the result of shim.Delete is always 137 exit
code, like the task has been killed.
And the result of shim.Delete can be used as task event only when the
shim server is killed somehow after container is running. Therefore,
dockerd, which watches task exit event to update status of container,
can report correct status.
Back to the issue #6429, the container is not running because the
entrypoint is not found. Based on this design, we should not send
137 exitcode event to subscriber.
This commit is aimed to remove shim instance first and then the
`cleanupAfterDeadShim` should not send event.
Similar Issue: #4769Fix#6429
Signed-off-by: Wei Fu <fuweid89@gmail.com>
If containerd-shim-runc-v2 process dead abnormally, such as received
kill 9 signal, panic or other unkown reasons, the containerd-shim-runc-v2
server can not reap runc container and forward init process exit event.
This will lead the container leaked in dockerd. When shim dead, containerd
will clean dead shim, here read init process pid and forward exit event
with pid at the same time.
Signed-off-by: Jeff Zvier <zvier20@gmail.com>