Commit Graph

11307 Commits

Author SHA1 Message Date
Eric Ernst
8b9571e348 containerd-stress: start task ctr before starting execs
For some runtimes, the container is not ready for exec until the
initial container task has been started (as opposed to just having the task created).

More specifically, running containerd-stress with --exec would break
with Kata Container shim, since the sandbox is not created until a
start is issued. By starting the container's primary task before adding
exec's, we can avoid:
```
error="cannot enter container exec-container-1, with err Sandbox not running, impossible to enter the container: unknown"
```

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2022-02-04 16:08:44 -08:00
Gabriel Adrian Samfira
b63000c65d
[Windows][Integration] Enable TestRestartMonitor
With the release of hcsshim v0.9.2, this test should pass without
issues on Windows.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2022-02-04 17:27:14 +02:00
Markus Lehtonen
9b1fb82584 cri: fix handling of ignore_rdt_not_enabled_errors config option
We were not properly ignoring errors from
gorestrl.rdt.ContainerClassFromAnnotations() causing the config option
to be ineffective, in practice.

Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
2022-02-04 13:54:03 +02:00
Akihiro Suda
4f5ce5615a
Merge pull request #6501 from henry118/issue6499
Document fs_type and fs_options in snapshots/devmapper/README.md
2022-02-04 18:04:29 +09:00
Maksym Pavlenko
a5d093991a
Merge pull request #6510 from smira/adoption-talos 2022-02-03 12:36:49 -08:00
Andrey Smirnov
dcbe3e4713
docs: add Talos Linux to the list of adopters
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-02-03 21:10:28 +03:00
Derek McGowan
943ca856ad
Merge pull request #6502 from dmcgowan/prepare-1.6.0-rc.2
Prepare 1.6.0-rc.2
2022-02-03 08:54:18 -08:00
Jeremi Piotrowski
7275411ec8 cgroup2: monitor OOMKill instead of OOM to prevent missing container OOM events
With the cgroupv2 configuration employed by Kubernetes, the pod cgroup (slice)
and container cgroup (scope) will both have the same memory limit applied. In
that situation, the kernel will consider an OOM event to be triggered by the
parent cgroup (slice), and increment 'oom' there. The child cgroup (scope) only
sees an oom_kill increment. Since we monitor child cgroups for oom events,
check the OOMKill field so that we don't miss events.

This is not visible when running containers through docker or ctr, because they
set the limits differently (only container level). An alternative would be to
not configure limits at the pod level - that way the container limit will be
hit and the OOM will be correctly generated. An interesting consequence is that
when spawning a pod with multiple containers, the oom events also work
correctly, because:

a) if one of the containers has no limit, the pod has no limit so OOM events in
   another container report correctly.
b) if all of the containers have limits then the pod limit will be a sum of
   container events, so a container will be able to hit its limit first.

Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
2022-02-03 13:39:16 +01:00
Jeremi Piotrowski
821c961c86 pkg/oom/v2: handle EventChan routine shutdown quietly
When the cgroup is removed, EventChan is closed (this was pulled in by
8d69c041c5). This results in a nil error
being received. Don't log an error in that case but instead return.

Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
2022-02-03 13:20:46 +01:00
Henry Wang
2d9d5fddbd Document fs_type and fs_options in snapshots/devmapper/README.md
Resolves: #6499

Signed-off-by: Henry Wang <henwang@amazon.com>
2022-02-02 21:57:44 +00:00
Derek McGowan
604c462d7b
Merge pull request #6497 from thaJeztah/platform_keep_osversion_osfeatures
platforms.Normalize(): do not reset OSVersion and OSFeatures
2022-02-02 12:06:09 -08:00
Michael Crosby
9a08d6fcde
Merge pull request #6457 from kzys/otel-http
tracing: use OTLP/HTTP in addition to OTLP/gRPC
2022-02-02 14:24:15 -05:00
Derek McGowan
a31e28e2c2
Prepare release notes for v1.6.0-rc.2
Signed-off-by: Derek McGowan <derek@mcg.dev>
2022-02-02 11:01:31 -08:00
Derek McGowan
8944c12f56
Update releases document
Move 1.4 EOL after 1.6 release.
Update latest 1.4 and 1.5 versions.

Signed-off-by: Derek McGowan <derek@mcg.dev>
2022-02-02 11:00:45 -08:00
Phil Estes
75d594834d
Merge pull request #6498 from dmcgowan/update-cgroups-1_0_3
Update cgroups to v1.0.3
2022-02-02 08:55:40 -05:00
Derek McGowan
d6a576ae6e
Merge pull request #6494 from AkihiroSuda/seccomp-5.16
seccomp: kernel 5.11 -> 5.16
2022-02-01 18:13:36 -08:00
Derek McGowan
05177ab5cd
Merge pull request #6243 from ktock/pusher-abort
remotes: fix dockerPusher to handle abort correctly
2022-02-01 18:07:46 -08:00
Derek McGowan
8d69c041c5
Update cgroups to v1.0.3
Pull in latest cgroups to pick up leak fixes

Signed-off-by: Derek McGowan <derek@mcg.dev>
2022-02-01 16:57:51 -08:00
Andrew G. Morgan
6906b57c72
Fix the Inheritable capability defaults.
The Linux kernel never sets the Inheritable capability flag to
anything other than empty. Non-empty values are always exclusively
set by userspace code.

[The kernel stopped defaulting this set of capability values to the
 full set in 2000 after a privilege escalation with Capabilities
 affecting Sendmail and others.]

Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
2022-02-01 13:55:46 -08:00
Sebastiaan van Stijn
bec6e4dd67
platforms.Normalize(): do not reset OSVersion and OSFeatures
Commit fb0688362c implemented the Normalize()
function, but marked these fields as deprecated.

It's unclear what the motivation was for this, as the fields are part of the OCI
Image spec. On Windows, the OSVersion field specifically is important when matching
images (as kernel versions may not be compatible).

This patch updates platforms.Normalize() to preserve the OSVersion and OSFeatures
fields.

As a follow-up, we should look at defining an appropriate string-representation
for these fields (possibly as part of the OCI Spec), and update platforms.Parse()
accordingly.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2022-02-01 17:19:28 +01:00
Akihiro Suda
34f7173491
seccomp: kernel 5.16 (futex_waitv)
Allow `futex_waitv` by default.
See https://www.phoronix.com/scan.php?page=news_item&px=FUTEX2-futex-waiv-More-Archs

Note: libseccomp does not cover kernel 5.16 at this moment:
51b50f95e1/src/syscalls.csv

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2022-02-01 09:08:06 +09:00
Akihiro Suda
8632bdcb7b
seccomp: kernel 5.15 (process_mrelease)
Allow `process_mrelease` by default.

See https://lwn.net/Articles/864184/

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2022-02-01 09:08:05 +09:00
Akihiro Suda
c013db6965
seccomp: kernel 5.14 (quotactl_fd, memfd_secret)
- Allow `quotactl_fd` when `CAP_SYS_ADMIN` is granted.
  See https://lwn.net/Articles/859679/

- Allow `memfd_secret` by default.
  See https://lwn.net/Articles/865256/

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2022-02-01 09:08:01 +09:00
Akihiro Suda
17a2831f70
seccomp: kernel 5.13 (landlock_{add_rule,create_ruleset,restrict_self})
Allow the following syscalls by default:
- `landlock_add_rule`
- `landlock_create_ruleset`
- `landlock_restrict_self`

See https://landlock.io/

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2022-02-01 09:07:33 +09:00
Akihiro Suda
1329ea3716
seccomp: kernel 5.12 (mount_setattr)
Allow `mount_setattr` when `CAP_SYS_ADMIN` is granted.

See https://man7.org/linux/man-pages/man2/mount_setattr.2.html

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2022-02-01 09:06:41 +09:00
Michael Crosby
52b8ca5545
Merge pull request #6411 from nmeum/swapcontext
seccomp: add support for "swapcontext" syscall in default policy
2022-01-31 16:11:55 -05:00
Sebastiaan van Stijn
fdbfde5d81
cmd/containerd-shim: add -v (version) flag
Unlike the other shims, containerd-shim did not have a -v (version) flag:

    ./bin/containerd-shim-runc-v1 -v
    ./bin/containerd-shim-runc-v1:
    Version: v1.6.0-rc.1
    Revision: ad771115b82a70cfd8018d72ae489c707e63de16.m
    Go version: go1.17.2

    ./bin/containerd-shim -v
    flag provided but not defined: -v
    Usage of ./bin/containerd-shim:

This patch adds a `-v` flag to be consistent with the other shims. The code was
slightly refactored to match the implementation in the other shims, taking the
same approach as 77d53d2d23/runtime/v2/shim/shim.go (L240-L256)

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2022-01-31 21:09:50 +01:00
Sebastiaan van Stijn
e79aba10d4
integration/images/volume-ownership: strip path information from usage output
POSIX guidelines describes; https://www.gnu.org/prep/standards/html_node/_002d_002dversion.html#g_t_002d_002dversion

> The program’s name should be a constant string; don’t compute it from argv[0].
> The idea is to state the standard or canonical name for the program, not its
> file name.

We don't have a const for this, but let's make a start and just remove the path info.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2022-01-31 21:07:00 +01:00
Sebastiaan van Stijn
b8cadf7539
runtime/v2/shim: strip path information from version output
I noticed that path information showed up in the version output:

    ./bin/containerd-shim-runc-v1 -v
    ./bin/containerd-shim-runc-v1:
    Version:  v1.6.0-rc.1
    Revision: ad771115b82a70cfd8018d72ae489c707e63de16.m
    Go version: go1.17.2

POSIX guidelines describes; https://www.gnu.org/prep/standards/html_node/_002d_002dversion.html#g_t_002d_002dversion

> The program’s name should be a constant string; don’t compute it from argv[0].
> The idea is to state the standard or canonical name for the program, not its
> file name.

Unfortunately, this code is used by multiple binaries, so we can't fully remove
the use of os.Args[0], but let's make a start and just remove the path info.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2022-01-31 21:01:01 +01:00
Derek McGowan
c2cb589221
Merge pull request #6478 from fuweid/enhance-no-sync-during-create
oci: use readonly mount to read user/group info
2022-01-31 10:35:51 -08:00
Michael Crosby
e178d831ef
Merge pull request #6475 from estesp/import-correct-media-type
Fix possibly incorrect media type default on import
2022-01-31 11:47:24 -05:00
Michael Crosby
82af36e59b
Merge pull request #5828 from cpuguy83/shimv2_exit_on_signals
shimv2: handle sigint/sigterm
2022-01-31 10:47:39 -05:00
Kazuyoshi Kato
cc59ae4d98 tracing: return (ctx, span) from StartSpan
OpenTelemetry's Tracer#Start() returns (ctx, span). We have no reasons
to swap them.

Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
2022-01-29 00:41:21 +00:00
Kazuyoshi Kato
e751f1f44f tracing: support OTLP/HTTP in addition to gRPC
This change adds OTLP/HTTP, specifically http/protobuf support.

http/protobuf is recommended in
https://github.com/open-telemetry/opentelemetry-specification/blob/v1.8.0/specification/protocol/exporter.md.

However kube-apiserver and CRI-O use gRPC, kubelet may support
gRPC in future. So we should support gRPC as well.

Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
2022-01-29 00:41:18 +00:00
Michael Crosby
9c676e98dd
Merge pull request #6481 from Junnplus/acr-400
Fix acr fetch token 400
2022-01-28 11:53:51 -05:00
Wei Fu
813a061fe1 oci: use readonly mount to read user/group info
In linux kernel, the umount writable-mountpoint will try to do sync-fs
to make sure that the dirty pages to the underlying filesystems. The many
number of umount actions in the same time maybe introduce performance
issue in IOPS limited disk.

When CRI-plugin creates container, it will temp-mount rootfs to read
that UID/GID info for entrypoint. Basically, the rootfs is writable
snapshotter and then after read, umount will invoke sync-fs action.

For example, using overlayfs on ext4 and use bcc-tools to monitor
ext4_sync_fs call.

```
// uname -a
Linux chaofan 5.13.0-27-generic #29~20.04.1-Ubuntu SMP Fri Jan 14 00:32:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

// open terminal 1
kubectl run --image=nginx --image-pull-policy=IfNotPresent nginx-pod

// open terminal 2
/usr/share/bcc/tools/stackcount ext4_sync_fs -i 1 -v -P

  ext4_sync_fs
  sync_filesystem
  ovl_sync_fs
  __sync_filesystem
  sync_filesystem
  generic_shutdown_super
  kill_anon_super
  deactivate_locked_super
  deactivate_super
  cleanup_mnt
  __cleanup_mnt
  task_work_run
  exit_to_user_mode_prepare
  syscall_exit_to_user_mode
  do_syscall_64
  entry_SYSCALL_64_after_hwframe
  syscall.Syscall.abi0
  github.com/containerd/containerd/mount.unmount
  github.com/containerd/containerd/mount.UnmountAll
  github.com/containerd/containerd/mount.WithTempMount.func2
  github.com/containerd/containerd/mount.WithTempMount
  github.com/containerd/containerd/oci.WithUserID.func1
  github.com/containerd/containerd/oci.WithUser.func1
  github.com/containerd/containerd/oci.ApplyOpts
  github.com/containerd/containerd.WithSpec.func1
  github.com/containerd/containerd.(*Client).NewContainer
  github.com/containerd/containerd/pkg/cri/server.(*criService).CreateContainer
  github.com/containerd/containerd/pkg/cri/server.(*instrumentedService).CreateContainer
  k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_CreateContainer_Handler.func1
  github.com/containerd/containerd/services/server.unaryNamespaceInterceptor
  github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1
  github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1
  github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1
  go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1
  github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1
  github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1
  k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_CreateContainer_Handler
  google.golang.org/grpc.(*Server).processUnaryRPC
  google.golang.org/grpc.(*Server).handleStream
  google.golang.org/grpc.(*Server).serveStreams.func1.2
  runtime.goexit.abi0
    containerd [34771]
    1
```

If there are comming several create requestes, umount actions might
bring high IO pressure on the /var/lib/containerd's underlying disk.

After checkout the kernel code[1], the kernel will not call
__sync_filesystem if the mount is readonly. Based on this, containerd
should use readonly mount to get UID/GID information.

Reference:

* [1] https://elixir.bootlin.com/linux/v5.13/source/fs/sync.c#L61

Closes: #4604

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2022-01-28 23:36:04 +08:00
Phil Estes
a43703fcba
Merge pull request #6455 from tonistiigi/amd64-variants
platforms: add support for matching amd64 variants
2022-01-27 10:07:49 -05:00
ye.sijun
c0e00f19ab fix acr fetch token 400
Signed-off-by: ye.sijun <junnplus@gmail.com>
2022-01-27 17:34:45 +08:00
Derek McGowan
3f5d789dfb
Merge pull request #6476 from gabriel-samfira/various-periodic-fixes
Fix windows periodic workflow
2022-01-25 16:43:11 -08:00
Gabriel Adrian Samfira
4cd9f37f56
Fix windows periodic workflow
This change addresses the following issues:

  * Fix fetching the public IP of the windows instance.
  * Fix generation of repolist.toml.
  * Resource cleanup is now run even if tests fail.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
2022-01-25 21:54:16 +02:00
Phil Estes
4aff7431fe
Fix possibly incorrect media type default on import
As reported, running import twice without using the compress import
option means that the content store will have existing layers during the
second import and the existing code hardcodes existing layer media type
to compressed. This fixes the issue by actually reading the header bytes
from the store and setting the media type appropriately.

Signed-off-by: Phil Estes <estesp@amazon.com>
2022-01-25 14:11:20 -05:00
Brian Goff
3ffb6a6113 shimv2: handle sigint/sigterm
This causes sigint/sigterm to trigger a shutdown of the shim.
It is needed because otherwise the v2 shim hangs system shutdown.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2022-01-25 17:57:28 +00:00
Fu Wei
2986d5b077
Merge pull request #6473 from kzys/gc-docs 2022-01-25 13:25:49 +08:00
Kazuyoshi Kato
f048a25938 docs: add doc-comments on GC-related methods
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
2022-01-24 14:26:14 -08:00
Maksym Pavlenko
731518417e
Merge pull request #6465 from fuweid/fix-issue-6429
fix: should not send 137 code event if cmd is notfound
2022-01-21 14:50:46 -08:00
Wei Fu
31a710c492 fix: should not send 137 code event if cmd is notfound
ShimV2 has shim.Delete command to cleanup task's temporary resource,
like bundle folder. Since the shim server exits and no persistent store
is for task's exit code, the result of shim.Delete is always 137 exit
code, like the task has been killed.

And the result of shim.Delete can be used as task event only when the
shim server is killed somehow after container is running. Therefore,
dockerd, which watches task exit event to update status of container,
can report correct status.

Back to the issue #6429, the container is not running because the
entrypoint is not found. Based on this design, we should not send
137 exitcode event to subscriber.

This commit is aimed to remove shim instance first and then the
`cleanupAfterDeadShim` should not send event.

Similar Issue: #4769
Fix #6429

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2022-01-22 00:58:33 +08:00
Phil Estes
ab8d99cf4b
Merge pull request #6463 from Junnplus/empty-scope
Fix empty scopes return
2022-01-20 15:34:11 -05:00
Jeff Zvier
356ca75757 containerd-shim-runc-v2: return init pid when clean dead shim
If containerd-shim-runc-v2 process dead abnormally, such as received
kill 9 signal, panic or other unkown reasons, the containerd-shim-runc-v2
server can not reap runc container and forward init process exit event.
This will lead the container leaked in dockerd. When shim dead, containerd
will clean dead shim, here read init process pid and forward exit event
with pid at the same time.

Signed-off-by: Jeff Zvier <zvier20@gmail.com>
2022-01-20 17:06:55 +08:00
ye.sijun
936faf9c98 fix empty scopes return
Signed-off-by: ye.sijun <junnplus@gmail.com>
2022-01-20 15:16:44 +08:00
Derek McGowan
ad771115b8
Merge pull request #6462 from dmcgowan/prepare-1.6.0-rc.1
Prepare release notes for v1.6.0-rc.1
2022-01-19 19:13:47 -08:00