Commit Graph

181 Commits

Author SHA1 Message Date
Sebastiaan van Stijn
cb2c3ec8f8 oci: partially restore comment on read-only mounts for uid/gid uses
Commit cab056226f removed the tryReadonlyMounts
utility, in favor of mounts.ReadOnlyMounts() that was added in commit
daa3a7665e.

That change made part of the comment redundant, because mounts.ReadOnlyMounts
handles both overlayfs read-only mounts (by skipping the workdir mounts), and
sets the "ro" option for other mount-types, but the reason why we're using a
read-only mount is still relevant, so restoring that part of the comment.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2023-04-15 13:54:23 +02:00
Djordje Lukic
cab056226f oci: Use WithReadonlyTempMount when adding users/groups
Signed-off-by: Djordje Lukic <djordje.lukic@docker.com>
2023-04-05 12:09:36 +02:00
Maksym Pavlenko
87346df54f Defer uid lookups on Darwin
Signed-off-by: Maksym Pavlenko <pavlenko.maksym@gmail.com>
2023-03-27 10:24:01 -07:00
Fu Wei
5ae3a7f417 Merge pull request #8198 from kiashok/argsEscapedSupportInCri
Add ArgsEscaped support for CRI
2023-03-07 16:12:24 +08:00
Kirtana Ashok
8137e41c48 Add ArgsEscaped support for CRI
This commit adds supports for the ArgsEscaped
value for the image got from the dockerfile.
It is used to evaluate and process the image
entrypoint/cmd and container entrypoint/cmd
options got from the podspec.

Signed-off-by: Kirtana Ashok <Kirtana.Ashok@microsoft.com>
2023-03-03 13:38:06 -08:00
Akihiro Suda
3eda46af12 oci: fix additional GIDs
Test suite:
```yaml

---
apiVersion: v1
kind: Pod
metadata:
  name: test-no-option
  annotations:
    description: "Equivalent of `docker run` (no option)"
spec:
  restartPolicy: Never
  containers:
    - name: main
      image: ghcr.io/containerd/busybox:1.28
      args: ['sh', '-euxc',
             '[ "$(id)" = "uid=0(root) gid=0(root) groups=0(root),10(wheel)" ]']
---
apiVersion: v1
kind: Pod
metadata:
  name: test-group-add-1-group-add-1234
  annotations:
    description: "Equivalent of `docker run --group-add 1 --group-add 1234`"
spec:
  restartPolicy: Never
  containers:
    - name: main
      image: ghcr.io/containerd/busybox:1.28
      args: ['sh', '-euxc',
             '[ "$(id)" = "uid=0(root) gid=0(root) groups=0(root),1(daemon),10(wheel),1234" ]']
  securityContext:
    supplementalGroups: [1, 1234]
---
apiVersion: v1
kind: Pod
metadata:
  name: test-user-1234
  annotations:
    description: "Equivalent of `docker run --user 1234`"
spec:
  restartPolicy: Never
  containers:
    - name: main
      image: ghcr.io/containerd/busybox:1.28
      args: ['sh', '-euxc',
             '[ "$(id)" = "uid=1234 gid=0(root) groups=0(root)" ]']
  securityContext:
    runAsUser: 1234
---
apiVersion: v1
kind: Pod
metadata:
  name: test-user-1234-1234
  annotations:
    description: "Equivalent of `docker run --user 1234:1234`"
spec:
  restartPolicy: Never
  containers:
    - name: main
      image: ghcr.io/containerd/busybox:1.28
      args: ['sh', '-euxc',
             '[ "$(id)" = "uid=1234 gid=1234 groups=1234" ]']
  securityContext:
    runAsUser: 1234
    runAsGroup: 1234
---
apiVersion: v1
kind: Pod
metadata:
  name: test-user-1234-group-add-1234
  annotations:
    description: "Equivalent of `docker run --user 1234 --group-add 1234`"
spec:
  restartPolicy: Never
  containers:
    - name: main
      image: ghcr.io/containerd/busybox:1.28
      args: ['sh', '-euxc',
             '[ "$(id)" = "uid=1234 gid=0(root) groups=0(root),1234" ]']
  securityContext:
    runAsUser: 1234
    supplementalGroups: [1234]
```

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2023-02-10 15:53:00 +09:00
Akihiro Suda
ef2560d166 oci: fix loop iterator aliasing
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2023-02-10 15:53:00 +09:00
Maksym Pavlenko
1ade777c24 Add basic spec and mounts for Darwin
Signed-off-by: Maksym Pavlenko <pavlenko.maksym@gmail.com>
2023-01-12 17:00:40 -08:00
Maksym Pavlenko
40be96efa9 Have separate spec builder for each platform
Signed-off-by: Maksym Pavlenko <pavlenko.maksym@gmail.com>
2023-01-11 13:12:25 -08:00
Maksym Pavlenko
0ae0399b16 Make OCI spec opts available on all platforms
Signed-off-by: Maksym Pavlenko <pavlenko.maksym@gmail.com>
2023-01-11 13:03:58 -08:00
Derek McGowan
729206f6d0 Merge pull request #7874 from thaJeztah/appendOSMounts_error
oci: appendOSMounts(): remove unused error, and move
2022-12-28 20:04:06 -08:00
Fu Wei
4fe2d14e1b Merge pull request #7869 from dcantah/domainname-oci
oci: Add WithDomainname
2022-12-27 19:18:12 +08:00
Sebastiaan van Stijn
94c68aa001 oci: appendOSMounts(): remove unused error, and move
This function was added in ae22854e2b, but never
returned an error, and the error-return was not handled on the callsite. This
patch removes the unused error return, and moves it to a file related to mounts,
which allowed for some of the stubs to be removed and shared between non-FreeBSD
platforms.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2022-12-27 10:23:26 +01:00
yanggang
b10536d64f Reused errdefs define error
Signed-off-by: yanggang <gang.yang@daocloud.io>
2022-12-27 14:09:40 +08:00
Danny Canter
229779a4e5 oci: Add WithDomainname
A domainname field was recently added to the OCI spec. Prior to this
folks would need to set this with a sysctl, but now runtimes should be
able to setdomainname(2). There's an open change to runc at the moment
to add support for this so I've just left testing as a couple spec
validations in CRI until that's in and usable.

Signed-off-by: Danny Canter <danny@dcantah.dev>
2022-12-26 04:03:45 -05:00
Maksym Pavlenko
3bc8fc4d30 Cleanup build constraints
Signed-off-by: Maksym Pavlenko <pavlenko.maksym@gmail.com>
2022-12-08 09:36:20 -08:00
Kazuyoshi Kato
1e200b149a Merge pull request #7523 from ginglis13/symlink-device
add option to resolve symlinks in WithLinuxDevice
2022-11-14 16:32:46 -08:00
Kazuyoshi Kato
02484f5e05 Merge pull request #7631 from thaJeztah/strings_cut
replace strings.Split(N) for strings.Cut() or alternatives
2022-11-10 15:28:22 -08:00
Merlin Ran
99ac7a7714 add oci.WithCPURT
to set realtime scheduling options.

Signed-off-by: Merlin Ran <merlinran@gmail.com>
2022-11-08 09:39:27 -05:00
Sebastiaan van Stijn
eaedadbed0 replace strings.Split(N) for strings.Cut() or alternatives
Go 1.18 and up now provides a strings.Cut() which is better suited for
splitting key/value pairs (and similar constructs), and performs better:

```go
func BenchmarkSplit(b *testing.B) {
        b.ReportAllocs()
        data := []string{"12hello=world", "12hello=", "12=hello", "12hello"}
        for i := 0; i < b.N; i++ {
                for _, s := range data {
                        _ = strings.SplitN(s, "=", 2)[0]
                }
        }
}

func BenchmarkCut(b *testing.B) {
        b.ReportAllocs()
        data := []string{"12hello=world", "12hello=", "12=hello", "12hello"}
        for i := 0; i < b.N; i++ {
                for _, s := range data {
                        _, _, _ = strings.Cut(s, "=")
                }
        }
}
```

    BenchmarkSplit
    BenchmarkSplit-10            8244206               128.0 ns/op           128 B/op          4 allocs/op
    BenchmarkCut
    BenchmarkCut-10             54411998                21.80 ns/op            0 B/op          0 allocs/op

While looking at occurrences of `strings.Split()`, I also updated some for alternatives,
or added some constraints; for cases where an specific number of items is expected, I used `strings.SplitN()`
with a suitable limit. This prevents (theoretical) unlimited splits.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2022-11-07 10:02:25 +01:00
Gavin Inglis
81bbd9daca add option to resolve symlinks to linux device
This change modifies WithLinuxDevice to take an option `followSymlink`
and be unexported as `withLinuxDevice`. An option
`WithLinuxDeviceFollowSymlinks` will call this unexported option to
follow a symlink, which will resolve a symlink before calling
`DeviceFromPath`. `WithLinuxDevice` has been changed to call
`withLinuxDevice` without following symlinks.

Signed-off-by: Gavin Inglis <giinglis@amazon.com>
2022-11-03 20:23:25 +00:00
Fu Wei
9b54eee718 Merge pull request #7419 from bart0sh/PR005-configure-CDI-registry-on-start 2022-10-22 08:17:33 +08:00
Sebastiaan van Stijn
8b5df7d347 update golangci-lint to v1.49.0
Also remove "nolint" comments for deadcode, which is deprecated, and removed
from the defaults.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2022-10-12 14:41:01 +02:00
Sebastiaan van Stijn
f9c80be1bb remove unneeded nolint-comments (nolintlint), disable deprecated linters
Remove nolint-comments that weren't hit by linters, and remove the "structcheck"
and "varcheck" linters, as they have been deprecated:

    WARN [runner] The linter 'structcheck' is deprecated (since v1.49.0) due to: The owner seems to have abandoned the linter.  Replaced by unused.
    WARN [runner] The linter 'varcheck' is deprecated (since v1.49.0) due to: The owner seems to have abandoned the linter.  Replaced by unused.
    WARN [linters context] structcheck is disabled because of generics. You can track the evolution of the generics support by following the https://github.com/golangci/golangci-lint/issues/2649.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2022-10-12 14:41:01 +02:00
Sebastiaan van Stijn
29c7fc9520 clean-up "nolint" comments, remove unused ones
- fix "nolint" comments to be in the correct format (`//nolint:<linters>[,<linter>`
  no leading space, required colon (`:`) and linters.
- remove "nolint" comments for errcheck, which is disabled in our config.
- remove "nolint" comments that were no longer needed (nolintlint).
- where known, add a comment describing why a "nolint" was applied.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2022-10-12 14:40:59 +02:00
Ed Bartosh
eec7a76ecd move WithCDI to pkg/cri/opts
As WithCDI is CRI-only API it makes sense to move it
out of oci module.

This move can also fix possible issues with this API when
CRI plugin is disabled.

Signed-off-by: Ed Bartosh <eduard.bartosh@intel.com>
2022-10-12 13:45:20 +03:00
Justin Terry
d4b9dade13 Updates oci image config to support upstream ArgsEscaped
ArgsEscaped has now been merged into upstream OCI image spec.
This change removes the workaround we were doing in containerd
to deserialize the extra json outside of the spec and instead
just uses the formal spec types.

Signed-off-by: Justin Terry <jlterry@amazon.com>
2022-10-11 13:29:56 -07:00
Abirdcfly
dcfaa30ba2 chore: remove duplicate word in comments
Signed-off-by: Abirdcfly Fu <fp544037857@gmail.com>
2022-08-29 13:05:32 +08:00
Maksym Pavlenko
23f66ece59 Merge pull request #7254 from mxpv/go
Switch to Go 1.19
2022-08-10 12:12:49 -07:00
Ye Sijun
2dbff1dbca oci: skip checking gid for WithAppendAdditionalGroups
Signed-off-by: Ye Sijun <junnplus@gmail.com>
2022-08-06 00:57:19 +08:00
Maksym Pavlenko
ca3b9b50fe Run gofmt 1.19
Signed-off-by: Maksym Pavlenko <pavlenko.maksym@gmail.com>
2022-08-04 18:18:33 -07:00
Ye Sijun
1ab42be15d refactor: reduce duplicate code
Signed-off-by: Ye Sijun <junnplus@gmail.com>
2022-06-25 18:55:03 +08:00
Fu Wei
a860e6680e Merge pull request #7072 from junnplus/test-specopts 2022-06-22 15:12:00 +08:00
Ye Sijun
72b87ad004 add WithAdditionalGIDs test
Signed-off-by: Ye Sijun <junnplus@gmail.com>
2022-06-21 23:58:19 +08:00
Ye Sijun
5bf705255d add WithAppendAdditionalGroups helper
Signed-off-by: Ye Sijun <junnplus@gmail.com>
2022-06-21 23:21:04 +08:00
Samuel Karp
ad8e598060 oci: Remove empty mount option slice for FreeBSD
Mount options are marked `json:omitempty`. An empty slice in the default
object caused TestWithSpecFromFile to fail.

Signed-off-by: Samuel Karp <me@samuelkarp.com>
2022-06-09 18:54:10 -07:00
Samuel Karp
c15f0cdaf0 oci: FreeBSD devices may have major number 0
Signed-off-by: Samuel Karp <me@samuelkarp.com>
2022-06-09 18:54:09 -07:00
Gijs Peskens
ae22854e2b Linux containers on FreeBSD
This allows running Linux containers on FreeBSD and modifies the
mounts so that they represent the linux emulated filesystems, as per:
https://wiki.freebsd.org/LinuxJails

Co-authored-by: Gijs Peskens <gijs@peskens.net>, Samuel Karp <samuelkarp@users.noreply.github.com>
Signed-off-by: Artem Khramov <akhramov@pm.me>
2022-06-01 00:56:24 +02:00
Sebastiaan van Stijn
a3ac156007 oci: WithDefaultUnixDevices(): remove tun/tap from the default devices
A container should not have access to tun/tap device, unless it is explicitly
specified in configuration.

This device was already removed from docker's default, and runc's default;

- 2ce40b6ad7
- 9c4570a958

Per the commit message in runc, this should also fix these messages;

> Apr 26 03:46:56 foo.bar systemd[1]: Couldn't stat device /dev/char/10:200: No such file or directory

coming from systemd on every container start, when the systemd cgroup driver
is used, and the system runs an old (< v240) version of systemd
(the message was presumably eliminated by [1]).

[1]: d5aecba6e0

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2022-05-11 00:31:59 +02:00
Phil Estes
65627f568f Merge pull request #6809 from jterry75/main
Add ctr support for CPUMax and CPUShares
2022-04-29 16:39:52 +01:00
Mike Brown
6b35307594 Merge pull request #5490 from askervin/5Bu_blockio
Support for cgroups blockio
2022-04-29 10:07:56 -05:00
Antti Kervinen
10576c298e cri: support blockio class in pod and container annotations
This patch adds support for a container annotation and two separate
pod annotations for controlling the blockio class of containers.

The container annotation can be used by a CRI client:
  "io.kubernetes.cri.blockio-class"

Pod annotations specify the blockio class in the K8s pod spec level:
  "blockio.resources.beta.kubernetes.io/pod"
  (pod-wide default for all containers within)

  "blockio.resources.beta.kubernetes.io/container.<container_name>"
  (container-specific overrides)

Correspondingly, this patch adds support for --blockio-class and
--blockio-config-file to ctr, too.

This implementation follows the resource class annotation pattern
introduced in RDT and merged in commit 893701220.

Signed-off-by: Antti Kervinen <antti.kervinen@intel.com>
2022-04-29 11:44:09 +03:00
Justin Terry
227156dac6 Add ctr support for CPUMax and CPUShares
Adds CPU.Maximum and CPU.Shares support to the ctr
cmdline for testing

Signed-off-by: Justin Terry <jlterry@amazon.com>
2022-04-28 13:17:16 -07:00
Ed Bartosh
ff5c55847a move CDI calls to the linux-only code
Signed-off-by: Ed Bartosh <eduard.bartosh@intel.com>
2022-04-06 13:10:59 +03:00
Derek McGowan
551516a18d Merge pull request from GHSA-c9cp-9c75-9v8c
Fix the Inheritable capability defaults.
2022-03-23 10:50:56 -07:00
Henry Wang
b8bf504e94 Enable gosec linter for golangci-lint
`gosec` linter is able to identify issues described in #6584

e.g.

$ git revert 54e95e6b88
[gosec dfc8ca1ec] Revert "fix Implicit memory aliasing in for loop"
 2 files changed, 2 deletions(-)

$ make check
+ proto-fmt
+ check
GOGC=75 golangci-lint run
containerstore.go:192:54: G601: Implicit memory aliasing in for loop. (gosec)
		containers = append(containers, containerFromProto(&container))
		                                                   ^
image_store.go:132:42: G601: Implicit memory aliasing in for loop. (gosec)
		images = append(images, imageFromProto(&image))
		                                       ^
make: *** [check] Error 1

I also disabled following two settings which prevent the linter to show a complete list of issues.

* max-issues-per-linter (default 50)
* max-same-issues (default 3)

Furthermore enabling gosec revealed many other issues. For now I blacklisted the ones except G601.

Will create separate tasks to address them one by one moving next.

Signed-off-by: Henry Wang <henwang@amazon.com>
2022-03-14 22:50:54 +00:00
Paul "TBBle" Hampson
39d52118f5 Plumb CRI Devices through to OCI WindowsDevices
There's two mappings of hostpath to IDType and ID in the wild:
- dockershim and dockerd-cri (implicitly via docker) use class/ID
-- The only supported IDType in Docker is 'class'.
-- https://github.com/aarnaud/k8s-directx-device-plugin generates this form
- https://github.com/jterry75/cri (windows_port branch) uses IDType://ID
-- hcsshim's CRI test suite generates this form

`://` is much more easily distinguishable, so I've gone with that one as
the generic separator, with `class/` as a special-case.

Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
2022-03-12 08:16:43 +11:00
Justin Terry
de3d9993f5 Adds support for Windows ArgsEscaped images
Adds support for Windows container images built by Docker
that contain the ArgsEscaped boolean in the ImageConfig. This
is a non-OCI entry that tells the runtime that the Entrypoint
and/or Cmd are a single element array with the args pre-escaped
into a single CommandLine that should be passed directly to
Windows rather than passed as an args array which will be
additionally escaped.

Signed-off-by: Justin Terry <jlterry@amazon.com>
2022-03-01 13:40:44 -08:00
Andrew G. Morgan
6906b57c72 Fix the Inheritable capability defaults.
The Linux kernel never sets the Inheritable capability flag to
anything other than empty. Non-empty values are always exclusively
set by userspace code.

[The kernel stopped defaulting this set of capability values to the
 full set in 2000 after a privilege escalation with Capabilities
 affecting Sendmail and others.]

Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
2022-02-01 13:55:46 -08:00
Wei Fu
813a061fe1 oci: use readonly mount to read user/group info
In linux kernel, the umount writable-mountpoint will try to do sync-fs
to make sure that the dirty pages to the underlying filesystems. The many
number of umount actions in the same time maybe introduce performance
issue in IOPS limited disk.

When CRI-plugin creates container, it will temp-mount rootfs to read
that UID/GID info for entrypoint. Basically, the rootfs is writable
snapshotter and then after read, umount will invoke sync-fs action.

For example, using overlayfs on ext4 and use bcc-tools to monitor
ext4_sync_fs call.

```
// uname -a
Linux chaofan 5.13.0-27-generic #29~20.04.1-Ubuntu SMP Fri Jan 14 00:32:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

// open terminal 1
kubectl run --image=nginx --image-pull-policy=IfNotPresent nginx-pod

// open terminal 2
/usr/share/bcc/tools/stackcount ext4_sync_fs -i 1 -v -P

  ext4_sync_fs
  sync_filesystem
  ovl_sync_fs
  __sync_filesystem
  sync_filesystem
  generic_shutdown_super
  kill_anon_super
  deactivate_locked_super
  deactivate_super
  cleanup_mnt
  __cleanup_mnt
  task_work_run
  exit_to_user_mode_prepare
  syscall_exit_to_user_mode
  do_syscall_64
  entry_SYSCALL_64_after_hwframe
  syscall.Syscall.abi0
  github.com/containerd/containerd/mount.unmount
  github.com/containerd/containerd/mount.UnmountAll
  github.com/containerd/containerd/mount.WithTempMount.func2
  github.com/containerd/containerd/mount.WithTempMount
  github.com/containerd/containerd/oci.WithUserID.func1
  github.com/containerd/containerd/oci.WithUser.func1
  github.com/containerd/containerd/oci.ApplyOpts
  github.com/containerd/containerd.WithSpec.func1
  github.com/containerd/containerd.(*Client).NewContainer
  github.com/containerd/containerd/pkg/cri/server.(*criService).CreateContainer
  github.com/containerd/containerd/pkg/cri/server.(*instrumentedService).CreateContainer
  k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_CreateContainer_Handler.func1
  github.com/containerd/containerd/services/server.unaryNamespaceInterceptor
  github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1
  github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1
  github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1
  go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1
  github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1
  github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1
  k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_CreateContainer_Handler
  google.golang.org/grpc.(*Server).processUnaryRPC
  google.golang.org/grpc.(*Server).handleStream
  google.golang.org/grpc.(*Server).serveStreams.func1.2
  runtime.goexit.abi0
    containerd [34771]
    1
```

If there are comming several create requestes, umount actions might
bring high IO pressure on the /var/lib/containerd's underlying disk.

After checkout the kernel code[1], the kernel will not call
__sync_filesystem if the mount is readonly. Based on this, containerd
should use readonly mount to get UID/GID information.

Reference:

* [1] https://elixir.bootlin.com/linux/v5.13/source/fs/sync.c#L61

Closes: #4604

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2022-01-28 23:36:04 +08:00