With the introduction of Windows Server 2022, some images have been updated
to support WS2022 in their manifest list. This commit updates the test images
accordingly.
Signed-off-by: Adelina Tuvenie <atuvenie@cloudbasesolutions.com>
Go 1.15.7 contained a security fix for CVE-2021-3115, which allowed arbitrary
code to be executed at build time when using cgo on Windows. This issue also
affects Unix users who have “.” listed explicitly in their PATH and are running
“go get” outside of a module or with module mode disabled.
This issue is not limited to the go command itself, and can also affect binaries
that use `os.Command`, `os.LookPath`, etc.
From the related blogpost (ttps://blog.golang.org/path-security):
> Are your own programs affected?
>
> If you use exec.LookPath or exec.Command in your own programs, you only need to
> be concerned if you (or your users) run your program in a directory with untrusted
> contents. If so, then a subprocess could be started using an executable from dot
> instead of from a system directory. (Again, using an executable from dot happens
> always on Windows and only with uncommon PATH settings on Unix.)
>
> If you are concerned, then we’ve published the more restricted variant of os/exec
> as golang.org/x/sys/execabs. You can use it in your program by simply replacing
This patch replaces all uses of `os/exec` with `golang.org/x/sys/execabs`. While
some uses of `os/exec` should not be problematic (e.g. part of tests), it is
probably good to be consistent, in case code gets moved around.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
go1.16.7 (released 2021-08-05) includes a security fix to the net/http/httputil
package, as well as bug fixes to the compiler, the linker, the runtime, the go
command, and the net/http package. See the Go 1.16.7 milestone on the issue
tracker for details:
https://github.com/golang/go/issues?q=milestone%3AGo1.16.7+label%3ACherryPickApproved
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Looking at how this image is used, I think we don't even need the
source in the final image, so we can build containerd in a separate
stage, and copy the binaries.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Not critical for intermediate stages, but a minor optimization to
reduce the image cache. Ideally, this would use cache-mounts for this,
but those may not be supported by podman, so taking the traditional
approach.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Building critools only requires the install script and the critools-version
file (to determin the version to build). Moving it to a separate stage
prevents rebuilding it if unrelated changes are made in the code.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Building cni only requires the install script, and the go.mod (to determin
the version to install). Moving it to a separate stage prevents it from
being rebuilt if unrelated changes were made in the codebase.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This allows for easier copying artifacts from stages, by just copying
the directory content to the stage where it's used. These stages are
not used to be run individually so do not have to be "runnable".
Each stage is "responsible" for colllecting all aftifacts in the directory,
so that "consumer" stages do not have to be aware of what needs to be copied.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This makes the following changes:
- The containerd/config.toml, and docker-entrypoint.sh only occasionally change,
so copy them before copying the source code to allow them to be cached.
- The cri-in-userns stage does not need files from proto3, so do not copy them
- The dev environment does need the file from the proto3 stage, so copy them there.
- Change the order of stages. Our CI uses `podman build` which (I think) does not
skips stages that are not used for the specified target (like BuildKit does).
So I moved stages that are not used for the `cri-in-userns` after that stage.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The `cri-in-userns` stage is for testing "CRI-in-UserNS", which should be used in conjunction with "Kubelet-in-UserNS":
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2033-kubelet-in-userns-aka-rootless
This feature is mostly expected to be used for `kind` and `minikube`.
Requires Rootless Docker/Podman/nerdctl with cgroup v2 delegation: https://rootlesscontaine.rs/getting-started/common/cgroup2/
(Rootless Docker/Podman/nerdctl prepares the UserNS, so we do not need to create UserNS by ourselves)
Usage:
```
podman build --target cri-in-userns -t cri-in-userns -f contrib/Dockerfile.test .
podman run -it --rm --privileged cri-in-userns
```
The stage is tested on CI with Rootless Podman on Fedora 34 on Vagrant.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Note that this is the code in containerd that uses runc (as almost
a library). Please see the other commit for the update to runc binary
itself.
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
This moves the runc version to build to scripts/setup/runc-version,
which makes it easier for packagers to find the default version
to use.
The RUNC_VERSION environment variable can still be used to override
the version, which can be used (e.g.) to test against different versions
in our CI.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Now that the dependency on runc (libcontaienr) code has been reduced
considerably, it is probbaly ok to cut the version dependency between
libcontainer and the runc binary that is supported.
This patch separates the runc binary version from the version of
libcontainer that is defined in go.mod, and updates the documentation
accordingly.
The RUNC_COMMIT variable in the install-runc script is renamed to
RUNC_VERSION to encourage using tagged versions, and the Dockerfile
in contrib is updated to allow building with a custom version.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This function will be used by nerdctl for printing the default AppArmor
profile: `nerdctl system inspect apparmor-profile`
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
On newer kernels and systems, AppArmor will block sending signals in
many scenarios by default resulting in strange behaviours (container
programs cannot signal each other, or host processes like containerd
cannot signal containers).
The reason this happens only on some distributions (and is not a kernel
regression) is that the kernel doesn't enforce signal mediation unless
the profile contains signal rules. However because our profies #include
the distribution-managed <abstractions/base>, some distributions added
signal rules -- which results in AppArmor enforcing signal mediation and
thus a regression. On these systems, containers cannot send and receive
signals at all -- meaning they cannot signal each other and the
container runtime cannot kill them either.
This issue was fixed in Docker in 2018[1] but this code was copied
before then and thus the patches weren't carried. It also contains a new
fix for a more esoteric case[2]. Ideally this code should live in a
project like "containerd/apparmor" so that Docker, libpod, and
containerd can share it, but that's probably something to do separately.
In addition, the copyright header is updated to reference that the code
is copied from Docker (and thus was not written entirely by the
containerd authors).
[1]: https://github.com/docker/docker/pull/37831
[2]: https://github.com/docker/docker/pull/41337
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
These syscalls (some of which have been in Linux for a while but were
missing from the profile) fall into a few buckets:
* close_range(2), epoll_wait2(2) are just extensions of existing "safe
for everyone" syscalls.
* The mountv2 API syscalls (fs*(2), move_mount(2), open_tree(2)) are
all equivalent to aspects of mount(2) and thus go into the
CAP_SYS_ADMIN category.
* process_madvise(2) is similar to the other process_*(2) syscalls and
thus goes in the CAP_SYS_PTRACE category.
Co-authored-by: Aleksa Sarai <asarai@suse.de>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Adds the io-uring related system call introduced in kernel 5.1 to the
seccomp whitelist. With older kernels or older versions of libseccomp,
this configure will be omitted.
Note that io_uring will grow support for more syscalls in the future
so we should keep an eye on this.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
On a ppc64le host, running postgres (tried with 9.4 to 9.6) gives the following
warning when trying to flush data to disks (which happens very frequently):
WARNING: could not flush dirty data: Operation not permitted.
A quick dig in postgres source code indicate it uses sync_file_range(2) to
flush data; which on ppe64le and arm64 is translated to sync_file_range2(2)
for alignements reasons.
The profile did not allow sync_file_range2(2), making postgres sad because
it can not flush its buffers. arm_sync_file_range(2) is an ancient alias to
sync_file_range2(2), the syscall was renamed in Linux 2.6.22 when the same
syscall was added for PowerPC.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
From personality(2):
Have uname(2) report a 2.6.40+ version number rather than a 3.x version
number. Added as a stopgap measure to support broken applications that
could not handle the kernel version-numbering switch from 2.6.x to 3.x.
This allows both "UNAME26|PER_LINUX" and "UNAME26|PER_LINUX32".
Fixes: "setarch broken in docker packages from Debian stretch"
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Add the membarrier syscall to the default seccomp profile.
It is for example used in the implementation of dlopen() in
the musl libc of Alpine images.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Enabled adjtimex in the default profile without requiring CAP_SYS_TIME privilege.
The kernel will check CAP_SYS_TIME and won't allow setting the time.
Fixes: Getting the system time with ntptime returns an error in an unprivileged
container
To verify, inside a CentOS 7 container:
yum install -y ntp
ntptime
# ntp_gettime() returns code 0 (OK)
ntpdate -v time.nist.gov
# ntpdate[84]: Can't adjust the time of day: Operation not permitted
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This call is what is used to implement `dmesg` to get kernel messages
about the host. This can leak substantial information about the host.
It is normally available to unprivileged users on the host, unless
the sysctl `kernel.dmesg_restrict = 1` is set, but this is not set
by standard on the majority of distributions. Blocking this to restrict
leaks about the configuration seems correct.
Relates to docker/docker#37897 "docker exposes dmesg to containers by default"
See also https://googleprojectzero.blogspot.com/2018/09/a-cache-invalidation-bug-in-linux.html
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Expose environment variables in the GCE containerd configuration
script for configuring an additional runtime handler. This unblocks
E2E testing of custom runtime handlers.
Signed-off-by: Tim Allclair <tallclair@google.com>
Format the cni.template, use `space` instead of some `tab`. Avoid indent issue in text editor.
Signed-off-by: bingshen.wbs <bingshen.wbs@alibaba-inc.com>
Restartable Sequences (rseq) are a kernel-based mechanism for fast
update operations on per-core data in user-space. Some libraries, like
the newest version of Google's TCMalloc, depend on it [1].
This also makes dockers default seccomp profile on par with systemd's,
which enabled 'rseq' in early 2019 [2].
1: https://google.github.io/tcmalloc/design.html
2: systemd/systemd@6fee3be
Signed-off-by: Florian Schmaus <flo@geekplace.eu>