Note that this is the code in containerd that uses runc (as almost
a library). Please see the other commit for the update to runc binary
itself.
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
This moves the runc version to build to scripts/setup/runc-version,
which makes it easier for packagers to find the default version
to use.
The RUNC_VERSION environment variable can still be used to override
the version, which can be used (e.g.) to test against different versions
in our CI.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Now that the dependency on runc (libcontaienr) code has been reduced
considerably, it is probbaly ok to cut the version dependency between
libcontainer and the runc binary that is supported.
This patch separates the runc binary version from the version of
libcontainer that is defined in go.mod, and updates the documentation
accordingly.
The RUNC_COMMIT variable in the install-runc script is renamed to
RUNC_VERSION to encourage using tagged versions, and the Dockerfile
in contrib is updated to allow building with a custom version.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This function will be used by nerdctl for printing the default AppArmor
profile: `nerdctl system inspect apparmor-profile`
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
On newer kernels and systems, AppArmor will block sending signals in
many scenarios by default resulting in strange behaviours (container
programs cannot signal each other, or host processes like containerd
cannot signal containers).
The reason this happens only on some distributions (and is not a kernel
regression) is that the kernel doesn't enforce signal mediation unless
the profile contains signal rules. However because our profies #include
the distribution-managed <abstractions/base>, some distributions added
signal rules -- which results in AppArmor enforcing signal mediation and
thus a regression. On these systems, containers cannot send and receive
signals at all -- meaning they cannot signal each other and the
container runtime cannot kill them either.
This issue was fixed in Docker in 2018[1] but this code was copied
before then and thus the patches weren't carried. It also contains a new
fix for a more esoteric case[2]. Ideally this code should live in a
project like "containerd/apparmor" so that Docker, libpod, and
containerd can share it, but that's probably something to do separately.
In addition, the copyright header is updated to reference that the code
is copied from Docker (and thus was not written entirely by the
containerd authors).
[1]: https://github.com/docker/docker/pull/37831
[2]: https://github.com/docker/docker/pull/41337
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
These syscalls (some of which have been in Linux for a while but were
missing from the profile) fall into a few buckets:
* close_range(2), epoll_wait2(2) are just extensions of existing "safe
for everyone" syscalls.
* The mountv2 API syscalls (fs*(2), move_mount(2), open_tree(2)) are
all equivalent to aspects of mount(2) and thus go into the
CAP_SYS_ADMIN category.
* process_madvise(2) is similar to the other process_*(2) syscalls and
thus goes in the CAP_SYS_PTRACE category.
Co-authored-by: Aleksa Sarai <asarai@suse.de>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Adds the io-uring related system call introduced in kernel 5.1 to the
seccomp whitelist. With older kernels or older versions of libseccomp,
this configure will be omitted.
Note that io_uring will grow support for more syscalls in the future
so we should keep an eye on this.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
On a ppc64le host, running postgres (tried with 9.4 to 9.6) gives the following
warning when trying to flush data to disks (which happens very frequently):
WARNING: could not flush dirty data: Operation not permitted.
A quick dig in postgres source code indicate it uses sync_file_range(2) to
flush data; which on ppe64le and arm64 is translated to sync_file_range2(2)
for alignements reasons.
The profile did not allow sync_file_range2(2), making postgres sad because
it can not flush its buffers. arm_sync_file_range(2) is an ancient alias to
sync_file_range2(2), the syscall was renamed in Linux 2.6.22 when the same
syscall was added for PowerPC.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
From personality(2):
Have uname(2) report a 2.6.40+ version number rather than a 3.x version
number. Added as a stopgap measure to support broken applications that
could not handle the kernel version-numbering switch from 2.6.x to 3.x.
This allows both "UNAME26|PER_LINUX" and "UNAME26|PER_LINUX32".
Fixes: "setarch broken in docker packages from Debian stretch"
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Add the membarrier syscall to the default seccomp profile.
It is for example used in the implementation of dlopen() in
the musl libc of Alpine images.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Enabled adjtimex in the default profile without requiring CAP_SYS_TIME privilege.
The kernel will check CAP_SYS_TIME and won't allow setting the time.
Fixes: Getting the system time with ntptime returns an error in an unprivileged
container
To verify, inside a CentOS 7 container:
yum install -y ntp
ntptime
# ntp_gettime() returns code 0 (OK)
ntpdate -v time.nist.gov
# ntpdate[84]: Can't adjust the time of day: Operation not permitted
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This call is what is used to implement `dmesg` to get kernel messages
about the host. This can leak substantial information about the host.
It is normally available to unprivileged users on the host, unless
the sysctl `kernel.dmesg_restrict = 1` is set, but this is not set
by standard on the majority of distributions. Blocking this to restrict
leaks about the configuration seems correct.
Relates to docker/docker#37897 "docker exposes dmesg to containers by default"
See also https://googleprojectzero.blogspot.com/2018/09/a-cache-invalidation-bug-in-linux.html
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>