Follow up to 94faa70df4. The commit referenced allowed `ptrace` calls in the default seccomp profile following the usual tracing security checks in for Kernels newer than 4.8. Kernels prior to this version are susceptible to [CVE-2019-2054](https://github.com/advisories/GHSA-qgfr-27qf-f323). Moby's default had allowed for `ptrace` for kernels newer than 4.8 at the time the commit was created. The current [seccomp default](https://github.com/moby/moby/blob/master/profiles/seccomp/default_linux.go#L405-L417) has been updated to include `process_vm_read` and `process_vm_write`. Mirror that policy to complete the classic ptrace set of APIs.
Signed-off-by: Juan Hoyos <juan.s.hoyos@outlook.com>
Add pkey_alloc(2), pkey_free(2) and pkey_mprotect(2) in seccomp default profile.
pkey_alloc(2), pkey_free(2) and pkey_mprotect(2) can only configure
the calling process's own memory, so they are existing "safe for everyone" syscalls.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Allow the following syscalls by default:
- `landlock_add_rule`
- `landlock_create_ruleset`
- `landlock_restrict_self`
See https://landlock.io/
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
This system call is only available on 32- and 64-bit PowerPC, it is used
by modern programming language implementations to implement coroutine
features through userspace context switches.
moby [1] and systemd nspawn [2] already whitelist this system call so it
makes sense to whitelist it in containerd as well.
[1]: https://github.com/moby/moby/pull/43092
[2]: https://github.com/systemd/systemd/pull/9487
Signed-off-by: Sören Tempel <soeren+git@soeren-tempel.net>
clone3 is explicitly requested to give ENOSYS instead of the default EPERM, when CAP_SYS_ADMIN is unset.
See moby/moby PR 42681 (thanks to berrange).
Without this commit, rawhide image does not work:
```console
$ sudo ctr run --rm --net-host --seccomp registry.fedoraproject.org/fedora:rawhide foo /usr/bin/curl google.com
curl: (6) getaddrinfo() thread failed to start
```
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
These syscalls (some of which have been in Linux for a while but were
missing from the profile) fall into a few buckets:
* close_range(2), epoll_wait2(2) are just extensions of existing "safe
for everyone" syscalls.
* The mountv2 API syscalls (fs*(2), move_mount(2), open_tree(2)) are
all equivalent to aspects of mount(2) and thus go into the
CAP_SYS_ADMIN category.
* process_madvise(2) is similar to the other process_*(2) syscalls and
thus goes in the CAP_SYS_PTRACE category.
Co-authored-by: Aleksa Sarai <asarai@suse.de>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Adds the io-uring related system call introduced in kernel 5.1 to the
seccomp whitelist. With older kernels or older versions of libseccomp,
this configure will be omitted.
Note that io_uring will grow support for more syscalls in the future
so we should keep an eye on this.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
On a ppc64le host, running postgres (tried with 9.4 to 9.6) gives the following
warning when trying to flush data to disks (which happens very frequently):
WARNING: could not flush dirty data: Operation not permitted.
A quick dig in postgres source code indicate it uses sync_file_range(2) to
flush data; which on ppe64le and arm64 is translated to sync_file_range2(2)
for alignements reasons.
The profile did not allow sync_file_range2(2), making postgres sad because
it can not flush its buffers. arm_sync_file_range(2) is an ancient alias to
sync_file_range2(2), the syscall was renamed in Linux 2.6.22 when the same
syscall was added for PowerPC.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
From personality(2):
Have uname(2) report a 2.6.40+ version number rather than a 3.x version
number. Added as a stopgap measure to support broken applications that
could not handle the kernel version-numbering switch from 2.6.x to 3.x.
This allows both "UNAME26|PER_LINUX" and "UNAME26|PER_LINUX32".
Fixes: "setarch broken in docker packages from Debian stretch"
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Add the membarrier syscall to the default seccomp profile.
It is for example used in the implementation of dlopen() in
the musl libc of Alpine images.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Enabled adjtimex in the default profile without requiring CAP_SYS_TIME privilege.
The kernel will check CAP_SYS_TIME and won't allow setting the time.
Fixes: Getting the system time with ntptime returns an error in an unprivileged
container
To verify, inside a CentOS 7 container:
yum install -y ntp
ntptime
# ntp_gettime() returns code 0 (OK)
ntpdate -v time.nist.gov
# ntpdate[84]: Can't adjust the time of day: Operation not permitted
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This call is what is used to implement `dmesg` to get kernel messages
about the host. This can leak substantial information about the host.
It is normally available to unprivileged users on the host, unless
the sysctl `kernel.dmesg_restrict = 1` is set, but this is not set
by standard on the majority of distributions. Blocking this to restrict
leaks about the configuration seems correct.
Relates to docker/docker#37897 "docker exposes dmesg to containers by default"
See also https://googleprojectzero.blogspot.com/2018/09/a-cache-invalidation-bug-in-linux.html
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Restartable Sequences (rseq) are a kernel-based mechanism for fast
update operations on per-core data in user-space. Some libraries, like
the newest version of Google's TCMalloc, depend on it [1].
This also makes dockers default seccomp profile on par with systemd's,
which enabled 'rseq' in early 2019 [2].
1: https://google.github.io/tcmalloc/design.html
2: systemd/systemd@6fee3be
Signed-off-by: Florian Schmaus <flo@geekplace.eu>
Relates to https://patchwork.kernel.org/patch/10756415/
Added to whitelist:
- `clock_getres_time64` (equivalent of `clock_getres`, which was whitelisted)
- `clock_gettime64` (equivalent of `clock_gettime`, which was whitelisted)
- `clock_nanosleep_time64` (equivalent of `clock_nanosleep`, which was whitelisted)
- `futex_time64` (equivalent of `futex`, which was whitelisted)
- `io_pgetevents_time64` (equivalent of `io_pgetevents`, which was whitelisted)
- `mq_timedreceive_time64` (equivalent of `mq_timedreceive`, which was whitelisted)
- `mq_timedsend_time64 ` (equivalent of `mq_timedsend`, which was whitelisted)
- `ppoll_time64` (equivalent of `ppoll`, which was whitelisted)
- `pselect6_time64` (equivalent of `pselect6`, which was whitelisted)
- `recvmmsg_time64` (equivalent of `recvmmsg`, which was whitelisted)
- `rt_sigtimedwait_time64` (equivalent of `rt_sigtimedwait`, which was whitelisted)
- `sched_rr_get_interval_time64` (equivalent of `sched_rr_get_interval`, which was whitelisted)
- `semtimedop_time64` (equivalent of `semtimedop`, which was whitelisted)
- `timer_gettime64` (equivalent of `timer_gettime`, which was whitelisted)
- `timer_settime64` (equivalent of `timer_settime`, which was whitelisted)
- `timerfd_gettime64` (equivalent of `timerfd_gettime`, which was whitelisted)
- `timerfd_settime64` (equivalent of `timerfd_settime`, which was whitelisted)
- `utimensat_time64` (equivalent of `utimensat`, which was whitelisted)
Not added to whitelist:
- `clock_adjtime64` (equivalent of `clock_adjtime`, which was not whitelisted)
- `clock_settime64` (equivalent of `clock_settime`, which was not whitelisted)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This whitelists the statx syscall; libseccomp-2.3.3 or up
is needed for this, older seccomp versions will ignore this.
Equivalent of https://github.com/moby/moby/pull/36417
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
io_pgetevents() is a new Linux system call, similar to the already-whitelisted
io_getevents(). It has no security implications. Whitelist it so applications can
use the new system call.
Fixes#3105.
Signed-off-by: Avi Kivity <avi@scylladb.com>
No capabilities can be granted outside the bounding set, so there
is no point looking at any other set for the largest scope.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>