Commit Graph

33 Commits

Author SHA1 Message Date
Sören Tempel
adee2c7974 seccomp: add support for "swapcontext" syscall in default policy
This system call is only available on 32- and 64-bit PowerPC, it is used
by modern programming language implementations to implement coroutine
features through userspace context switches.

moby [1] and systemd nspawn [2] already whitelist this system call so it
makes sense to whitelist it in containerd as well.

[1]: https://github.com/moby/moby/pull/43092
[2]: https://github.com/systemd/systemd/pull/9487

Signed-off-by: Sören Tempel <soeren+git@soeren-tempel.net>
2022-01-07 18:05:59 +01:00
Akihiro Suda
55923daa9f
seccomp: support "clone3" (return ENOSYS unless SYS_ADMIN is granted)
clone3 is explicitly requested to give ENOSYS instead of the default EPERM, when CAP_SYS_ADMIN is unset.
See moby/moby PR 42681 (thanks to berrange).

Without this commit, rawhide image does not work:
```console
$ sudo ctr run --rm --net-host --seccomp registry.fedoraproject.org/fedora:rawhide foo /usr/bin/curl google.com
curl: (6) getaddrinfo() thread failed to start
```

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2021-09-15 14:44:45 +09:00
Akihiro Suda
d3aa7ee9f0
Run go fmt with Go 1.17
The new `go fmt` adds `//go:build` lines (https://golang.org/doc/go1.17#tools).

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2021-08-22 09:31:50 +09:00
Sebastiaan van Stijn
e1445dff12
profiles: seccomp: update to Linux 5.11 syscall list
These syscalls (some of which have been in Linux for a while but were
missing from the profile) fall into a few buckets:

 * close_range(2), epoll_wait2(2) are just extensions of existing "safe
   for everyone" syscalls.

 * The mountv2 API syscalls (fs*(2), move_mount(2), open_tree(2)) are
   all equivalent to aspects of mount(2) and thus go into the
   CAP_SYS_ADMIN category.

 * process_madvise(2) is similar to the other process_*(2) syscalls and
   thus goes in the CAP_SYS_PTRACE category.

Co-authored-by: Aleksa Sarai <asarai@suse.de>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-01-21 21:30:25 +01:00
Sebastiaan van Stijn
0a1104bcf3
seccomp: add pidfd_getfd syscall (gated by CAP_SYS_PTRACE)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-11-13 13:36:33 +01:00
Sebastiaan van Stijn
2dbbd10fd6
seccomp: add pidfd_open and pidfd_send_signal
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-11-13 13:36:25 +01:00
Michael Crosby
f9d231f660
Merge pull request #4493 from thaJeztah/seccomp_uring
seccomp: allow io-uring related system calls
2020-08-25 11:39:45 -04:00
Michael Crosby
396b863138
Merge pull request #4491 from thaJeztah/seccomp_syslog
seccomp: move the syslog syscall to be gated by CAP_SYS_ADMIN or CAP_SYSLOG
2020-08-25 11:35:28 -04:00
Sebastiaan van Stijn
325bac7c71
seccomp: allow io-uring related system calls
Adds the io-uring related system call introduced in kernel 5.1 to the
seccomp whitelist. With older kernels or older versions of libseccomp,
this configure will be omitted.

Note that io_uring will grow support for more syscalls in the future
so we should keep an eye on this.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-08-24 12:59:53 +02:00
Sebastiaan van Stijn
0a5ee7e6f3
seccomp: allow clock_settime when CAP_SYS_TIME is added
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-08-24 12:43:21 +02:00
Sebastiaan van Stijn
5cdb6e81d2
seccomp: allow quotactl with CAP_SYS_ADMIN
This allows the quotactl syscall in the default seccomp profile, gated by
CAP_SYS_ADMIN.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-08-24 12:40:43 +02:00
Sebastiaan van Stijn
5862285fac
seccomp: allow sync_file_range2 on supported architectures.
On a ppc64le host, running postgres (tried with 9.4 to 9.6) gives the following
warning when trying to flush data to disks (which happens very frequently):

     WARNING: could not flush dirty data: Operation not permitted.

A quick dig in postgres source code indicate it uses sync_file_range(2) to
flush data; which on ppe64le and arm64 is translated to sync_file_range2(2)
for alignements reasons.

The profile did not allow sync_file_range2(2), making postgres sad because
it can not flush its buffers. arm_sync_file_range(2) is an ancient alias to
sync_file_range2(2), the syscall was renamed in Linux 2.6.22 when the same
syscall was added for PowerPC.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-08-24 12:36:53 +02:00
Sebastiaan van Stijn
117d678749
seccomp: allow personality with UNAME26 bit set
From personality(2):

    Have uname(2) report a 2.6.40+ version number rather than a 3.x version
    number.  Added as a stopgap measure to support broken applications that
    could not handle the  kernel  version-numbering  switch  from 2.6.x to 3.x.

This allows both "UNAME26|PER_LINUX" and "UNAME26|PER_LINUX32".

Fixes: "setarch broken in docker packages from Debian stretch"

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-08-24 12:27:14 +02:00
Sebastiaan van Stijn
fc9e5d161a
seccomp: allow syscall membarrier
Add the membarrier syscall to the default seccomp profile.
It is for example used in the implementation of dlopen() in
the musl libc of Alpine images.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-08-24 12:16:26 +02:00
Sebastiaan van Stijn
1746a195e9
seccomp: allow adjtimex get time operation
Enabled adjtimex in the default profile without requiring CAP_SYS_TIME privilege.
The kernel will check CAP_SYS_TIME and won't allow setting the time.

Fixes: Getting the system time with ntptime returns an error in an unprivileged
container

To verify, inside a CentOS 7 container:

    yum install -y ntp
    ntptime
    # ntp_gettime() returns code 0 (OK)

    ntpdate -v time.nist.gov
    # ntpdate[84]: Can't adjust the time of day: Operation not permitted

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-08-24 12:16:23 +02:00
Sebastiaan van Stijn
7e7545e556
seccomp: allow add preadv2 and pwritev2 syscalls
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-08-24 12:16:21 +02:00
Sebastiaan van Stijn
267a0cf68e
seccomp: move the syslog syscall to be gated by CAP_SYS_ADMIN or CAP_SYSLOG
This call is what is used to implement `dmesg` to get kernel messages
about the host. This can leak substantial information about the host.
It is normally available to unprivileged users on the host, unless
the sysctl `kernel.dmesg_restrict = 1` is set, but this is not set
by standard on the majority of distributions. Blocking this to restrict
leaks about the configuration seems correct.

Relates to docker/docker#37897 "docker exposes dmesg to containers by default"

See also https://googleprojectzero.blogspot.com/2018/09/a-cache-invalidation-bug-in-linux.html

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-08-24 11:57:48 +02:00
Jintao Zhang
6a915a1453 seccomp: add faccessat2 syscall.
related to https://patchwork.kernel.org/patch/11545287/

Signed-off-by: Jintao Zhang <zhangjintao9020@gmail.com>
2020-08-17 21:48:21 +08:00
Jintao Zhang
e28e55f455 seccomp: add openat2 syscall.
related to https://patchwork.kernel.org/patch/11167585/

Signed-off-by: Jintao Zhang <zhangjintao9020@gmail.com>
2020-08-16 16:28:21 +08:00
Florian Schmaus
e977564a8b seccomp: allow 'rseq' syscall in default seccomp profile
Restartable Sequences (rseq) are a kernel-based mechanism for fast
update operations on per-core data in user-space. Some libraries, like
the newest version of Google's TCMalloc, depend on it [1].

This also makes dockers default seccomp profile on par with systemd's,
which enabled 'rseq' in early 2019 [2].

1: https://google.github.io/tcmalloc/design.html
2: systemd/systemd@6fee3be

Signed-off-by: Florian Schmaus <flo@geekplace.eu>
2020-06-26 17:10:05 +02:00
Michael Crosby
0f831093ce Update usage of whitelist in project
Signed-off-by: Michael Crosby <michael@thepasture.io>
2020-06-08 12:49:22 -05:00
Kenta Tada
03755821d2 seccomp: remove the unused query_module(2)
query_module(2) is only in kernels before Linux 2.6.

Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
2020-05-19 10:36:55 +09:00
Stanislav Levin
5765991f2c
seccomp: Whitelist clock_adjtime
This only allows making the syscall. CAP_SYS_TIME is still required
for time adjustment (enforced by the kernel):

```
kernel/time/posix-timers.c:

1112 SYSCALL_DEFINE2(clock_adjtime, const clockid_t, which_clock,
1113                 struct __kernel_timex __user *, utx)
...
1121         err = do_clock_adjtime(which_clock, &ktx);

1100 int do_clock_adjtime(const clockid_t which_clock, struct __kernel_timex * ktx)
1101 {
...
1109         return kc->clock_adj(which_clock, ktx);

1299 static const struct k_clock clock_realtime = {
...
1304         .clock_adj              = posix_clock_realtime_adj,

188 static int posix_clock_realtime_adj(const clockid_t which_clock,
189                                     struct __kernel_timex *t)
190 {
191         return do_adjtimex(t);

kernel/time/timekeeping.c:

2312 int do_adjtimex(struct __kernel_timex *txc)
2313 {
...
2321         /* Validate the data before disabling interrupts */
2322         ret = timekeeping_validate_timex(txc);

2246 static int timekeeping_validate_timex(const struct __kernel_timex *txc)
2247 {
2248         if (txc->modes & ADJ_ADJTIME) {
...
2252                 if (!(txc->modes & ADJ_OFFSET_READONLY) &&
2253                     !capable(CAP_SYS_TIME))
2254                         return -EPERM;
2255         } else {
2256                 /* In order to modify anything, you gotta be super-user! */
2257                 if (txc->modes && !capable(CAP_SYS_TIME))
2258                         return -EPERM;

```

Fixes: moby/moby 40919
Signed-off-by: Stanislav Levin <slev@altlinux.org>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-05-17 23:11:04 +02:00
Sebastiaan van Stijn
9529c69b8a
seccomp: add 64-bit time_t syscalls
Relates to https://patchwork.kernel.org/patch/10756415/

Added to whitelist:

- `clock_getres_time64` (equivalent of `clock_getres`, which was whitelisted)
- `clock_gettime64` (equivalent of `clock_gettime`, which was whitelisted)
- `clock_nanosleep_time64` (equivalent of `clock_nanosleep`, which was whitelisted)
- `futex_time64` (equivalent of `futex`, which was whitelisted)
- `io_pgetevents_time64` (equivalent of `io_pgetevents`, which was whitelisted)
- `mq_timedreceive_time64` (equivalent of `mq_timedreceive`, which was whitelisted)
- `mq_timedsend_time64 ` (equivalent of `mq_timedsend`, which was whitelisted)
- `ppoll_time64` (equivalent of `ppoll`, which was whitelisted)
- `pselect6_time64` (equivalent of `pselect6`, which was whitelisted)
- `recvmmsg_time64` (equivalent of `recvmmsg`, which was whitelisted)
- `rt_sigtimedwait_time64` (equivalent of `rt_sigtimedwait`, which was whitelisted)
- `sched_rr_get_interval_time64` (equivalent of `sched_rr_get_interval`, which was whitelisted)
- `semtimedop_time64` (equivalent of `semtimedop`, which was whitelisted)
- `timer_gettime64` (equivalent of `timer_gettime`, which was whitelisted)
- `timer_settime64` (equivalent of `timer_settime`, which was whitelisted)
- `timerfd_gettime64` (equivalent of `timerfd_gettime`, which was whitelisted)
- `timerfd_settime64` (equivalent of `timerfd_settime`, which was whitelisted)
- `utimensat_time64` (equivalent of `utimensat`, which was whitelisted)

Not added to whitelist:

- `clock_adjtime64` (equivalent of `clock_adjtime`, which was not whitelisted)
- `clock_settime64` (equivalent of `clock_settime`, which was not whitelisted)

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-03-25 14:07:38 +01:00
Michael Crosby
86f8be86e1 Add sigprocmask to default profile
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2019-08-29 11:07:03 -04:00
Kenta Tada
5b9a43d2e7 Fix seccomp contributed profile for clone syscall
All clone flags for namespace should be denied.
Also x/sys should be used instead of syscall.

Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
2019-06-03 14:23:34 +09:00
Sebastiaan van Stijn
8f8fd3c3a8
seccomp: whitelist statx syscall
This whitelists the statx syscall; libseccomp-2.3.3 or up
is needed for this, older seccomp versions will ignore this.

Equivalent of https://github.com/moby/moby/pull/36417

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-03-20 11:59:02 +01:00
Avi Kivity
4506eb45bf seccomp: whitelist io_pgetevents
io_pgetevents() is a new Linux system call, similar to the already-whitelisted
io_getevents(). It has no security implications. Whitelist it so applications can
use the new system call.

Fixes #3105.

Signed-off-by: Avi Kivity <avi@scylladb.com>
2019-03-19 11:56:32 +02:00
Justin Cormack
9435aeeb30
The set of bounding capabilities is the largest group
No capabilities can be granted outside the bounding set, so there
is no point looking at any other set for the largest scope.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2018-03-28 17:36:46 -07:00
Kunal Kushwaha
b12c3215a0 Licence header added
Signed-off-by: Kunal Kushwaha <kushwaha_kunal_v7@lab.ntt.co.jp>
2018-02-19 10:32:26 +09:00
Justin Cormack
35be3d5127 Remove a really confusing fallthrough
This is so confusing, and not needed.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2018-02-08 16:22:29 +00:00
Mike Brown
120bb4cd47 fixes missing default permission
Signed-off-by: Mike Brown <brownwm@us.ibm.com>
2017-09-20 13:15:39 -05:00
Mike Brown
426650f21b adds seccomp helpers
Signed-off-by: Mike Brown <brownwm@us.ibm.com>
2017-09-13 13:11:30 -05:00