Commit Graph

55 Commits

Author SHA1 Message Date
Phil Estes
a2d0ddc88e
Merge pull request #9684 from AkihiroSuda/seccomp-6.7
seccomp: kernel 6.7
2024-01-25 19:07:42 +00:00
Akihiro Suda
eb8981f352
mv contrib/seccomp/kernelversion pkg/kernelversion
The package isn't really relevant to seccomp

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2024-01-24 19:03:53 +09:00
Akihiro Suda
a6e52c74fa
seccomp: kernel 6.7
The following syscalls were added since kernel v5.16:
- v5.17 (libseccomp v2.5.4): set_mempolicy_home_node
- v6.5  (libseccomp v2.5.5): cachestat
- v6.6  (libseccomp v2.5.5): fchmodat2, map_shadow_stack
- v6.7  (libseccomp v2.5.5): futex_wake, futex_wait, futex_requeue

[Not covered in this commit]
- v6.8-rc1: statmount, listmount, lsm_get_self_attr, lsm_set_self_attr, lsm_list_modules

ref:
- `syscalls: update the syscall list for Linux v5.17` (libseccomp v2.5.4)
   d83cb7ac25
- `all: update the syscall table for Linux v6.7-rc3`  (libseccomp v2.5.5)
   53267af3fb

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2024-01-24 18:35:41 +09:00
Vinayak Goyal
a48ddf4a20
Don't allow io_uring related syscalls in the RuntimeDefault seccomp profile.
Signed-off-by: Vinayak Goyal <vinaygo@google.com>
2023-11-02 01:23:58 +00:00
Derek McGowan
5fdf55e493
Update go module to github.com/containerd/containerd/v2
Signed-off-by: Derek McGowan <derek@mcg.dev>
2023-10-29 20:52:21 -07:00
Bjorn Neergaard
9a202e342b
seccomp: always allow name_to_handle_at
This syscall is used by systemd to request unique internal names for
paths in the cgroup hierarchy from the kernel, and is overall innocuous.

Due to [previous][1] [mistakes][2] in moby/moby, it ended up attached to
`CAP_SYS_ADMIN`; however, it should not be filtered at all.

An in-depth analysis is available [at moby/moby][3].

  [1]: a01c4dc8f8 (diff-6c0d906dbef148d2060ed71a7461907e5601fea78866e4183835c60e5d2ff01aR1627-R1639)
  [2]: c1ca124682
  [3]: https://github.com/moby/moby/pull/45766#pullrequestreview-1493908145

Co-authored-by: Vitor Anjos <bartier@users.noreply.github.com>
Signed-off-by: Bjorn Neergaard <bjorn.neergaard@docker.com>
2023-06-28 05:50:24 -06:00
Wei Fu
6b7e237fc7
chore: use go fix to cleanup old +build buildtag
Signed-off-by: Wei Fu <fuweid89@gmail.com>
2022-12-29 14:25:14 +08:00
Craig Ingram
afa19a0a78
Fix process_vm_* syscall names in seccomp
Signed-off-by: Craig Ingram <cjingram@google.com>
2022-12-02 15:27:10 -05:00
Juan Hoyos
e224f77eb7
Add process_vm read and write calls to default seccomp profile
Follow-up to 94faa70df4. That commit allowed `ptrace` calls in the default seccomp profile, subject to the usual tracing security checks, for kernels newer than 4.8. Kernels prior to this version are susceptible to [CVE-2019-2054](https://github.com/advisories/GHSA-qgfr-27qf-f323). Moby's default already allowed `ptrace` for kernels newer than 4.8 at the time that commit was created. The current [seccomp default](https://github.com/moby/moby/blob/master/profiles/seccomp/default_linux.go#L405-L417) has been updated to include `process_vm_readv` and `process_vm_writev`. Mirror that policy to complete the classic ptrace set of APIs.

Signed-off-by: Juan Hoyos <juan.s.hoyos@outlook.com>
2022-11-18 10:51:45 -05:00
Zhuchen Wang
17a9324035
Update the default seccomp to block socket calls to AF_VSOCK
Signed-off-by: Zhuchen Wang <zcwang@google.com>
2022-10-11 15:30:39 -07:00
Henry Wang
43907515b4
adding support of CAP_BPF and CAP_PERFMON
Signed-off-by: Henry Wang <henwang@amazon.com>
2022-08-17 19:59:09 +00:00
Derek McGowan
e95858f93f
Merge pull request #7163 from thaJeztah/seccomp_support_pku
seccomp: add syscalls related to PKU in default policy
2022-07-18 15:19:10 -07:00
Sebastiaan van Stijn
bbb8d34704
seccomp: add get_mempolicy, mbind, set_mempolicy, with CAP_SYS_NICE
This aligns the profile with docker's profile, which added this in
47dfff68e4

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2022-07-14 10:39:55 +02:00
Sebastiaan van Stijn
19e8479837
seccomp: add syscalls related to PKU in default policy
Add pkey_alloc(2), pkey_free(2) and pkey_mprotect(2) to the default seccomp profile.
pkey_alloc(2), pkey_free(2) and pkey_mprotect(2) can only configure
the calling process's own memory, so they belong with the existing "safe for everyone" syscalls.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2022-07-13 23:03:35 +02:00
Akihiro Suda
575095fcd6
seccomp: allow clock_settime64 when CAP_SYS_TIME is added
Port moby/moby PR 43775

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2022-07-11 10:04:13 +09:00
Akihiro Suda
4b412b8003
seccomp: support riscv64
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2022-05-01 02:52:55 +09:00
Henry Wang
94faa70df4
allow ptrace(2) by default for kernel >= 4.8
Signed-off-by: Henry Wang <henwang@amazon.com>
2022-04-18 20:45:29 +00:00
Akihiro Suda
34f7173491
seccomp: kernel 5.16 (futex_waitv)
Allow `futex_waitv` by default.
See https://www.phoronix.com/scan.php?page=news_item&px=FUTEX2-futex-waiv-More-Archs

Note: libseccomp does not cover kernel 5.16 at this moment:
51b50f95e1/src/syscalls.csv

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2022-02-01 09:08:06 +09:00
Akihiro Suda
8632bdcb7b
seccomp: kernel 5.15 (process_mrelease)
Allow `process_mrelease` by default.

See https://lwn.net/Articles/864184/

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2022-02-01 09:08:05 +09:00
Akihiro Suda
c013db6965
seccomp: kernel 5.14 (quotactl_fd, memfd_secret)
- Allow `quotactl_fd` when `CAP_SYS_ADMIN` is granted.
  See https://lwn.net/Articles/859679/

- Allow `memfd_secret` by default.
  See https://lwn.net/Articles/865256/

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2022-02-01 09:08:01 +09:00
Akihiro Suda
17a2831f70
seccomp: kernel 5.13 (landlock_{add_rule,create_ruleset,restrict_self})
Allow the following syscalls by default:
- `landlock_add_rule`
- `landlock_create_ruleset`
- `landlock_restrict_self`

See https://landlock.io/

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2022-02-01 09:07:33 +09:00
Akihiro Suda
1329ea3716
seccomp: kernel 5.12 (mount_setattr)
Allow `mount_setattr` when `CAP_SYS_ADMIN` is granted.

See https://man7.org/linux/man-pages/man2/mount_setattr.2.html

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2022-02-01 09:06:41 +09:00
Sören Tempel
adee2c7974
seccomp: add support for "swapcontext" syscall in default policy
This system call is only available on 32- and 64-bit PowerPC; it is used
by modern programming language implementations to implement coroutine
features through userspace context switches.

moby [1] and systemd nspawn [2] already whitelist this system call so it
makes sense to whitelist it in containerd as well.

[1]: https://github.com/moby/moby/pull/43092
[2]: https://github.com/systemd/systemd/pull/9487

Signed-off-by: Sören Tempel <soeren+git@soeren-tempel.net>
2022-01-07 18:05:59 +01:00
Akihiro Suda
55923daa9f
seccomp: support "clone3" (return ENOSYS unless SYS_ADMIN is granted)
clone3 is explicitly requested to give ENOSYS instead of the default EPERM, when CAP_SYS_ADMIN is unset.
See moby/moby PR 42681 (thanks to berrange).

Without this commit, rawhide image does not work:
```console
$ sudo ctr run --rm --net-host --seccomp registry.fedoraproject.org/fedora:rawhide foo /usr/bin/curl google.com
curl: (6) getaddrinfo() thread failed to start
```
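
The behaviour described above can be sketched as a small model. This is an illustration only, not containerd's code: the `clone3_rule` helper is hypothetical, and the dictionary keys (`names`, `action`, `errnoRet`) are borrowed from the OCI seccomp JSON format. It shows only that `clone3` fails with `ENOSYS` rather than `EPERM` when `CAP_SYS_ADMIN` is not granted, which is understood to let the C library fall back to `clone`.

```python
import errno


def clone3_rule(capabilities):
    """Return a hypothetical seccomp rule for clone3 for the given capability set."""
    if "CAP_SYS_ADMIN" in capabilities:
        # With CAP_SYS_ADMIN the syscall is simply allowed.
        return {"names": ["clone3"], "action": "SCMP_ACT_ALLOW"}
    # Without it, report "not implemented" instead of the default EPERM.
    return {
        "names": ["clone3"],
        "action": "SCMP_ACT_ERRNO",
        "errnoRet": errno.ENOSYS,
    }


print(clone3_rule(set()))
print(clone3_rule({"CAP_SYS_ADMIN"}))
```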

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2021-09-15 14:44:45 +09:00
Akihiro Suda
d3aa7ee9f0
Run go fmt with Go 1.17
The new `go fmt` adds `//go:build` lines (https://golang.org/doc/go1.17#tools).

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2021-08-22 09:31:50 +09:00
Sebastiaan van Stijn
e1445dff12
profiles: seccomp: update to Linux 5.11 syscall list
These syscalls (some of which have been in Linux for a while but were
missing from the profile) fall into a few buckets:

 * close_range(2), epoll_pwait2(2) are just extensions of existing "safe
   for everyone" syscalls.

 * The mountv2 API syscalls (fs*(2), move_mount(2), open_tree(2)) are
   all equivalent to aspects of mount(2) and thus go into the
   CAP_SYS_ADMIN category.

 * process_madvise(2) is similar to the other process_*(2) syscalls and
   thus goes in the CAP_SYS_PTRACE category.
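
The three buckets above can be sketched as a lookup from gating capability to syscall names. This is a rough model under stated assumptions: `BUCKETS` and `allowed_syscalls` are hypothetical names, `fs*(2)` is expanded here to `fsconfig`, `fsmount`, `fsopen` and `fspick`, and the 5.11 epoll syscall is written with its kernel name `epoll_pwait2`.

```python
# Hypothetical grouping: key is the capability that gates the syscalls,
# None means "allowed for everyone".
BUCKETS = {
    None: ["close_range", "epoll_pwait2"],
    "CAP_SYS_ADMIN": [
        "fsconfig", "fsmount", "fsopen", "fspick",  # assumed expansion of fs*(2)
        "move_mount", "open_tree",
    ],
    "CAP_SYS_PTRACE": ["process_madvise"],
}


def allowed_syscalls(capabilities):
    """Return the sorted syscalls from BUCKETS permitted for a capability set."""
    allowed = []
    for cap, names in BUCKETS.items():
        if cap is None or cap in capabilities:
            allowed.extend(names)
    return sorted(allowed)


print(allowed_syscalls(set()))
print(allowed_syscalls({"CAP_SYS_ADMIN"}))
```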

Co-authored-by: Aleksa Sarai <asarai@suse.de>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-01-21 21:30:25 +01:00
Sebastiaan van Stijn
0a1104bcf3
seccomp: add pidfd_getfd syscall (gated by CAP_SYS_PTRACE)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-11-13 13:36:33 +01:00
Sebastiaan van Stijn
2dbbd10fd6
seccomp: add pidfd_open and pidfd_send_signal
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-11-13 13:36:25 +01:00
Michael Crosby
f9d231f660
Merge pull request #4493 from thaJeztah/seccomp_uring
seccomp: allow io-uring related system calls
2020-08-25 11:39:45 -04:00
Michael Crosby
396b863138
Merge pull request #4491 from thaJeztah/seccomp_syslog
seccomp: move the syslog syscall to be gated by CAP_SYS_ADMIN or CAP_SYSLOG
2020-08-25 11:35:28 -04:00
Sebastiaan van Stijn
325bac7c71
seccomp: allow io-uring related system calls
Adds the io-uring related system calls introduced in kernel 5.1 to the
seccomp whitelist. With older kernels or older versions of libseccomp,
this configuration will be omitted.

Note that io_uring will grow support for more syscalls in the future
so we should keep an eye on this.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-08-24 12:59:53 +02:00
Sebastiaan van Stijn
0a5ee7e6f3
seccomp: allow clock_settime when CAP_SYS_TIME is added
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-08-24 12:43:21 +02:00
Sebastiaan van Stijn
5cdb6e81d2
seccomp: allow quotactl with CAP_SYS_ADMIN
This allows the quotactl syscall in the default seccomp profile, gated by
CAP_SYS_ADMIN.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-08-24 12:40:43 +02:00
Sebastiaan van Stijn
5862285fac
seccomp: allow sync_file_range2 on supported architectures.
On a ppc64le host, running postgres (tried with 9.4 to 9.6) gives the following
warning when trying to flush data to disks (which happens very frequently):

     WARNING: could not flush dirty data: Operation not permitted.

A quick dig in the postgres source code indicates it uses sync_file_range(2) to
flush data, which on ppc64le and arm64 is translated to sync_file_range2(2)
for alignment reasons.

The profile did not allow sync_file_range2(2), making postgres sad because
it cannot flush its buffers. arm_sync_file_range(2) is an ancient alias to
sync_file_range2(2), the syscall was renamed in Linux 2.6.22 when the same
syscall was added for PowerPC.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-08-24 12:36:53 +02:00
Sebastiaan van Stijn
117d678749
seccomp: allow personality with UNAME26 bit set
From personality(2):

    Have uname(2) report a 2.6.40+ version number rather than a 3.x version
    number. Added as a stopgap measure to support broken applications that
    could not handle the kernel version-numbering switch from 2.6.x to 3.x.

This allows both "UNAME26|PER_LINUX" and "UNAME26|PER_LINUX32".
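
As a minimal sketch of what the two combinations look like numerically: the constants below are the values from the kernel's `personality.h` as commonly documented, while `ALLOWED_PERSONAS` and `personality_allowed` are hypothetical names. The model covers only the two values this commit adds, not the ones the profile allowed before it.

```python
# Constant values assumed from the kernel's personality definitions.
PER_LINUX = 0x0000
PER_LINUX32 = 0x0008
UNAME26 = 0x0020000

# Hypothetical set of personality(2) argument values added by the commit.
ALLOWED_PERSONAS = {UNAME26 | PER_LINUX, UNAME26 | PER_LINUX32}


def personality_allowed(persona):
    """Exact-match check, in the spirit of an equality comparison on argument 0."""
    return persona in ALLOWED_PERSONAS


# Prints: 0x20000 0x20008
print(hex(UNAME26 | PER_LINUX), hex(UNAME26 | PER_LINUX32))
```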

Fixes: "setarch broken in docker packages from Debian stretch"

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-08-24 12:27:14 +02:00
Sebastiaan van Stijn
fc9e5d161a
seccomp: allow syscall membarrier
Add the membarrier syscall to the default seccomp profile.
It is for example used in the implementation of dlopen() in
the musl libc of Alpine images.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-08-24 12:16:26 +02:00
Sebastiaan van Stijn
1746a195e9
seccomp: allow adjtimex get time operation
Enabled adjtimex in the default profile without requiring CAP_SYS_TIME privilege.
The kernel will check CAP_SYS_TIME and won't allow setting the time.

Fixes: Getting the system time with ntptime returns an error in an unprivileged
container

To verify, inside a CentOS 7 container:

    yum install -y ntp
    ntptime
    # ntp_gettime() returns code 0 (OK)

    ntpdate -v time.nist.gov
    # ntpdate[84]: Can't adjust the time of day: Operation not permitted

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-08-24 12:16:23 +02:00
Sebastiaan van Stijn
7e7545e556
seccomp: allow preadv2 and pwritev2 syscalls
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-08-24 12:16:21 +02:00
Sebastiaan van Stijn
267a0cf68e
seccomp: move the syslog syscall to be gated by CAP_SYS_ADMIN or CAP_SYSLOG
This call is what is used to implement `dmesg` to get kernel messages
about the host. This can leak substantial information about the host.
It is normally available to unprivileged users on the host, unless
the sysctl `kernel.dmesg_restrict = 1` is set, but this is not set
by default on the majority of distributions. Blocking this to restrict
leaks about the configuration seems correct.

Relates to docker/docker#37897 "docker exposes dmesg to containers by default"

See also https://googleprojectzero.blogspot.com/2018/09/a-cache-invalidation-bug-in-linux.html

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-08-24 11:57:48 +02:00
Jintao Zhang
6a915a1453
seccomp: add faccessat2 syscall.
related to https://patchwork.kernel.org/patch/11545287/

Signed-off-by: Jintao Zhang <zhangjintao9020@gmail.com>
2020-08-17 21:48:21 +08:00
Jintao Zhang
e28e55f455
seccomp: add openat2 syscall.
related to https://patchwork.kernel.org/patch/11167585/

Signed-off-by: Jintao Zhang <zhangjintao9020@gmail.com>
2020-08-16 16:28:21 +08:00
Florian Schmaus
e977564a8b
seccomp: allow 'rseq' syscall in default seccomp profile
Restartable Sequences (rseq) are a kernel-based mechanism for fast
update operations on per-core data in user-space. Some libraries, like
the newest version of Google's TCMalloc, depend on it [1].

This also makes Docker's default seccomp profile on par with systemd's,
which enabled 'rseq' in early 2019 [2].

1: https://google.github.io/tcmalloc/design.html
2: systemd/systemd@6fee3be

Signed-off-by: Florian Schmaus <flo@geekplace.eu>
2020-06-26 17:10:05 +02:00
Michael Crosby
0f831093ce
Update usage of whitelist in project
Signed-off-by: Michael Crosby <michael@thepasture.io>
2020-06-08 12:49:22 -05:00
Kenta Tada
03755821d2
seccomp: remove the unused query_module(2)
query_module(2) is only in kernels before Linux 2.6.

Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
2020-05-19 10:36:55 +09:00
Stanislav Levin
5765991f2c
seccomp: Whitelist clock_adjtime
This only allows making the syscall. CAP_SYS_TIME is still required
for time adjustment (enforced by the kernel):

```
kernel/time/posix-timers.c:

SYSCALL_DEFINE2(clock_adjtime, const clockid_t, which_clock,
                struct __kernel_timex __user *, utx)
...
        err = do_clock_adjtime(which_clock, &ktx);

int do_clock_adjtime(const clockid_t which_clock, struct __kernel_timex * ktx)
{
...
        return kc->clock_adj(which_clock, ktx);

static const struct k_clock clock_realtime = {
...
        .clock_adj              = posix_clock_realtime_adj,

static int posix_clock_realtime_adj(const clockid_t which_clock,
                                    struct __kernel_timex *t)
{
        return do_adjtimex(t);

kernel/time/timekeeping.c:

int do_adjtimex(struct __kernel_timex *txc)
{
...
        /* Validate the data before disabling interrupts */
        ret = timekeeping_validate_timex(txc);

static int timekeeping_validate_timex(const struct __kernel_timex *txc)
{
        if (txc->modes & ADJ_ADJTIME) {
...
                if (!(txc->modes & ADJ_OFFSET_READONLY) &&
                    !capable(CAP_SYS_TIME))
                        return -EPERM;
        } else {
                /* In order to modify anything, you gotta be super-user! */
                if (txc->modes && !capable(CAP_SYS_TIME))
                        return -EPERM;

```
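
The capability check at the end of that call chain can be modelled in a few lines. This is a simplified sketch, not the kernel's logic in full: it reproduces only the two `capable(CAP_SYS_TIME)` branches visible in the excerpt, skips the elided validation, and assumes the flag values `ADJ_ADJTIME = 0x8000` and `ADJ_OFFSET_READONLY = 0x2000` from the kernel's timex headers. `validate_timex` is a hypothetical name.

```python
import errno

# Flag values assumed from the kernel's timex headers.
ADJ_ADJTIME = 0x8000
ADJ_OFFSET_READONLY = 0x2000


def validate_timex(modes, has_cap_sys_time):
    """Return 0 if the request passes the capability check, else -EPERM."""
    if modes & ADJ_ADJTIME:
        # adjtime-style call: only the read-only form is open to everyone.
        if not (modes & ADJ_OFFSET_READONLY) and not has_cap_sys_time:
            return -errno.EPERM
    else:
        # Any non-zero mode modifies something and needs CAP_SYS_TIME.
        if modes and not has_cap_sys_time:
            return -errno.EPERM
    return 0


print(validate_timex(0, False))       # pure read, no capability
print(validate_timex(0x0001, False))  # attempt to modify, no capability
```

In this model a read-only query (`modes == 0`) succeeds without `CAP_SYS_TIME`, while any modifying request is rejected, which matches the commit's claim that allowing the syscall does not by itself allow time adjustment.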

Fixes: moby/moby 40919
Signed-off-by: Stanislav Levin <slev@altlinux.org>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-05-17 23:11:04 +02:00
Sebastiaan van Stijn
9529c69b8a
seccomp: add 64-bit time_t syscalls
Relates to https://patchwork.kernel.org/patch/10756415/

Added to whitelist:

- `clock_getres_time64` (equivalent of `clock_getres`, which was whitelisted)
- `clock_gettime64` (equivalent of `clock_gettime`, which was whitelisted)
- `clock_nanosleep_time64` (equivalent of `clock_nanosleep`, which was whitelisted)
- `futex_time64` (equivalent of `futex`, which was whitelisted)
- `io_pgetevents_time64` (equivalent of `io_pgetevents`, which was whitelisted)
- `mq_timedreceive_time64` (equivalent of `mq_timedreceive`, which was whitelisted)
- `mq_timedsend_time64` (equivalent of `mq_timedsend`, which was whitelisted)
- `ppoll_time64` (equivalent of `ppoll`, which was whitelisted)
- `pselect6_time64` (equivalent of `pselect6`, which was whitelisted)
- `recvmmsg_time64` (equivalent of `recvmmsg`, which was whitelisted)
- `rt_sigtimedwait_time64` (equivalent of `rt_sigtimedwait`, which was whitelisted)
- `sched_rr_get_interval_time64` (equivalent of `sched_rr_get_interval`, which was whitelisted)
- `semtimedop_time64` (equivalent of `semtimedop`, which was whitelisted)
- `timer_gettime64` (equivalent of `timer_gettime`, which was whitelisted)
- `timer_settime64` (equivalent of `timer_settime`, which was whitelisted)
- `timerfd_gettime64` (equivalent of `timerfd_gettime`, which was whitelisted)
- `timerfd_settime64` (equivalent of `timerfd_settime`, which was whitelisted)
- `utimensat_time64` (equivalent of `utimensat`, which was whitelisted)

Not added to whitelist:

- `clock_adjtime64` (equivalent of `clock_adjtime`, which was not whitelisted)
- `clock_settime64` (equivalent of `clock_settime`, which was not whitelisted)
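
The rule behind both lists is that a 64-bit `time_t` variant is allowed exactly when its base syscall already was. A small sketch of that rule, using a subset of the pairs listed above; `TIME64_VARIANTS`, `time64_additions` and the sample `base_allowlist` are hypothetical names and data, not the project's code.

```python
# (base syscall -> 64-bit time_t variant), a subset of the pairs in the lists above.
TIME64_VARIANTS = {
    "clock_gettime": "clock_gettime64",
    "futex": "futex_time64",
    "ppoll": "ppoll_time64",
    "clock_adjtime": "clock_adjtime64",
    "clock_settime": "clock_settime64",
}


def time64_additions(allowlist):
    """Return the time64 variants whose base syscall is already allowed."""
    return sorted(
        variant for base, variant in TIME64_VARIANTS.items() if base in allowlist
    )


# Assumed allowlist before the commit: the clock-setting calls are absent.
base_allowlist = {"clock_gettime", "futex", "ppoll"}
print(time64_additions(base_allowlist))
```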

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-03-25 14:07:38 +01:00
Michael Crosby
86f8be86e1
Add sigprocmask to default profile
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2019-08-29 11:07:03 -04:00
Kenta Tada
5b9a43d2e7
Fix seccomp contributed profile for clone syscall
All clone flags for namespace should be denied.
Also x/sys should be used instead of syscall.
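
Denying every namespace flag amounts to a masked comparison on the `clone` flags argument. The sketch below illustrates the idea and is not the profile's actual rule: the flag values are the standard Linux `CLONE_NEW*` constants, the exact set of flags in the mask is an assumption, and `clone_allowed` is a hypothetical helper.

```python
# Standard Linux namespace flags for clone(2).
CLONE_NEWNS = 0x00020000
CLONE_NEWCGROUP = 0x02000000
CLONE_NEWUTS = 0x04000000
CLONE_NEWIPC = 0x08000000
CLONE_NEWUSER = 0x10000000
CLONE_NEWPID = 0x20000000
CLONE_NEWNET = 0x40000000

# Assumed mask: every namespace-creating flag listed above.
NAMESPACE_FLAGS = (
    CLONE_NEWNS | CLONE_NEWCGROUP | CLONE_NEWUTS | CLONE_NEWIPC
    | CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET
)


def clone_allowed(flags):
    """Masked-equality check: allow clone only if no namespace flag is set."""
    return (flags & NAMESPACE_FLAGS) == 0


SIGCHLD = 17  # typical low bits of a fork-like clone call
print(clone_allowed(SIGCHLD))
print(clone_allowed(CLONE_NEWUSER | SIGCHLD))
```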

Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
2019-06-03 14:23:34 +09:00
Sebastiaan van Stijn
8f8fd3c3a8
seccomp: whitelist statx syscall
This whitelists the statx syscall. libseccomp 2.3.3 or later
is needed for this; older libseccomp versions will ignore it.

Equivalent of https://github.com/moby/moby/pull/36417

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-03-20 11:59:02 +01:00
Avi Kivity
4506eb45bf
seccomp: whitelist io_pgetevents
io_pgetevents() is a new Linux system call, similar to the already-whitelisted
io_getevents(). It has no security implications. Whitelist it so applications can
use the new system call.

Fixes #3105.

Signed-off-by: Avi Kivity <avi@scylladb.com>
2019-03-19 11:56:32 +02:00