Commit Graph

31 Commits

Author SHA1 Message Date
Sebastiaan van Stijn
2af6db672e
switch back from golang.org/x/sys/execabs to os/exec (go1.19)
This is effectively a revert of 2ac9968401, which
switched from os/exec to the golang.org/x/sys/execabs package to mitigate
security issues (mainly on Windows) with lookups resolving to binaries in the
current directory.

from the go1.19 release notes https://go.dev/doc/go1.19#os-exec-path

> ## PATH lookups
>
> Command and LookPath no longer allow results from a PATH search to be found
> relative to the current directory. This removes a common source of security
> problems but may also break existing programs that depend on using, say,
> exec.Command("prog") to run a binary named prog (or, on Windows, prog.exe) in
> the current directory. See the os/exec package documentation for information
> about how best to update such programs.
>
> On Windows, Command and LookPath now respect the NoDefaultCurrentDirectoryInExePath
> environment variable, making it possible to disable the default implicit search
> of “.” in PATH lookups on Windows systems.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2023-11-02 21:15:40 +01:00
Wei Fu
3742f7f0db idmapped: use pidfd to avoid pid reuse issue
It's followup for #5890.

The containerd-shim process depends on the mount package to init rootfs
for container. For the container enable user namespace, the mount
package needs to fork child process to get the brand-new user namespace.
However, there are two reapers in one process (described by the
following list) and there are race-condition cases.

1. mount package
2. sys.Reaper as global one which watch all the SIGCHLD.

=== [kill(2)][kill] the wrong process ===

Currently, we use pipe to ensure that child process is alive. However,
the pide file descriptor can be hold by other process, which the child
process cannot exit by self. We should use [kill(2)][kill] to ensure the
child process. But we might kill the wrong process if the child process
might be reaped by containerd-shim and the PID might be reused by other
process.

=== [waitid(2)][waitid] on the wrong child process ===

```
containerd-shim process:

Goroutine 1(GetUsernsFD):                   Goroutine 2(Reaper)

1. Ready to wait for child process X

                                            2. Received SIGCHLD from X

                                            3. Reaped the zombie child process X

                                            (X has been reused by other child process)

4. Wait on process X

The goroutine 1 will be stuck until the process X has been terminated.
```

=== open `/proc/X/ns/user` on the wrong child process ===

There is also pid-reused risk between opening `/proc/$pid/ns/user` and
writing `/proc/$pid/u[g]id_map`.

```
containerd-shim process:

Goroutine 1(GetUsernsFD):                   Goroutine 2(Reaper)

1. Fork child process X

2. Write /proc/X/uid_map,gid_map

                                            3. Received SIGCHLD from X

                                            4. Reaped the zombie child process X

                                            (X has been reused by other process)

5. Open /proc/X/ns/user file as usernsFD

The usernsFD links to the wrong X!!!
```

In order to fix the race-condition, we should use [CLONE_PIDFD][clone2] (Since
Linux v5.2).

When we fork child process `X`, the kernel will return a process file
descriptor `X_PIDFD` referencing to child process `X`. With the pidfd, we can
use [pidfd_send_signal(2)][pidfd_send_signal] (Since Linux v5.1)
to send signal(0) to ensure the child process `X` is alive. If the `X` has
terminated and its PID has been recycled for another process. The
pidfd_send_signal fails with the error ESRCH.

Therefore, we can open `/proc/X/{ns/user,uid_map,gid_map}` file
descriptors as first and then use pidfd_send_signal to check the process
is still alive. If so, we can ensure the file descriptors are valid and
reference to the child process `X`. Even if the `X` PID has been reused
after pidfd_send_signal call, the file descriptors are still valid.

```code
X, pidfd = clone2(CLONE_PIDFD)

usernsFD = open /proc/X/ns/user
uidmapFD = open /proc/X/uid_map
gidmapFD = open /proc/X/gid_map

pidfd_send_signal pidfd, signal(0)
  return err if no such process

== When we arrive here, we can ensure usernsFD/uidmapFD/gidmapFD are correct
== even if X has been reused after pidfd_send_signal call.

update uid/gid mapping by uidmapFD/gidmapFD
return usernsFD
```

And the [waitid(2)][waitid] also supports pidfd type (Since Linux 5.4).
We can use pidfd type waitid to ensure we are waiting for the correct
process. All the PID related race-condition issues can be resolved by
pidfd.

```bash
➜  mount git:(followup-idmapped) pwd
/home/fuwei/go/src/github.com/containerd/containerd/mount

➜  mount git:(followup-idmapped) sudo go test -test.root -run TestGetUsernsFD -count=1000 -failfast -p 100  ./...
PASS
ok      github.com/containerd/containerd/mount  3.446s
```

[kill]: <https://man7.org/linux/man-pages/man2/kill.2.html>
[clone2]: <https://man7.org/linux/man-pages/man2/clone.2.html>
[pidfd_send_signal]: <https://man7.org/linux/man-pages/man2/pidfd_send_signal.2.html>
[waitid]: <https://man7.org/linux/man-pages/man2/waitid.2.html>

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-10-13 00:56:55 +08:00
Ilya Hanov
1555a31bf6 mount: support idmapped mount points
This patch introduces idmapped mounts support for
container rootfs.

The idmapped mounts support was merged in Linux kernel 5.12
torvalds/linux@7d6beb7.
This functionality allows to address chown overhead for containers that
use user namespace.

The changes are based on experimental patchset published by
Mauricio Vásquez #4734.
Current version reiplements support of idmapped mounts using Golang.

Performance measurement results:
Image           idmapped mount  recursive chown
BusyBox         00.135          04.964
Ubuntu          00.171          15.713
Fedora          00.143          38.799

Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
Signed-off-by: Artem Kuzin <artem.kuzin@huawei.com>
Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>
Signed-off-by: Ilya Hanov <ilya.hanov@huawei-partners.com>
2023-09-05 01:23:30 +03:00
Akihiro Suda
98f27e1d9c
Revert "Add support for mounts on Darwin"
This reverts commit 2799b28e61.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2023-07-19 00:22:20 +09:00
Marat Radchenko
2799b28e61 Add support for mounts on Darwin
Signed-off-by: Marat Radchenko <marat@slonopotamus.org>
2023-07-17 23:27:04 +03:00
Danny Canter
7ef133ad47 Fix mount pkg typo
retired -> retried

Signed-off-by: Danny Canter <danny@dcantah.dev>
2023-07-10 01:45:17 -07:00
Wei Fu
72b7d16505 mount: support direct-io for loopback device
Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-06-15 23:51:46 +08:00
Craig Ingram
d2605de734 add handling of a '.' commondir and bounds checking to mount_linux
Signed-off-by: Craig Ingram <Cjingram@google.com>
2023-05-30 21:13:16 +00:00
Edgar Lee
34d5878185 Use mount.Target to specify subdirectory of rootfs mount
- Add Target to mount.Mount.
- Add UnmountMounts to unmount a list of mounts in reverse order.
- Add UnmountRecursive to unmount deepest mount first for a given target, using
moby/sys/mountinfo.

Signed-off-by: Edgar Lee <edgarhinshunlee@gmail.com>
2023-01-27 09:51:58 +08:00
Brian Goff
a24ef09937 Replace mount fork hack with CLONE_FS
This change spins up a new goroutine, locks it to a thread, then
unshares CLONE_FS which allows us to `Chdir` from inside the thread
without affecting the rest of the program.

The thread is no longer usable after unshare so it leaves the thread
locked to prevent go from returning the thread to the thread pool.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2022-11-03 22:30:35 +00:00
haoyun
bbe46b8c43 feat: replace github.com/pkg/errors to errors
Signed-off-by: haoyun <yun.hao@daocloud.io>
Co-authored-by: zounengren <zouyee1989@gmail.com>
2022-01-07 10:27:03 +08:00
haoyun
c0d07094be feat: Errorf usage
Signed-off-by: haoyun <yun.hao@daocloud.io>
2021-12-13 14:31:53 +08:00
Sebastiaan van Stijn
2ac9968401
replace uses of os/exec with golang.org/x/sys/execabs
Go 1.15.7 contained a security fix for CVE-2021-3115, which allowed arbitrary
code to be executed at build time when using cgo on Windows. This issue also
affects Unix users who have “.” listed explicitly in their PATH and are running
“go get” outside of a module or with module mode disabled.

This issue is not limited to the go command itself, and can also affect binaries
that use `os.Command`, `os.LookPath`, etc.

From the related blogpost (ttps://blog.golang.org/path-security):

> Are your own programs affected?
>
> If you use exec.LookPath or exec.Command in your own programs, you only need to
> be concerned if you (or your users) run your program in a directory with untrusted
> contents. If so, then a subprocess could be started using an executable from dot
> instead of from a system directory. (Again, using an executable from dot happens
> always on Windows and only with uncommon PATH settings on Unix.)
>
> If you are concerned, then we’ve published the more restricted variant of os/exec
> as golang.org/x/sys/execabs. You can use it in your program by simply replacing

This patch replaces all uses of `os/exec` with `golang.org/x/sys/execabs`. While
some uses of `os/exec` should not be problematic (e.g. part of tests), it is
probably good to be consistent, in case code gets moved around.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-08-25 18:11:09 +02:00
Sebastiaan van Stijn
a964cf0cc4
un-export mount.FMountat
It's only used internally, so we can un-export this utility until
it is needed elsewhere.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-06-23 18:14:53 +02:00
Sebastiaan van Stijn
21f532d518
move sys.FMountat() into mount package
It's the only location this is used, so might as well move it
into that package. I could not find external users of this utility,
so not adding an alias / deprecation.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-06-23 18:14:30 +02:00
Kazuyoshi Kato
05a2e280ac mount: make setupLoop() work with with Autoclear
setupLoop()'s Autoclear (LO_FLAGS_AUTOCLEAR) will destruct the
loopback device when all associated file descriptors are closed.

However this behavior didn't work before since setupLoop() was
returning a file name. The looppack device was destructed at
the end of the function when LoopParams had Autoclear = true.

Fixes #4969.

Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
2021-02-04 11:04:04 -08:00
Maksym Pavlenko
c5fa0298c1 Address loop dev PR comments #4178
Signed-off-by: Maksym Pavlenko <pavlenko.maksym@gmail.com>
2021-01-04 10:44:29 -08:00
Peng Tao
9e42070169 mount: handle loopback mount
If a mount has specified `loop` option, we need to handle it on our
own instead of passing it to the kernel. In such case, create a
loopback device, attach the mount source to it, and mount the loopback
device rather than the mount source.

Signed-off-by: Peng Tao <bergwolf@hyper.sh>
2021-01-04 10:14:55 -08:00
Sebastiaan van Stijn
48f64a18be
mount: extract FUSE unmounting to a function
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-10-01 17:29:40 +02:00
Sebastiaan van Stijn
5b13dcc73a
mount.isFUSE(): remove unused error return
The error itself was unused, so may as well remove it.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-09-28 21:43:21 +02:00
Akihiro Suda
403dc83a29
mount: retry executing the helper binary on ECHILD
`exec.CombinedOutput()` intermittently returns `ECHILD` due to our
signal handling.

`wait(2)`: https://man7.org/linux/man-pages/man2/wait.2.html

> ECHILD (for waitpid() or waitid()) The process specified by pid
>   (waitpid()) or idtype and id (waitid()) does not exist or is
>   not a child of the calling process.  (This can happen for
>   one's own child if the action for SIGCHLD is set to SIG_IGN.
>   See also the Linux Notes section about threads.)

Fix #4387

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-07-22 14:24:08 +09:00
Akihiro Suda
e739314ed4 mount: support FUSE helper
When m.Type starts with either `fuse.` or `fuse3`, the
mount helper binary `mount.fuse` or `mount.fuse3` is executed.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-01-01 04:16:30 +09:00
Georgi Sabev
ae5ca8177d Refactor mount path check and add comments
Co-authored-by: Danail Branekov <danailster@gmail.com>
Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>
2019-04-09 16:20:05 +03:00
Georgi Sabev
c0f0b21314 Apply PR feedback
* Rootfs dir is created during container creation not during bundle
  creation
* Add support for v2
* UnmountAll is a no-op when the path to unmount (i.e. the rootfs dir)
  does not exist or is invalid

Co-authored-by: Danail Branekov <danailster@gmail.com>
Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>
2019-04-04 18:40:30 +03:00
Wei Fu
67b54c6670 Support >= 128 layers in overlayfs snapshots
Auto-detect longest common dir in lowerdir option and compact it if the
option size is hitting one page size. If does, Use chdir + CLONE to do
mount thing to avoid hitting one page argument buffer in linux kernel
mount.

Signed-off-by: Wei Fu <fhfuwei@163.com>
2018-08-07 10:59:36 +08:00
Kunal Kushwaha
b12c3215a0 Licence header added
Signed-off-by: Kunal Kushwaha <kushwaha_kunal_v7@lab.ntt.co.jp>
2018-02-19 10:32:26 +09:00
Michael Crosby
b0ca685874 Retry unmount on EBUSY and return errors
This is another WIP to fix #1785.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-12-04 11:31:08 -05:00
Michael Crosby
451421b615 Comment more packages to pass go lint
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-10-02 13:54:56 -04:00
Akihiro Suda
a560e5e0ef mount: fix read-only bind (#1368)
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2017-09-04 04:44:56 +00:00
Ian Campbell
d42cb88ba2 Loop umount'ing rootfs until there are no more mounts
This is simpler than trying to count how many successful mounts we made.

Signed-off-by: Ian Campbell <ian.campbell@docker.com>
2017-07-20 10:50:08 +01:00
Michael Crosby
d7af92e00c Move Mount into mount pkg
This moves both the Mount type and mountinfo into a single mount
package.

This also opens up the root of the repo to hold the containerd client
implementation.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-05-22 16:41:12 -07:00