It's followup for #5890.
The containerd-shim process depends on the mount package to init rootfs
for container. For the container enable user namespace, the mount
package needs to fork child process to get the brand-new user namespace.
However, there are two reapers in one process (described by the
following list) and there are race-condition cases.
1. mount package
2. sys.Reaper as global one which watch all the SIGCHLD.
=== [kill(2)][kill] the wrong process ===
Currently, we use pipe to ensure that child process is alive. However,
the pide file descriptor can be hold by other process, which the child
process cannot exit by self. We should use [kill(2)][kill] to ensure the
child process. But we might kill the wrong process if the child process
might be reaped by containerd-shim and the PID might be reused by other
process.
=== [waitid(2)][waitid] on the wrong child process ===
```
containerd-shim process:
Goroutine 1(GetUsernsFD): Goroutine 2(Reaper)
1. Ready to wait for child process X
2. Received SIGCHLD from X
3. Reaped the zombie child process X
(X has been reused by other child process)
4. Wait on process X
The goroutine 1 will be stuck until the process X has been terminated.
```
=== open `/proc/X/ns/user` on the wrong child process ===
There is also pid-reused risk between opening `/proc/$pid/ns/user` and
writing `/proc/$pid/u[g]id_map`.
```
containerd-shim process:
Goroutine 1(GetUsernsFD): Goroutine 2(Reaper)
1. Fork child process X
2. Write /proc/X/uid_map,gid_map
3. Received SIGCHLD from X
4. Reaped the zombie child process X
(X has been reused by other process)
5. Open /proc/X/ns/user file as usernsFD
The usernsFD links to the wrong X!!!
```
In order to fix the race-condition, we should use [CLONE_PIDFD][clone2] (Since
Linux v5.2).
When we fork child process `X`, the kernel will return a process file
descriptor `X_PIDFD` referencing to child process `X`. With the pidfd, we can
use [pidfd_send_signal(2)][pidfd_send_signal] (Since Linux v5.1)
to send signal(0) to ensure the child process `X` is alive. If the `X` has
terminated and its PID has been recycled for another process. The
pidfd_send_signal fails with the error ESRCH.
Therefore, we can open `/proc/X/{ns/user,uid_map,gid_map}` file
descriptors as first and then use pidfd_send_signal to check the process
is still alive. If so, we can ensure the file descriptors are valid and
reference to the child process `X`. Even if the `X` PID has been reused
after pidfd_send_signal call, the file descriptors are still valid.
```code
X, pidfd = clone2(CLONE_PIDFD)
usernsFD = open /proc/X/ns/user
uidmapFD = open /proc/X/uid_map
gidmapFD = open /proc/X/gid_map
pidfd_send_signal pidfd, signal(0)
return err if no such process
== When we arrive here, we can ensure usernsFD/uidmapFD/gidmapFD are correct
== even if X has been reused after pidfd_send_signal call.
update uid/gid mapping by uidmapFD/gidmapFD
return usernsFD
```
And the [waitid(2)][waitid] also supports pidfd type (Since Linux 5.4).
We can use pidfd type waitid to ensure we are waiting for the correct
process. All the PID related race-condition issues can be resolved by
pidfd.
```bash
➜ mount git:(followup-idmapped) pwd
/home/fuwei/go/src/github.com/containerd/containerd/mount
➜ mount git:(followup-idmapped) sudo go test -test.root -run TestGetUsernsFD -count=1000 -failfast -p 100 ./...
PASS
ok github.com/containerd/containerd/mount 3.446s
```
[kill]: <https://man7.org/linux/man-pages/man2/kill.2.html>
[clone2]: <https://man7.org/linux/man-pages/man2/clone.2.html>
[pidfd_send_signal]: <https://man7.org/linux/man-pages/man2/pidfd_send_signal.2.html>
[waitid]: <https://man7.org/linux/man-pages/man2/waitid.2.html>
Signed-off-by: Wei Fu <fuweid89@gmail.com>
This patch introduces idmapped mounts support for
container rootfs.
The idmapped mounts support was merged in Linux kernel 5.12
torvalds/linux@7d6beb7.
This functionality allows to address chown overhead for containers that
use user namespace.
The changes are based on experimental patchset published by
Mauricio Vásquez #4734.
Current version reiplements support of idmapped mounts using Golang.
Performance measurement results:
Image idmapped mount recursive chown
BusyBox 00.135 04.964
Ubuntu 00.171 15.713
Fedora 00.143 38.799
Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
Signed-off-by: Artem Kuzin <artem.kuzin@huawei.com>
Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>
Signed-off-by: Ilya Hanov <ilya.hanov@huawei-partners.com>
- Add Target to mount.Mount.
- Add UnmountMounts to unmount a list of mounts in reverse order.
- Add UnmountRecursive to unmount deepest mount first for a given target, using
moby/sys/mountinfo.
Signed-off-by: Edgar Lee <edgarhinshunlee@gmail.com>
This change spins up a new goroutine, locks it to a thread, then
unshares CLONE_FS which allows us to `Chdir` from inside the thread
without affecting the rest of the program.
The thread is no longer usable after unshare so it leaves the thread
locked to prevent go from returning the thread to the thread pool.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Go 1.15.7 contained a security fix for CVE-2021-3115, which allowed arbitrary
code to be executed at build time when using cgo on Windows. This issue also
affects Unix users who have “.” listed explicitly in their PATH and are running
“go get” outside of a module or with module mode disabled.
This issue is not limited to the go command itself, and can also affect binaries
that use `os.Command`, `os.LookPath`, etc.
From the related blogpost (ttps://blog.golang.org/path-security):
> Are your own programs affected?
>
> If you use exec.LookPath or exec.Command in your own programs, you only need to
> be concerned if you (or your users) run your program in a directory with untrusted
> contents. If so, then a subprocess could be started using an executable from dot
> instead of from a system directory. (Again, using an executable from dot happens
> always on Windows and only with uncommon PATH settings on Unix.)
>
> If you are concerned, then we’ve published the more restricted variant of os/exec
> as golang.org/x/sys/execabs. You can use it in your program by simply replacing
This patch replaces all uses of `os/exec` with `golang.org/x/sys/execabs`. While
some uses of `os/exec` should not be problematic (e.g. part of tests), it is
probably good to be consistent, in case code gets moved around.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
It's the only location this is used, so might as well move it
into that package. I could not find external users of this utility,
so not adding an alias / deprecation.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
setupLoop()'s Autoclear (LO_FLAGS_AUTOCLEAR) will destruct the
loopback device when all associated file descriptors are closed.
However this behavior didn't work before since setupLoop() was
returning a file name. The looppack device was destructed at
the end of the function when LoopParams had Autoclear = true.
Fixes#4969.
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
If a mount has specified `loop` option, we need to handle it on our
own instead of passing it to the kernel. In such case, create a
loopback device, attach the mount source to it, and mount the loopback
device rather than the mount source.
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
`exec.CombinedOutput()` intermittently returns `ECHILD` due to our
signal handling.
`wait(2)`: https://man7.org/linux/man-pages/man2/wait.2.html
> ECHILD (for waitpid() or waitid()) The process specified by pid
> (waitpid()) or idtype and id (waitid()) does not exist or is
> not a child of the calling process. (This can happen for
> one's own child if the action for SIGCHLD is set to SIG_IGN.
> See also the Linux Notes section about threads.)
Fix#4387
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
When m.Type starts with either `fuse.` or `fuse3`, the
mount helper binary `mount.fuse` or `mount.fuse3` is executed.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
* Rootfs dir is created during container creation not during bundle
creation
* Add support for v2
* UnmountAll is a no-op when the path to unmount (i.e. the rootfs dir)
does not exist or is invalid
Co-authored-by: Danail Branekov <danailster@gmail.com>
Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>
Auto-detect longest common dir in lowerdir option and compact it if the
option size is hitting one page size. If does, Use chdir + CLONE to do
mount thing to avoid hitting one page argument buffer in linux kernel
mount.
Signed-off-by: Wei Fu <fhfuwei@163.com>
This moves both the Mount type and mountinfo into a single mount
package.
This also opens up the root of the repo to hold the containerd client
implementation.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>