containerd

Author	SHA1	Message	Date
Wei Fu	3742f7f0db	idmapped: use pidfd to avoid pid reuse issue

It's a follow-up for #5890.

The containerd-shim process depends on the mount package to init the rootfs for a container. For a container with user namespace enabled, the mount package needs to fork a child process to get a brand-new user namespace. However, there are two reapers in one process (listed below), which leads to race conditions.

1. mount package
2. sys.Reaper as the global one, which watches all the SIGCHLD.

=== [kill(2)][kill] the wrong process ===

Currently, we use a pipe to ensure that the child process is alive. However, the pipe file descriptor can be held by another process, in which case the child process cannot exit by itself. We should use [kill(2)][kill] to make sure the child process is terminated. But we might kill the wrong process if the child process has been reaped by containerd-shim and the PID has been reused by another process.

=== [waitid(2)][waitid] on the wrong child process ===

```
containerd-shim process:

Goroutine 1(GetUsernsFD):                   Goroutine 2(Reaper)

1. Ready to wait for child process X

                                            2. Received SIGCHLD from X

                                            3. Reaped the zombie child process X

                                            (X has been reused by other child process)

4. Wait on process X

The goroutine 1 will be stuck until the process X has been terminated.
```

=== open `/proc/X/ns/user` on the wrong child process ===

There is also a pid-reuse risk between opening `/proc/$pid/ns/user` and writing `/proc/$pid/u[g]id_map`.

```
containerd-shim process:

Goroutine 1(GetUsernsFD):                   Goroutine 2(Reaper)

1. Fork child process X

2. Write /proc/X/uid_map,gid_map

                                            3. Received SIGCHLD from X

                                            4. Reaped the zombie child process X

                                            (X has been reused by other process)

5. Open /proc/X/ns/user file as usernsFD

The usernsFD links to the wrong X!!!
```

In order to fix the race conditions, we should use [CLONE_PIDFD][clone2] (Since Linux v5.2). When we fork child process `X`, the kernel will return a process file descriptor `X_PIDFD` referring to child process `X`. With the pidfd, we can use [pidfd_send_signal(2)][pidfd_send_signal] (Since Linux v5.1) to send signal(0) to check that the child process `X` is alive. If `X` has terminated and its PID has been recycled for another process, pidfd_send_signal fails with the error ESRCH.

Therefore, we can open the `/proc/X/{ns/user,uid_map,gid_map}` file descriptors first and then use pidfd_send_signal to check that the process is still alive. If so, we can be sure the file descriptors are valid and refer to the child process `X`. Even if the `X` PID has been reused after the pidfd_send_signal call, the file descriptors are still valid.

```code
X, pidfd = clone2(CLONE_PIDFD)

usernsFD = open /proc/X/ns/user
uidmapFD = open /proc/X/uid_map
gidmapFD = open /proc/X/gid_map

pidfd_send_signal pidfd, signal(0)
  return err if no such process

== When we arrive here, we can ensure usernsFD/uidmapFD/gidmapFD are correct
== even if X has been reused after pidfd_send_signal call.

update uid/gid mapping by uidmapFD/gidmapFD
return usernsFD
```

And [waitid(2)][waitid] also supports the pidfd type (Since Linux 5.4). We can use pidfd-type waitid to ensure we are waiting for the correct process. All the PID-related race-condition issues can be resolved by pidfd.

```bash
➜  mount git:(followup-idmapped) pwd
/home/fuwei/go/src/github.com/containerd/containerd/mount

➜  mount git:(followup-idmapped) sudo go test -test.root -run TestGetUsernsFD -count=1000 -failfast -p 100 ./...
PASS
ok github.com/containerd/containerd/mount 3.446s
```

[kill]: <https://man7.org/linux/man-pages/man2/kill.2.html>
[clone2]: <https://man7.org/linux/man-pages/man2/clone.2.html>
[pidfd_send_signal]: <https://man7.org/linux/man-pages/man2/pidfd_send_signal.2.html>
[waitid]: <https://man7.org/linux/man-pages/man2/waitid.2.html>

Signed-off-by: Wei Fu <fuweid89@gmail.com>	2023-10-13 00:56:55 +08:00
Ilya Hanov	1555a31bf6	mount: support idmapped mount points This patch introduces idmapped mounts support for container rootfs. The idmapped mounts support was merged in Linux kernel 5.12 torvalds/linux@7d6beb7. This functionality makes it possible to avoid the chown overhead for containers that use a user namespace. The changes are based on the experimental patchset published by Mauricio Vásquez #4734. The current version reimplements support of idmapped mounts using Golang. Performance measurement results (image: idmapped mount vs. recursive chown): BusyBox 00.135 vs. 04.964; Ubuntu 00.171 vs. 15.713; Fedora 00.143 vs. 38.799. Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io> Signed-off-by: Artem Kuzin <artem.kuzin@huawei.com> Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com> Signed-off-by: Ilya Hanov <ilya.hanov@huawei-partners.com>	2023-09-05 01:23:30 +03:00
Akihiro Suda	98f27e1d9c	Revert "Add support for mounts on Darwin" This reverts commit `2799b28e61`. Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2023-07-19 00:22:20 +09:00
Marat Radchenko	2799b28e61	Add support for mounts on Darwin Signed-off-by: Marat Radchenko <marat@slonopotamus.org>	2023-07-17 23:27:04 +03:00
Danny Canter	7ef133ad47	Fix mount pkg typo retired -> retried Signed-off-by: Danny Canter <danny@dcantah.dev>	2023-07-10 01:45:17 -07:00
Wei Fu	72b7d16505	mount: support direct-io for loopback device Signed-off-by: Wei Fu <fuweid89@gmail.com>	2023-06-15 23:51:46 +08:00
Craig Ingram	d2605de734	add handling of a '.' commondir and bounds checking to mount_linux Signed-off-by: Craig Ingram <Cjingram@google.com>	2023-05-30 21:13:16 +00:00
Edgar Lee	34d5878185	Use mount.Target to specify subdirectory of rootfs mount - Add Target to mount.Mount. - Add UnmountMounts to unmount a list of mounts in reverse order. - Add UnmountRecursive to unmount deepest mount first for a given target, using moby/sys/mountinfo. Signed-off-by: Edgar Lee <edgarhinshunlee@gmail.com>	2023-01-27 09:51:58 +08:00
Brian Goff	a24ef09937	Replace mount fork hack with CLONE_FS This change spins up a new goroutine, locks it to a thread, then unshares CLONE_FS which allows us to `Chdir` from inside the thread without affecting the rest of the program. The thread is no longer usable after unshare so it leaves the thread locked to prevent go from returning the thread to the thread pool. Signed-off-by: Brian Goff <cpuguy83@gmail.com>	2022-11-03 22:30:35 +00:00
haoyun	bbe46b8c43	feat: replace github.com/pkg/errors to errors Signed-off-by: haoyun <yun.hao@daocloud.io> Co-authored-by: zounengren <zouyee1989@gmail.com>	2022-01-07 10:27:03 +08:00
haoyun	c0d07094be	feat: Errorf usage Signed-off-by: haoyun <yun.hao@daocloud.io>	2021-12-13 14:31:53 +08:00
Sebastiaan van Stijn	2ac9968401	replace uses of os/exec with golang.org/x/sys/execabs Go 1.15.7 contained a security fix for CVE-2021-3115, which allowed arbitrary code to be executed at build time when using cgo on Windows. This issue also affects Unix users who have “.” listed explicitly in their PATH and are running “go get” outside of a module or with module mode disabled. This issue is not limited to the go command itself, and can also affect binaries that use `exec.Command`, `exec.LookPath`, etc. From the related blogpost (https://blog.golang.org/path-security): > Are your own programs affected? > > If you use exec.LookPath or exec.Command in your own programs, you only need to > be concerned if you (or your users) run your program in a directory with untrusted > contents. If so, then a subprocess could be started using an executable from dot > instead of from a system directory. (Again, using an executable from dot happens > always on Windows and only with uncommon PATH settings on Unix.) > > If you are concerned, then we’ve published the more restricted variant of os/exec > as golang.org/x/sys/execabs. You can use it in your program by simply replacing This patch replaces all uses of `os/exec` with `golang.org/x/sys/execabs`. While some uses of `os/exec` should not be problematic (e.g. part of tests), it is probably good to be consistent, in case code gets moved around. Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2021-08-25 18:11:09 +02:00
Sebastiaan van Stijn	a964cf0cc4	un-export mount.FMountat It's only used internally, so we can un-export this utility until it is needed elsewhere. Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2021-06-23 18:14:53 +02:00
Sebastiaan van Stijn	21f532d518	move sys.FMountat() into mount package It's the only location this is used, so might as well move it into that package. I could not find external users of this utility, so not adding an alias / deprecation. Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2021-06-23 18:14:30 +02:00
Kazuyoshi Kato	05a2e280ac	mount: make setupLoop() work with Autoclear setupLoop()'s Autoclear (LO_FLAGS_AUTOCLEAR) will destroy the loopback device when all associated file descriptors are closed. However, this behavior didn't work before since setupLoop() was returning a file name. The loopback device was destroyed at the end of the function when LoopParams had Autoclear = true. Fixes #4969. Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>	2021-02-04 11:04:04 -08:00
Maksym Pavlenko	c5fa0298c1	Address loop dev PR comments #4178 Signed-off-by: Maksym Pavlenko <pavlenko.maksym@gmail.com>	2021-01-04 10:44:29 -08:00
Peng Tao	9e42070169	mount: handle loopback mount If a mount has specified `loop` option, we need to handle it on our own instead of passing it to the kernel. In such case, create a loopback device, attach the mount source to it, and mount the loopback device rather than the mount source. Signed-off-by: Peng Tao <bergwolf@hyper.sh>	2021-01-04 10:14:55 -08:00
Sebastiaan van Stijn	48f64a18be	mount: extract FUSE unmounting to a function Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2020-10-01 17:29:40 +02:00
Sebastiaan van Stijn	5b13dcc73a	mount.isFUSE(): remove unused error return The error itself was unused, so may as well remove it. Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2020-09-28 21:43:21 +02:00
Akihiro Suda	403dc83a29	mount: retry executing the helper binary on ECHILD `exec.CombinedOutput()` intermittently returns `ECHILD` due to our signal handling. `wait(2)`: https://man7.org/linux/man-pages/man2/wait.2.html > ECHILD (for waitpid() or waitid()) The process specified by pid > (waitpid()) or idtype and id (waitid()) does not exist or is > not a child of the calling process. (This can happen for > one's own child if the action for SIGCHLD is set to SIG_IGN. > See also the Linux Notes section about threads.) Fix #4387 Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2020-07-22 14:24:08 +09:00
Akihiro Suda	e739314ed4	mount: support FUSE helper When m.Type starts with either `fuse.` or `fuse3`, the mount helper binary `mount.fuse` or `mount.fuse3` is executed. Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2020-01-01 04:16:30 +09:00
Georgi Sabev	ae5ca8177d	Refactor mount path check and add comments Co-authored-by: Danail Branekov <danailster@gmail.com> Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>	2019-04-09 16:20:05 +03:00
Georgi Sabev	c0f0b21314	Apply PR feedback * Rootfs dir is created during container creation not during bundle creation * Add support for v2 * UnmountAll is a no-op when the path to unmount (i.e. the rootfs dir) does not exist or is invalid Co-authored-by: Danail Branekov <danailster@gmail.com> Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>	2019-04-04 18:40:30 +03:00
Wei Fu	67b54c6670	Support >= 128 layers in overlayfs snapshots Auto-detect the longest common dir in the lowerdir option and compact it if the option size is hitting one page size. If it does, use chdir + CLONE to do the mount, to avoid hitting the one-page argument buffer limit of mount in the Linux kernel. Signed-off-by: Wei Fu <fhfuwei@163.com>	2018-08-07 10:59:36 +08:00
Kunal Kushwaha	b12c3215a0	Licence header added Signed-off-by: Kunal Kushwaha <kushwaha_kunal_v7@lab.ntt.co.jp>	2018-02-19 10:32:26 +09:00
Michael Crosby	b0ca685874	Retry unmount on EBUSY and return errors This is another WIP to fix #1785. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2017-12-04 11:31:08 -05:00
Michael Crosby	451421b615	Comment more packages to pass go lint Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2017-10-02 13:54:56 -04:00
Akihiro Suda	a560e5e0ef	mount: fix read-only bind (#1368 ) Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>	2017-09-04 04:44:56 +00:00
Ian Campbell	d42cb88ba2	Loop umount'ing rootfs until there are no more mounts This is simpler than trying to count how many successful mounts we made. Signed-off-by: Ian Campbell <ian.campbell@docker.com>	2017-07-20 10:50:08 +01:00
Michael Crosby	d7af92e00c	Move Mount into mount pkg This moves both the Mount type and mountinfo into a single mount package. This also opens up the root of the repo to hold the containerd client implementation. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2017-05-22 16:41:12 -07:00

30 Commits