doc: Add documentation about CRI user namespaces
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
										146
									
								
								docs/user-namespaces/README.md
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
							| @@ -0,0 +1,146 @@ | ||||
| # Support for user namespaces | ||||
|  | ||||
| Kubernetes has supported running pods with user namespaces since v1.25. This document explains the | ||||
| containerd support for this feature. | ||||
|  | ||||
| ## What are user namespaces? | ||||
|  | ||||
| A user namespace isolates the user running inside the container from the one in the host. | ||||
|  | ||||
| A process running as root in a container can run as a different (non-root) user in the host; in | ||||
| other words, the process has full privileges for operations inside the user namespace, but is | ||||
| unprivileged for operations outside the namespace. | ||||
|  | ||||
| You can use this feature to reduce the damage a compromised container can do to the host or other | ||||
| pods in the same node. Several security vulnerabilities rated either HIGH or CRITICAL are not | ||||
| exploitable when user namespaces are active. User namespaces are expected to mitigate some future | ||||
| vulnerabilities too. | ||||
|  | ||||
| See [the Kubernetes documentation][kube-intro] for a high-level introduction to | ||||
| user namespaces. | ||||
|  | ||||
| [kube-intro]: https://kubernetes.io/docs/concepts/workloads/pods/user-namespaces/#introduction | ||||
|  | ||||
| ## Stack requirements | ||||
|  | ||||
| The Kubernetes implementation was redesigned in 1.27, so the requirements differ for Kubernetes | ||||
| versions before 1.27 and for versions 1.27 and greater. | ||||
|  | ||||
| Please note that if you try to use user namespaces with containerd 1.6 or older, the `hostUsers: | ||||
| false` setting in your pod.spec will be **silently ignored**. | ||||
|  | ||||
| ### Kubernetes 1.25 and 1.26 | ||||
|  | ||||
|  * containerd 1.7 or greater | ||||
|  * runc 1.1 or greater | ||||
|  | ||||
| ### Kubernetes 1.27 and greater | ||||
|  | ||||
|  * Linux 6.3 or greater | ||||
|  * containerd 2.0 or greater | ||||
|  * You can use runc or crun as the OCI runtime: | ||||
|    * runc 1.2 or greater | ||||
|    * crun 1.9 or greater | ||||
|  | ||||
| Furthermore, all the file-systems used by the volumes in the pod need kernel-support for idmap | ||||
| mounts. Some popular file-systems that support idmap mounts in Linux 6.3 are: `btrfs`, `ext4`, `xfs`, | ||||
| `fat`, `tmpfs`, `overlayfs`. | ||||
|  | ||||
| The kubelet is in charge of populating some files in the containers (such as ConfigMaps and | ||||
| Secrets). The file-system used in that path needs to support idmap mounts too. See [the Kubernetes | ||||
| documentation][kube-req] for more info on that. | ||||
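As a rough aid for the kernel requirement above, the following sketch compares the running kernel version against 6.3. The helper name `version_ge` is made up for this example, and it assumes a `sort` that supports `-V` (GNU coreutils or BusyBox). It only checks the kernel version; it does not check containerd, the OCI runtime, or file-system support for idmap mounts.

```shell
#!/bin/sh
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Relies on `sort -V` (version sort); the helper name is illustrative only.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Strip any distribution suffix, e.g. "6.5.0-14-generic" -> "6.5.0".
kernel="$(uname -r | cut -d- -f1)"

if version_ge "$kernel" 6.3; then
    echo "kernel $kernel: meets the Linux 6.3 requirement"
else
    echo "kernel $kernel: older than Linux 6.3"
fi
```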
|  | ||||
|  | ||||
| [kube-req]: https://kubernetes.io/docs/concepts/workloads/pods/user-namespaces/#before-you-begin | ||||
|  | ||||
| ## Creating a Kubernetes pod with user namespaces | ||||
|  | ||||
| First check your containerd, Linux and Kubernetes versions. If those are okay, then there is no | ||||
| special configuration needed in containerd. You can just follow the steps on the [Kubernetes | ||||
| website][kube-example]. | ||||
|  | ||||
| [kube-example]: https://kubernetes.io/docs/tasks/configure-pod-container/user-namespaces/ | ||||
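For orientation, here is a minimal sketch of what such a pod can look like. The only field that matters is `hostUsers: false`; the pod name, container name, image and output path are placeholders chosen for this example, and the helper `write_userns_pod` is hypothetical. The Kubernetes website remains the reference for the exact steps.

```shell
#!/bin/sh
# write_userns_pod FILE: write a minimal pod manifest that opts in to a user
# namespace via `hostUsers: false`. Pod name, container name and image are
# placeholders chosen for this sketch.
write_userns_pod() {
    cat > "$1" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: userns-demo
spec:
  hostUsers: false
  containers:
  - name: shell
    image: debian
    command: ["sleep", "infinity"]
EOF
}

manifest="${TMPDIR:-/tmp}/userns-demo.yaml"
write_userns_pod "$manifest"
echo "wrote $manifest"

# Then, on a cluster that meets the requirements:
#   kubectl apply -f "$manifest"
#   kubectl exec userns-demo -- cat /proc/self/uid_map
```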
|  | ||||
| ## Limitations | ||||
|  | ||||
| You can check the limitations Kubernetes has in [its documentation][kube-limitations]. Different | ||||
| Kubernetes versions have different limitations, so be sure to check the site for the Kubernetes | ||||
| version you are using. | ||||
|  | ||||
| Different containerd versions have different limitations too. Those are highlighted in this section. | ||||
|  | ||||
| [kube-limitations]: https://kubernetes.io/docs/concepts/workloads/pods/user-namespaces/#limitations | ||||
|  | ||||
| ### containerd 1.7 | ||||
|  | ||||
| One limitation present in containerd 1.7 is that it needs to change the ownership of every file and | ||||
| directory inside the container image, during Pod startup. This means it has a storage overhead (the | ||||
| size of the container image is duplicated each time a pod is created) and can significantly impact | ||||
| the container startup latency. | ||||
|  | ||||
| You can mitigate this limitation by switching `/sys/module/overlay/parameters/metacopy` to `Y`. This | ||||
| will significantly reduce the storage and performance overhead, as only the inode for each file of | ||||
| the container image will be duplicated, but not the content of the file. This means it will use less | ||||
| storage and it will be faster. However, it is not a panacea. | ||||
|  | ||||
| If you change the metacopy param, make sure to do it in a way that is persistent across reboots. You | ||||
| should also be aware that this setting will be used for all containers, not just containers with | ||||
| user namespaces enabled. This will affect all the snapshots that you take manually (if you happen to | ||||
| do that). In that case, make sure to use the same value of `/sys/module/overlay/parameters/metacopy` | ||||
| when creating and restoring the snapshot. | ||||
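One possible way to make the setting persistent is sketched below. It assumes that overlay is built as a loadable module, so that a modprobe option applies; if overlay is built into the kernel, a kernel command-line parameter such as `overlay.metacopy=Y` would be needed instead. The helper `persist_metacopy` and the file name under `/etc/modprobe.d/` are illustrative choices, not something containerd requires.

```shell
#!/bin/sh
# Print the current value (Y or N) when the overlay module exposes it.
param=/sys/module/overlay/parameters/metacopy
if [ -r "$param" ]; then
    echo "metacopy is currently: $(cat "$param")"
fi

# persist_metacopy FILE: write a modprobe option that enables metacopy
# whenever the overlay module is loaded. Hypothetical helper for this sketch.
persist_metacopy() {
    printf 'options overlay metacopy=Y\n' > "$1"
}

# Typical use, as root (the file name is an assumption):
#   echo Y > /sys/module/overlay/parameters/metacopy
#   persist_metacopy /etc/modprobe.d/overlay-metacopy.conf
```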
|  | ||||
| ### containerd 2.0 | ||||
|  | ||||
| The storage and latency limitations from containerd 1.7 are not present in containerd 2.0 and above | ||||
| if you use the overlay snapshotter (which is the default). Pods with user namespaces use no extra | ||||
| storage and incur no additional startup latency. | ||||
|  | ||||
| This is achieved by using the kernel feature idmap mounts with the container rootfs (the container | ||||
| image). This allows an overlay file-system to expose the image with different UID/GID without copying | ||||
| either the files or the inodes, just using a bind-mount. | ||||
|  | ||||
| To check whether idmap mounts are used for the container image, create a pod with user | ||||
| namespaces, exec into it and run: | ||||
|  | ||||
| ``` | ||||
| mount | grep overlay | ||||
| ``` | ||||
|  | ||||
| You should see a reference to the idmap mount in the `lowerdir` parameter. In this example, the | ||||
| `lowerdir` path contains `idmapped`: | ||||
|  | ||||
| ``` | ||||
| overlay on / type overlay (rw,relatime,lowerdir=/tmp/ovl-idmapped823885363/0,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/1018/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/1018/work) | ||||
| ``` | ||||
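The same check can be scripted. The sketch below assumes the temporary directory name keeps containing `idmapped`, as in the sample output above; that naming is an observation from this example, not a documented interface. The function name `rootfs_is_idmapped` is made up for illustration.

```shell
#!/bin/sh
# rootfs_is_idmapped: read `mount` output on stdin and succeed when the overlay
# mounted on / has a lowerdir whose path contains "idmapped".
rootfs_is_idmapped() {
    grep -E '^overlay on / type overlay ' | grep -Eq 'lowerdir=[^,)]*idmapped'
}

# Run inside the pod; outside a pod this normally reports no match.
if mount 2>/dev/null | rootfs_is_idmapped; then
    echo "rootfs uses an idmap mount"
else
    echo "no idmapped lowerdir found for /"
fi
```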
|  | ||||
| ## Creating a container with user namespaces with `ctr` | ||||
|  | ||||
| You can also create a container with user namespaces using `ctr`. Be warned that this is a | ||||
| lower-level approach. | ||||
|  | ||||
| Create an OCI bundle as explained [here][runc-bundle]. Then, change the owner of the rootfs to | ||||
| UID/GID 65536: | ||||
|  | ||||
| ``` | ||||
| sudo chown -R 65536:65536 rootfs/ | ||||
| ``` | ||||
|  | ||||
| Copy [this config.json](./config.json) and replace `XXX-path-to-rootfs` with the | ||||
| absolute path to the rootfs you just chowned. | ||||
|  | ||||
| Then create and start the container with: | ||||
|  | ||||
| ``` | ||||
| sudo ctr create --config <path>/config.json userns-test | ||||
| sudo ctr t start userns-test | ||||
| ``` | ||||
|  | ||||
| This will open a shell inside the container. To verify that you are inside a user namespace, | ||||
| run: | ||||
|  | ||||
| ``` | ||||
| root@runc:/# cat /proc/self/uid_map | ||||
|          0      65536      65536 | ||||
| ``` | ||||
|  | ||||
| The output should match the example exactly. | ||||
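If you prefer a scripted check, the sketch below treats any mapping other than the identity mapping `0 0 4294967295` (what the initial user namespace shows) as evidence of a user namespace. This is a heuristic and the function name `in_user_namespace` is made up for this example; a user namespace configured with a full identity mapping would not be detected.

```shell
#!/bin/sh
# in_user_namespace: read a uid_map on stdin and succeed when it differs from
# the identity mapping "0 0 4294967295" seen in the initial user namespace.
in_user_namespace() {
    awk '$1 == 0 && $2 == 0 && $3 == 4294967295 { identity = 1 }
         END { exit identity ? 1 : 0 }'
}

if [ -r /proc/self/uid_map ]; then
    if in_user_namespace < /proc/self/uid_map; then
        echo "running inside a user namespace"
    else
        echo "running in the initial user namespace"
    fi
fi
```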
|  | ||||
| [runc-bundle]: https://github.com/opencontainers/runc#creating-an-oci-bundle | ||||