Files in pkg/util/mount that contain significant code implementation for
subpaths are moved to a new package at pkg/volume/util/subpath. This
move is done in order to preserve git history.
- Move from the old github.com/golang/glog to k8s.io/klog
- klog as explicit InitFlags() so we add them as necessary
- we update the other repositories that we vendor that made a similar
change from glog to klog
* github.com/kubernetes/repo-infra
* k8s.io/gengo/
* k8s.io/kube-openapi/
* github.com/google/cadvisor
- Entirely remove all references to glog
- Fix some tests by explicit InitFlags in their init() methods
Change-Id: I92db545ff36fcec83afe98f550c9e630098b3135
IsNotMountPoint should return no error when the checked directory does not
exists - missing directory can't be mounted. Therefore containerized
kubelet should check if the target exists first before resolving symlinks.
EvalHostSymlinks() returns indistinguishible error in case the path does
not exist.
_netdev mount option is a userspace mount option and
isn't copied over when bind mount is created and remount
also does not copies it over and hence must be explicitly
used with bind mount
Automatic merge from submit-queue (batch tested with PRs 63348, 63839, 63143, 64447, 64567). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Containerized subpath
**What this PR does / why we need it**:
Containerized kubelet needs a different implementation of `PrepareSafeSubpath` than kubelet running directly on the host.
On the host we safely open the subpath and then bind-mount `/proc/<pidof kubelet>/fd/<descriptor of opened subpath>`.
With kubelet running in a container, `/proc/xxx/fd/yy` on the host contains path that works only inside the container, i.e. `/rootfs/path/to/subpath` and thus any bind-mount on the host fails.
Solution:
- safely open the subpath and gets its device ID and inode number
- blindly bind-mount the subpath to `/var/lib/kubelet/pods/<uid>/volume-subpaths/<name of container>/<id of mount>`. This is potentially unsafe, because user can change the subpath source to a link to a bad place (say `/run/docker.sock`) just before the bind-mount.
- get device ID and inode number of the destination. Typical users can't modify this file, as it lies on /var/lib/kubelet on the host.
- compare these device IDs and inode numbers.
**Which issue(s) this PR fixes**
Fixes#61456
**Special notes for your reviewer**:
The PR contains some refactoring of `doBindSubPath` to extract the common code. New `doNsEnterBindSubPath` is added for the nsenter related parts.
**Release note**:
```release-note
NONE
```
Kubelet should not resolve symlinks outside of mounter interface.
Only mounter interface knows, how to resolve them properly on the host.
As consequence, declaration of SafeMakeDir changes to simplify the
implementation:
from SafeMakeDir(fullPath string, base string, perm os.FileMode)
to SafeMakeDir(subdirectoryInBase string, base string, perm os.FileMode)
Automatic merge from submit-queue (batch tested with PRs 62657, 63278, 62903, 63375). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Add more volume types in e2e and fix part of them.
**What this PR does / why we need it**:
- Add dir-link/dir-bindmounted/dir-link-bindmounted/bockfs volume types for e2e tests.
- Fix fsGroup related e2e tests partially.
- Return error if we cannot resolve volume path.
- Because we should not fallback to volume path, if it's a symbolic link, we may get wrong results.
To safely set fsGroup on local volume, we need to implement these two methods correctly for all volume types both on the host and in container:
- get volume path kubelet can access
- paths on the host and in container are different
- get mount references
- for directories, we cannot use its mount source (device field) to identify mount references, because directories on same filesystem have same mount source (e.g. tmpfs), we need to check filesystem's major:minor and directory root path on it
Here is current status:
| | (A) volume-path (host) | (B) volume-path (container) | (C) mount-refs (host) | (D) mount-refs (container) |
| --- | --- | --- | --- | --- |
| (1) dir | OK | FAIL | FAIL | FAIL |
| (2) dir-link | OK | FAIL | FAIL | FAIL |
| (3) dir-bindmounted | OK | FAIL | FAIL | FAIL |
| (4) dir-link-bindmounted | OK | FAIL | FAIL | FAIL |
| (5) tmpfs| OK | FAIL | FAIL | FAIL |
| (6) blockfs| OK | FAIL | OK | FAIL |
| (7) block| NOTNEEDED | NOTNEEDED | NOTNEEDED | NOTNEEDED |
| (8) gce-localssd-scsi-fs| NOTTESTED | NOTTESTED | NOTTESTED | NOTTESTED |
- This PR uses `nsenter ... readlink` to resolve path in container as @msau42 @jsafrane [suggested](https://github.com/kubernetes/kubernetes/pull/61489#pullrequestreview-110032850). This fixes B1:B6 and D6, , the rest will be addressed in https://github.com/kubernetes/kubernetes/pull/62102.
- C5:D5 marked `FAIL` because `tmpfs` filesystems can share same mount source, we cannot rely on it to check mount references. e2e tests passes due to we use unique mount source string in tests.
- A7:D7 marked `NOTNEEDED` because we don't set fsGroup on block devices in local plugin. (TODO: Should we set fsGroup on block device?)
- A8:D8 marked `NOTTESTED` because I didn't test it, I leave it to `pull-kubernetes-e2e-gce`. I think it should be same as `blockfs`.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
In our CI, we run kubekins image for most of the jobs. This is a
debian image with upstart and does not enable systemd. So we should:
* Bailout if any binary is missing other than systemd-run.
* SupportsSystemd should check the binary path to correctly
identify if the systemd-run is present or not
* Pass the errors back to the callers so kubelet is forced to
fail early when there is a problem. We currently assume
that all binaries are in the root directory by default which
is wrong.
Users must not be allowed to step outside the volume with subPath.
Therefore the final subPath directory must be "locked" somehow
and checked if it's inside volume.
On Windows, we lock the directories. On Linux, we bind-mount the final
subPath into /var/lib/kubelet/pods/<uid>/volume-subpaths/<container name>/<subPathName>,
it can't be changed to symlink user once it's bind-mounted.
Kubelet makes sure that /var/lib/kubelet is rshared when it starts.
If not, it bind-mounts it with rshared propagation to containers
that mount volumes to /var/lib/kubelet can benefit from mount propagation.
Automatic merge from submit-queue
Run mount in its own systemd scope.
Kubelet needs to run /bin/mount in its own cgroup.
- When kubelet runs as a systemd service, "systemctl restart kubelet" may kill all processes in the same cgroup and thus terminate fuse daemons that are needed for gluster and cephfs mounts.
- When kubelet runs in a docker container, restart of the container kills all fuse daemons started in the container.
Killing fuse daemons is bad, it basically unmounts volumes from running pods.
This patch runs mount via "systemd-run --scope /bin/mount ...", which makes sure that any fuse daemons are forked in its own systemd scope (= cgroup) and they will survive restart of kubelet's systemd service or docker container.
This helps with #34965
As a downside, each new fuse daemon will run in its own transient systemd service and systemctl output may be cluttered.
@kubernetes/sig-storage-pr-reviews
@kubernetes/sig-node-pr-reviews
```release-note
fuse daemons for GlusterFS and CephFS are now run in their own systemd scope when Kubernetes runs on a system with systemd.
```
Kubelet needs to run /bin/mount in its own cgroup.
- When kubelet runs as a systemd service, "systemctl restart kubelet" may kill
all processes in the same cgroup and thus terminate fuse daemons that are
needed for gluster and cephfs mounts.
- When kubelet runs in a docker container, restart of the container kills all
fuse daemons started in the container.
Killing fuse daemons is bad, it basically unmounts volumes from running pods.
This patch runs mount via "systemd-run --scope /bin/mount ...", which makes
sure that any fuse daemons are forked in its own systemd scope (= cgroup) and
they will survive restart of kubelet's systemd service or docker container.
As a downside, each new fuse daemon will run in its own transient systemd
service and systemctl output may be cluttered.
NsEnterMounter should not stop parsing findmnt output on the first space but
on the last one, just in case the mount point name itself contains a space.
Added IsNotMountPoint method to mount utils (pkg/util/mount/mount.go)
Added UnmountMountPoint method to volume utils (pkg/volume/util/util.go)
Call UnmountMountPoint method from local storage (pkg/volume/local/local.go)
IsLikelyNotMountPoint behavior was not modified, so the logic/behavior for UnmountPath is not modified
Currently kubelet volume management works on the concept of desired
and actual world of states. The volume manager periodically compares the
two worlds and perform volume mount/unmount and/or attach/detach
operations. When kubelet restarts, the cache of those two worlds are
gone. Although desired world can be recovered through apiserver, actual
world can not be recovered which may cause some volumes cannot be cleaned
up if their information is deleted by apiserver. This change adds the
reconstruction of the actual world by reading the pod directories from
disk. The reconstructed volume information is added to both desired
world and actual world if it cannot be found in either world. The rest
logic would be as same as before, desired world populator may clean up
the volume entry if it is no longer in apiserver, and then volume
manager should invoke unmount to clean it up.
In nsenter_mount.go/isLikelyNotMountPoint function, the returned output
from findmnt command misses the last letter. Modify the code to make sure
that output has the full target path. fix#26421#25056#22911
Kubelet was not able to mount volumes when running inside a container and
using nsenter mounter,
NsenterMounter.IsLikelyNotMountPoint() should return ErrNotExist when the
checked directory does not exists as the regular mounted does this and
some volume plugins depend on this behavior.