Automatic merge from submit-queue (batch tested with PRs 59214, 65330). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Migrate cpumanager to use checkpointing manager
**What this PR does / why we need it**:
This PR migrates `cpumanager` to use new kubelet level node checkpointing feature (#56040) to decrease code redundancy and improve consistency.
**Which issue(s) this PR fixes**:
Fixes#58339
**Notes**:
At point of submitting PR the most straightforward approach was used - `state_checkpoint` implementation of `State` interface was added. However, with checkpointing implementation there might be no point to keep `State` interface and just use single implementation with checkpoint backend and in case of different backend than filestore needed just supply `cpumanager` with custom `CheckpointManager` implementation.
/kind feature
/sig node
cc @flyingcougar @ConnorDoyle
Automatic merge from submit-queue (batch tested with PRs 64142, 64426, 62910, 63942, 64548). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Clean up fake mounters.
**What this PR does / why we need it**:
Fixes https://github.com/kubernetes/kubernetes/issues/61502
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #
**Special notes for your reviewer**:
list of fake mounters:
- (keep) pkg/util/mount.FakeMounter
- (removed) pkg/kubelet/cm.fakeMountInterface:
- (inherit from mount.FakeMounter) pkg/util/mount.fakeMounter
- (inherit from mount.FakeMounter) pkg/util/removeall.fakeMounter
- (removed) pkg/volume/host_path.fakeFileTypeChecker
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 65230, 57355, 59174, 63698, 63659). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
TODO has already been implemented
**What this PR does / why we need it**:
TODO has already been implemented, remove the TODO tag.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
```NONE
Automatic merge from submit-queue (batch tested with PRs 63348, 63839, 63143, 64447, 64567). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Containerized subpath
**What this PR does / why we need it**:
Containerized kubelet needs a different implementation of `PrepareSafeSubpath` than kubelet running directly on the host.
On the host we safely open the subpath and then bind-mount `/proc/<pidof kubelet>/fd/<descriptor of opened subpath>`.
With kubelet running in a container, `/proc/xxx/fd/yy` on the host contains path that works only inside the container, i.e. `/rootfs/path/to/subpath` and thus any bind-mount on the host fails.
Solution:
- safely open the subpath and gets its device ID and inode number
- blindly bind-mount the subpath to `/var/lib/kubelet/pods/<uid>/volume-subpaths/<name of container>/<id of mount>`. This is potentially unsafe, because user can change the subpath source to a link to a bad place (say `/run/docker.sock`) just before the bind-mount.
- get device ID and inode number of the destination. Typical users can't modify this file, as it lies on /var/lib/kubelet on the host.
- compare these device IDs and inode numbers.
**Which issue(s) this PR fixes**
Fixes#61456
**Special notes for your reviewer**:
The PR contains some refactoring of `doBindSubPath` to extract the common code. New `doNsEnterBindSubPath` is added for the nsenter related parts.
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Correct kill logic for pod processes
Correct the kill logic for processes in the pod's cgroup. os.FindProcess() does not check whether the process exists on POSIX systems.
Automatic merge from submit-queue (batch tested with PRs 60200, 63623, 63406). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Apply pod name and namespace labels for pod cgroup for cadvisor metrics
**What this PR does / why we need it**:
1. Enable Prometheus users to determine usage by pod name and namespace for pod cgroup sandbox.
1. Label cAdvisor metrics for pod cgroups by pod name and namespace.
1. Aligns with kubelet stats summary endpoint pod cpu and memory stats.
**Special notes for your reviewer**:
This provides parity with the summary API enhancements done here:
https://github.com/kubernetes/kubernetes/pull/55969
**Release note**:
```release-note
Apply pod name and namespace labels to pod cgroup in cAdvisor metrics
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
kubelet - Remove unused code
**What this PR does / why we need it**:
Looks like we have a bunch of unused methods. Let's clean them up
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Use a []string for CgroupName, which is a more accurate internal representation
**What this PR does / why we need it**:
This is purely a refactoring and should bring no essential change in behavior.
It does clarify the cgroup handling code quite a bit.
It is preparation for further changes we might want to do in the cgroup hierarchy. (But it's useful on its own, so even if we don't do any, it should still be considered.)
**Special notes for your reviewer**:
The slice of strings more precisely captures the hierarchic nature of the cgroup paths we use to represent pods and their groupings.
It also ensures we're reducing the chances of passing an incorrect path format to a cgroup driver that requires a different path naming, since now explicit conversions are always needed.
The new constructor `NewCgroupName` starts from an existing `CgroupName`, which enforces a hierarchy where a root is always needed. It also performs checking on the component names to ensure invalid characters ("/" and "_") are not in use.
A `RootCgroupName` for the top of the cgroup hierarchy tree is introduced.
This refactor results in a net reduction of around 30 lines of code,
mainly with the demise of ConvertCgroupNameToSystemd which had fairly
complicated logic in it and was doing just too many things.
There's a small TODO in a helper `updateSystemdCgroupInfo` that was introduced to make this commit possible. That logic really belongs in libcontainer, I'm planning to send a PR there to include it there. (The API already takes a field with that information, only that field is only processed in cgroupfs and not systemd driver, we should fix that.)
Tested: By running the e2e-node tests on both Ubuntu 16.04 (with cgroupfs driver) and CentOS 7 (with systemd driver.)
**NOTE**: I only tested this with dockershim, we should double-check that this works with the CRI endpoints too, both in cgroupfs and systemd modes.
/assign @derekwaynecarr
/assign @dashpole
/assign @Random-Liu
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
refactor device plugin grpc dial with dialcontext
**What this PR does / why we need it**:
Refactor grpc `dial` with `dialContext` as `grpc.WithTimeout` has been deprecated by:
> use DialContext and context.WithTimeout instead.
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 62657, 63278, 62903, 63375). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Add more volume types in e2e and fix part of them.
**What this PR does / why we need it**:
- Add dir-link/dir-bindmounted/dir-link-bindmounted/bockfs volume types for e2e tests.
- Fix fsGroup related e2e tests partially.
- Return error if we cannot resolve volume path.
- Because we should not fallback to volume path, if it's a symbolic link, we may get wrong results.
To safely set fsGroup on local volume, we need to implement these two methods correctly for all volume types both on the host and in container:
- get volume path kubelet can access
- paths on the host and in container are different
- get mount references
- for directories, we cannot use its mount source (device field) to identify mount references, because directories on same filesystem have same mount source (e.g. tmpfs), we need to check filesystem's major:minor and directory root path on it
Here is current status:
| | (A) volume-path (host) | (B) volume-path (container) | (C) mount-refs (host) | (D) mount-refs (container) |
| --- | --- | --- | --- | --- |
| (1) dir | OK | FAIL | FAIL | FAIL |
| (2) dir-link | OK | FAIL | FAIL | FAIL |
| (3) dir-bindmounted | OK | FAIL | FAIL | FAIL |
| (4) dir-link-bindmounted | OK | FAIL | FAIL | FAIL |
| (5) tmpfs| OK | FAIL | FAIL | FAIL |
| (6) blockfs| OK | FAIL | OK | FAIL |
| (7) block| NOTNEEDED | NOTNEEDED | NOTNEEDED | NOTNEEDED |
| (8) gce-localssd-scsi-fs| NOTTESTED | NOTTESTED | NOTTESTED | NOTTESTED |
- This PR uses `nsenter ... readlink` to resolve path in container as @msau42 @jsafrane [suggested](https://github.com/kubernetes/kubernetes/pull/61489#pullrequestreview-110032850). This fixes B1:B6 and D6, , the rest will be addressed in https://github.com/kubernetes/kubernetes/pull/62102.
- C5:D5 marked `FAIL` because `tmpfs` filesystems can share same mount source, we cannot rely on it to check mount references. e2e tests passes due to we use unique mount source string in tests.
- A7:D7 marked `NOTNEEDED` because we don't set fsGroup on block devices in local plugin. (TODO: Should we set fsGroup on block device?)
- A8:D8 marked `NOTTESTED` because I didn't test it, I leave it to `pull-kubernetes-e2e-gce`. I think it should be same as `blockfs`.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
The slice of strings more precisely captures the hierarchic nature of
the cgroup paths we use to represent pods and their groupings.
It also ensures we're reducing the chances of passing an incorrect path
format to a cgroup driver that requires a different path naming, since
now explicit conversions are always needed.
The new constructor NewCgroupName starts from an existing CgroupName,
which enforces a hierarchy where a root is always needed. It also
performs checking on the component names to ensure invalid characters
("/" and "_") are not in use.
A RootCgroupName for the top of the cgroup hierarchy tree is introduced.
This refactor results in a net reduction of around 30 lines of code,
mainly with the demise of ConvertCgroupNameToSystemd which had fairly
complicated logic in it and was doing just too many things.
There's a small TODO in a helper updateSystemdCgroupInfo that was
introduced to make this commit possible. That logic really belongs in
libcontainer, I'm planning to send a PR there to include it there.
(The API already takes a field with that information, only that field is
only processed in cgroupfs and not systemd driver, we should fix that.)
Tested by running the e2e-node tests on both Ubuntu 16.04 (with cgroupfs
driver) and CentOS 7 (with systemd driver.)
Automatic merge from submit-queue (batch tested with PRs 58474, 60034, 62101, 63198). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
avoid race condition in device manager and plugin startup/shutdown: wait for goroutines
**What this PR does / why we need it**:
Commit 1325c2f worked around issue #59488, but it is still worthwhile to fix the underlying root cause properly.
**Which issue(s) this PR fixes**:
Fixes#59488
**Special notes for your reviewer**:
This is an alternative to PR #59861, which used a different approach. Personally I tend to prefer this one now.
**Release note**:
```release-note
NONE
```
/sig node
/area hw-accelerators
/assign vikaschoudhary16
Automatic merge from submit-queue (batch tested with PRs 61962, 58972, 62509, 62606). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
kubelet: move QOSReserved from experimental to alpha feature gate
Fixes https://github.com/kubernetes/kubernetes/issues/61665
**Release note**:
```release-note
The --experimental-qos-reserve kubelet flags is replaced by the alpha level --qos-reserved flag or QOSReserved field in the kubeletconfig and requires the QOSReserved feature gate to be enabled.
```
/sig node
/assign @derekwaynecarr
/cc @mtaufen
A flaky test exposed a race condition where shutting down one server
instance broke the startup of the next instance when using the same
socket path. Commit 1325c2f8be removed the reuse of the same socket
path and thus avoided the issue.
But the real fix is to ensure that the listening socket is really
closed once Stop returns. Two solutions were proposed in
https://github.com/grpc/grpc-go/issues/1861:
- waiting for the goroutine to complete
- closing the socket
The former is done here because it's cleaner to not keep lingering
goroutines. While at it, the Stop methods are made idempotent (similar
to e.g. Close on a socket) and no longer crash when called without
prior Start.
Fixes https://github.com/kubernetes/kubernetes/issues/59488
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
move the const to the place it should be
**What this PR does / why we need it**:
move the const to the place it should be
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
```