Automatic merge from submit-queue (batch tested with PRs 61203, 61071). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Fix issue with race condition during pod deletion
This PR fixes two issues
1. When desired_state_populator removes podvolume state, it should check
whether the actual state already has the volume before deleting it to
make sure actual state has a chance to add the volume into the state
2. When checking podVolume still exists, it not only checks the actual
state, but also the volume disk directory because actual state might not
reflect the real world when kubelet starts.
fixes issue #60645
This PR fixes two issues
1. When desired_state_populator removes podvolume state, it should check
whether the actual state already has the volume before deleting it to
make sure actual state has a chance to add the volume into the state
2. When checking podVolume still exists, it not only checks the actual
state, but also the volume disk directory because actual state might not
reflect the real world when kubelet starts.
Automatic merge from submit-queue (batch tested with PRs 60888, 61225). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Mark reconstructed volumes as reported InUse
When a newly started kubelet finds a directory where a volume should be,
it can be fairly confident that the volume was mounted by previous kubelet
and therefore the volume must have been in node.status.volumesInUse.
Therefore we can mark reconstructed volumes as already reported so
subsequent reconcile() can fix the directory and put the mounted volume
into actual state of world.
Fixes: #60645
**Release note**:
```release-note
NONE
```
/sig storage
/sig node
cc: @gnufied @jingxu97
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Add missing container-runtime "remote" option
**What this PR does / why we need it**:
Added the "remote" option to the auto-generated documentation for the
`--container-runtime` flag.
The kubelet flag `--container-runtime` lists the possible values as part of the auto-generated documentation but is missing the "remote" possibility.
**Which issue(s) this PR fixes** :
Fixes#60992
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Fixes the races around devicemanager Allocate() and endpoint deletion.
There is a race in predicateAdmitHandler Admit() that getNodeAnyWayFunc()
could get Node with non-zero deviceplugin resource allocatable for a
non-existing endpoint. That race can happen when a device plugin fails,
but is more likely when kubelet restarts as with the current registration
model, there is a time gap between kubelet restart and device plugin
re-registration. During this time window, even though devicemanager could
have removed the resource initially during GetCapacity() call, Kubelet
may overwrite the device plugin resource capacity/allocatable with the
old value when node update from the API server comes in later. This
could cause a pod to be started without proper device runtime config set.
To solve this problem, introduce endpointStopGracePeriod. When a device
plugin fails, don't immediately remove the endpoint but set stopTime in
its endpoint. During kubelet restart, create endpoints with stopTime set
for any checkpointed registered resource. The endpoint is considered to be
in stopGracePeriod if its stoptime is set. This allows us to track what
resources should be handled by devicemanager during the time gap.
When an endpoint's stopGracePeriod expires, we remove the endpoint and
its resource. This allows the resource to be exported through other channels
(e.g., by directly updating node status through API server) if there is such
use case. Currently endpointStopGracePeriod is set as 5 minutes.
Given that an endpoint is no longer immediately removed upon disconnection,
mark all its devices unhealthy so that we can signal the resource allocatable
change to the scheduler to avoid scheduling more pods to the node.
When a device plugin endpoint is in stopGracePeriod, pods requesting the
corresponding resource will fail admission handler.
Tested:
Ran GPUDevicePlugin e2e_node test 100 times and all passed now.
**What this PR does / why we need it**:
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes https://github.com/kubernetes/kubernetes/issues/60176
**Special notes for your reviewer**:
**Release note**:
```release-note
Fixes the races around devicemanager Allocate() and endpoint deletion.
```
There is a race in predicateAdmitHandler Admit() that getNodeAnyWayFunc()
could get Node with non-zero deviceplugin resource allocatable for a
non-existing endpoint. That race can happen when a device plugin fails,
but is more likely when kubelet restarts as with the current registration
model, there is a time gap between kubelet restart and device plugin
re-registration. During this time window, even though devicemanager could
have removed the resource initially during GetCapacity() call, Kubelet
may overwrite the device plugin resource capacity/allocatable with the
old value when node update from the API server comes in later. This
could cause a pod to be started without proper device runtime config set.
To solve this problem, introduce endpointStopGracePeriod. When a device
plugin fails, don't immediately remove the endpoint but set stopTime in
its endpoint. During kubelet restart, create endpoints with stopTime set
for any checkpointed registered resource. The endpoint is considered to be
in stopGracePeriod if its stoptime is set. This allows us to track what
resources should be handled by devicemanager during the time gap.
When an endpoint's stopGracePeriod expires, we remove the endpoint and
its resource. This allows the resource to be exported through other channels
(e.g., by directly updating node status through API server) if there is such
use case. Currently endpointStopGracePeriod is set as 5 minutes.
Given that an endpoint is no longer immediately removed upon disconnection,
mark all its devices unhealthy so that we can signal the resource allocatable
change to the scheduler to avoid scheduling more pods to the node.
When a device plugin endpoint is in stopGracePeriod, pods requesting the
corresponding resource will fail admission handler.
When a newly started kubelet finds a directory where a volume should be,
it can be fairly confident that the volume was mounted by previous kubelet
and therefore the volume must have been in node.status.volumesInUse.
Therefore we can mark reconstructed volumes as already reported so
subsequent reconcile() can fix the directory and put the mounted volume
into actual state of world.
Users must not be allowed to step outside the volume with subPath.
Therefore the final subPath directory must be "locked" somehow
and checked if it's inside volume.
On Windows, we lock the directories. On Linux, we bind-mount the final
subPath into /var/lib/kubelet/pods/<uid>/volume-subpaths/<container name>/<subPathName>,
it can't be changed to symlink user once it's bind-mounted.
Automatic merge from submit-queue (batch tested with PRs 60159, 60731, 60720, 60736, 60740). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Promote LocalStorageCapacityIsolation feature to beta
The LocalStorageCapacityIsolation feature added a new resource type ResourceEphemeralStorage "ephemeral-storage" so that this resource can be allocated, limited, and consumed as the same way as CPU/memory. All the features related to resource management (resource request/limit, quota, limitrange) are available for local ephemeral storage.
This local ephemeral storage represents the storage for root file system, which will be consumed by containers' writtable layer and logs. Some volumes such as emptyDir might also consume this storage.
Fixes issue #60160
This PR also fixes data race issues discovered after open the feature gate. Basically setNodeStatus function in kubelet could be called by multiple threads so the data needs lock protection. Put the fix with this PR for easy testing.
**Release note**:
```release-note
ACTION REQUIRED: LocalStorageCapacityIsolation feature is beta and enabled by default.
```
The LocalStorageCapacityIsolation feature added a new resource type
ResourceEphemeralStorage "ephemeral-storage" so that this resource can
be allocated, limited, and consumed as the same way as CPU/memory. All
the features related to resource management (resource request/limit, quota, limitrange) are avaiable for local ephemeral storage.
This local ephemeral storage represents the storage for root file system, which will be consumed by containers' writtable layer and logs. Some volumes such as emptyDir might also consume this storage.
Automatic merge from submit-queue (batch tested with PRs 58171, 58036, 60540). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
dockershim: Return Labels as Info in ImageStatus.
c6ddc749e8 added an Info field to
ImageStatusResponse when Verbose is true. This makes the image's
Labels available in that field, rather than unconditionally returning
an empty map.
**What this PR does / why we need it**:
This PR exposes an image's `Labels` through the CRI. In particular, I want this so I can write an `ImageService` wrapper that delegates all operations to a real `ImageService` but also, when the right `Labels`, ensures any needed [nix store](https://nixos.org/nix/) paths are present on the system when an image is pulled, enabling users to use nix for package distribution while still using containers for isolation and kubernetes for orchestration. In general, though, this should be useful for anything that wants to know about an image's `Labels`
**Special notes for your reviewer**:
I'd prefer to put this change into the `Image` protobuf type instead of putting it into `Info` (gated by `Verbose` or not, available in other requests like `ListImages` or not), but that would be a change to the protocol and it seems `Info` was introduced exactly for this purpose. If it's acceptable to put this into `Image`, I'll rework this.
If this change is acceptable, I will also do the work for `cri-o`, `rktlet`, `frakti`, and `cri-containerd` where applicable.
I have started the process for my employer to sign on to the CLA. I don't have reason to expect it to take long, but because there is more work to do if this change is desired I'd prefer if we can start review before that is completed.
**Release note**:
```release-note
dockershim now makes an Image's Labels available in the Info field of ImageStatusResponse
```
Automatic merge from submit-queue (batch tested with PRs 60342, 60505, 59218, 52900, 60486). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Fixed log calls in VolumeManager
Use glog.Infof() instead of glog.Info()
**Release note**:
```release-note
NONE
```
/sig storage
/sig node
Automatic merge from submit-queue (batch tested with PRs 60376, 55584, 60358, 54631, 60291). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
fix "make test"
Before this pr, we get this in linux:
```
$ make test
Running tests for APIVersion: v1,admissionregistration.k8s.io/v1alpha1,admissionregistration.k8s.io/v1beta1,admission.k8s.io/v1beta1,apps/v1beta1,apps/v1beta2,apps/v1,authentication.k8s.io/v1,authentication.k8s.io/v1beta1,authorization.k8s.io/v1,authorization.k8s.io/v1beta1,autoscaling/v1,autoscaling/v2beta1,batch/v1,batch/v1beta1,batch/v2alpha1,certificates.k8s.io/v1beta1,extensions/v1beta1,events.k8s.io/v1beta1,imagepolicy.k8s.io/v1alpha1,networking.k8s.io/v1,policy/v1beta1,rbac.authorization.k8s.io/v1,rbac.authorization.k8s.io/v1beta1,rbac.authorization.k8s.io/v1alpha1,scheduling.k8s.io/v1alpha1,settings.k8s.io/v1alpha1,storage.k8s.io/v1beta1,storage.k8s.io/v1,storage.k8s.io/v1alpha1,
+++ [0224 16:10:13] Running tests without code coverage
can't load package: package k8s.io/kubernetes/pkg/kubelet/winstats: build constraints exclude all Go files in /home/fujitsu/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/kubelet/winstats
!!! [0224 16:10:15] Call tree:
!!! [0224 16:10:15] 1: hack/make-rules/test.sh:402 runTests(...)
Makefile:182: recipe for target 'test' failed
make: *** [test] Error 1
```
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 60376, 55584, 60358, 54631, 60291). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
move pod-check forward
**What this PR does / why we need it**:
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 60236, 60332, 57375, 60451, 57408). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
improve code comment
**What this PR does / why we need it**:
improve code comment
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #
**Special notes for your reviewer**:
**Release note**:
```NONE
```
Automatic merge from submit-queue (batch tested with PRs 60236, 60332, 57375, 60451, 57408). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
kube-scheduler: Support extender managed extended resources in kube-scheduler
**What this PR does / why we need it**:
This is the continuation of https://github.com/kubernetes/kubernetes/pull/58851.
- This PR adds new extender configurations in scheduler policy config.
- A set of extended resource names can be specified in an extender config. They are the resources that are managed by the extender. The scheduler will only send pods to the extender if the pod requests at least one of the extended resources in the set.
- An `IgnoredByScheduler` flag can be set along with each of such resources. If this flag is set to true, the scheduler will not check the resource in the `PodFitsResources` predicate.
- This PR also changes the default behavior of the `PodFitsResources` predicate. Now, by default, `PodFitsResources` will ignore the extended resources that are not in node status. This is required to support extender managed extended resources (including cluster-level resources) on node. Note that in kube-scheduler we override the default behavior by not ignoring such missing extended resources.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes https://github.com/kubernetes/kubernetes/issues/53616https://github.com/kubernetes/kubernetes/issues/58850
**Special notes for your reviewer**:
**Release note**:
```
Support extender managed extended resources in kube-scheduler
```
Automatic merge from submit-queue (batch tested with PRs 53689, 56880, 55856, 59289, 60249). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Add test/typecheck, a fast typecheck for all build platforms.
Add test/typecheck, a fast typecheck for all build platforms.
Most of the time spent compiling is spent optimizing and linking
binary code. Most errors occur at the syntax or semantic (type) layers.
Go's compiler is importable as a normal package, so we can do fast
syntax and type checking for the 10 platforms we build on.
This currently takes ~6 minutes of CPU time (parallelized).
This makes presubmit cross builds superfluous, since it should catch
most cross-build breaks (generally Unix and 64-bit assumptions).
Example output:
```$ time go run test/typecheck/main.go
type-checking: linux/amd64, windows/386, darwin/amd64, linux/arm,
linux/386, windows/amd64, linux/arm64, linux/ppc64le, linux/s390x, darwin/386
ERROR(windows/amd64) pkg/proxy/ipvs/proxier.go:1708:27: ENXIO not declared by package unix
ERROR(windows/386) pkg/proxy/ipvs/proxier.go:1708:27: ENXIO not declared by package unix
real 0m45.083s
user 6m15.504s
sys 1m14.000s
```
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 53689, 56880, 55856, 59289, 60249). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Harden kube-proxy for unmatched IP versions
**What this PR does / why we need it**:
This PR makes kube-proxy omits & logs & emits event for unmatched IP versions configuration (IPv6 address in IPv4 mode or IPv4 address in IPv6 mode).
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes#57219
**Special notes for your reviewer**:
**Release note**:
```release-note
Fix the issue in kube-proxy iptables/ipvs mode to properly handle incorrect IP version.
```
Automatic merge from submit-queue (batch tested with PRs 53689, 56880, 55856, 59289, 60249). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
simplify defer statement
**What this PR does / why we need it**:
simplify defer statement
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
kubelet: setup WindowsContainerResources for windows containers
**What this PR does / why we need it**:
This PR setups WindowsContainerResources for windows containers. It implements proposal here: https://github.com/kubernetes/community/pull/1510.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes#56734
**Special notes for your reviewer**:
**Release note**:
```release-note
WindowsContainerResources is set now for windows containers
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Add FailedPostStartHook error message.
**What this PR does / why we need it**:
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes#54671
**Special notes for your reviewer**:
/cc @derekwaynecarr
cc @lovejoy @OJezu
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 60157, 60337, 60246, 59714, 60467). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
backoff runtime errors in kubelet sync loop
The runtime health check can race with PLEG's first relist, and this
often results in an unnecessary 5 second wait during Kubelet bootstrap.
This change aims to improve the performance.
```release-note
NONE
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
code-gen: output golint compliant 'Generated by' comment
New PR instead of reopening #58115 because /reopen did not work.
This won't be ready to merge until the upstream https://github.com/kubernetes/gengo/pull/94 merges. Once that merges, the second commit will be changed to godep-save.sh and update-staging-godeps.sh, and the last commit will be changed to update-all.sh
The failing test is due to the upstream changes not being merged yet
```devel-release-note
Go code generated by the code generators will now have a comment which allows them to be easily identified by golint
```
Fixes#56489
Automatic merge from submit-queue (batch tested with PRs 60011, 59256, 59293, 60328, 60367). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Add CPU/Memory pod stats for CRI stats.
For https://github.com/kubernetes/features/issues/286.
Add CPU and memory stats for pod.
@kubernetes/sig-node-pr-reviews
/cc @dashpole @yujuhong @abhi @yguo0905
Signed-off-by: Lantao Liu <lantaol@google.com>
**What this PR does / why we need it**:
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
Summary API will include pod CPU and Memory stats for CRI container runtime.
```
Automatic merge from submit-queue (batch tested with PRs 50724, 59025, 59710, 59404, 59958). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
some typo
**What this PR does / why we need it**:
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
```
Automatic merge from submit-queue (batch tested with PRs 50724, 59025, 59710, 59404, 59958). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Add detailed err in ensure docker process error
Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>
**What this PR does / why we need it**:
Add detailed error.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
use GetUniqueVolumeNameFromSpec instead of implementing it manually for kubelet volume manager
**What this PR does / why we need it**:
`volumeName` is only used for attachable plugin, so we should resolve it inside the `if` statement. Besides, we can use the already exist `GetUniqueVolumeNameFromSpec` mthod instead of invoking `GetVolumeName` and `GetUniqueVolumeName` manually.
**Release note**:
```release-note
NONE
```
/sig storage
/kind cleanup
Automatic merge from submit-queue (batch tested with PRs 60435, 60334, 60458, 59301, 60125). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
dockershim: don't check pod IP in StopPodSandbox
We're about to tear the container down, there's no point. It also suppresses
an annoying error message due to kubelet stupidity that causes multiple
parallel calls to StopPodSandbox for the same sandbox.
docker_sandbox.go:355] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "docker-registry-1-deploy_default": Unexpected command output nsenter: cannot open /proc/22646/ns/net: No such file or directory
1) A first StopPodSandbox() request triggered by SyncLoop(PLEG) for
a ContainerDied event calls into TearDownPod() and thus the network
plugin. Until this completes, networkReady=true for the
sandbox.
2) A second StopPodSandbox() request triggered by SyncLoop(REMOVE)
calls PodSandboxStatus() and calls into the network plugin to read
the IP address because networkReady=true
3) The first request exits the network plugin, sets networReady=false,
and calls StopContainer() on the sandbox. This destroys the network
namespace.
4) The second request finally gets around to running nsenter but
the network namespace is already destroyed. It returns an error
which is logged by getIP().
```release-note
NONE
```
@yujuhong @freehan