Automatic merge from submit-queue (batch tested with PRs 51423, 53880). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Disable ImageGC when high threshold is set to 100
**What this PR does / why we need it**:
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*:
fixes#51268
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 61351, 61294). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Fix cpu cfs quota flag with pod cgroups
**What this PR does / why we need it**:
It fixes the cpu-cfs-quota flag in the kubelet when pod cgroups are enabled.
**Which issue(s) this PR fixes**
Fixes#61293
**Special notes for your reviewer**:
This is a regression reported by some of our users that disable cpu quota enforcement.
**Release note**:
```release-note
Fix regression where kubelet --cpu-cfs-quota flag did not work when --cgroups-per-qos was enabled
```
Automatic merge from submit-queue (batch tested with PRs 61284, 61119, 61201). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Prevent garbage collector from attempting to sync with 0 resources
**What this PR does / why we need it**:
As of #55259 we enabled garbagecollector.GetDeletableResources to return partial discovery results (including an empty set of discovery results).
This had the unintended consequence of allowing the garbage collector to enter a blocked state that can only be fixed by restarting.
From [this comment](https://github.com/kubernetes/kubernetes/issues/60037#issuecomment-372801088):
> 1. The Sync function periodically calls GetDeletableResources
>
> 2. According to the comment above GetDeletableResources, All discovery errors are considered temporary. Upon encountering any error, GetDeletableResources will log and return any discovered resources it was able to process (which may be none)., an error in discovery causes the discovery client to no longer discover resources in the cluster, but instead of failing and returning an error, it simply logs the error as garbagecollector.go:601] failed to discover preferred resources: %vthe server was unable to return a response in the time allotted, but may still be processing the request and returns an empty list of resources
>
> 3. The Sync function, upon recieving an empty resource list from discovery, detects that the resources have changed, and calls resyncMonitors, which calls dependencyGraphBuilder.syncMonitors with map[] as the argument as shown in the log as garbagecollector.go:189] syncing garbage collector with updated resources from discovery: map[], which sets the list of monitors to an empty list because it thinks there are no resources to monitor.
>
> 4. Lastly the Sync function calls controller.WaitForCacheSync, which calls cache.WaitForCacheSync, which will continually retry the garbagecollector.IsSynced function until it returns true, but it will always return false because len(gb.monitors) is 0.
This PR prevents that specific race condition from arising.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes#60037
**Release note**:
```release-note
Fix bug allowing garbage collector to enter a broken state that could only be fixed by restarting the controller-manager.
```
Automatic merge from submit-queue (batch tested with PRs 61284, 61119, 61201). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Fix creation of subpath with SUID/SGID directories.
SafeMakeDir() should apply SUID/SGID/sticky bits to the directory it creates.
Fixes#61283
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Added unschedulable taint
Signed-off-by: Da K. Ma <klaus1982.cn@gmail.com>
**What this PR does / why we need it**:
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
part of #59194; fixes#61050
**Release note**:
```release-note
When `TaintNodesByCondition` enabled, added `node.kubernetes.io/unschedulable:NoSchedule`
taint to the node if `spec.Unschedulable` is true.
When `ScheduleDaemonSetPods` enabled, `node.kubernetes.io/unschedulable:NoSchedule`
toleration is added automatically to DaemonSet Pods; so the `unschedulable` field of
a node is not respected by the DaemonSet controller.
```
Automatic merge from submit-queue (batch tested with PRs 60978, 60985). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Backoff only when failed pod shows up
**What this PR does / why we need it**:
Upon introducing the backoff policy we started to delay sync runs for the job when it failed several times before. This leads to failed jobs not reporting status right away in cases that are not related to failed pods, eg. a successful run. This PR ensures the backoff is applied only when `updatePod` receives a failed pod.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes#59918#59527
/assign @janetkuo @kow3ns
**Release note**:
```release-note
None
```
Automatic merge from submit-queue (batch tested with PRs 60978, 60985). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Fix use of "-w" flag to iptables-restore
iptables accepts "-w5" but iptables-restore requires "-w 5", so kube-proxy is currently broken for people with an iptables-restore new enough that kube-proxy tries to use the new flags.
Fixes#58956
**Release note**:
```release-note
Fixed kube-proxy to work correctly with iptables 1.6.2 and later.
```
Automatic merge from submit-queue (batch tested with PRs 61203, 61071). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Fix issue with race condition during pod deletion
This PR fixes two issues
1. When desired_state_populator removes podvolume state, it should check
whether the actual state already has the volume before deleting it to
make sure actual state has a chance to add the volume into the state
2. When checking podVolume still exists, it not only checks the actual
state, but also the volume disk directory because actual state might not
reflect the real world when kubelet starts.
fixes issue #60645
This PR fixes two issues
1. When desired_state_populator removes podvolume state, it should check
whether the actual state already has the volume before deleting it to
make sure actual state has a chance to add the volume into the state
2. When checking podVolume still exists, it not only checks the actual
state, but also the volume disk directory because actual state might not
reflect the real world when kubelet starts.
Automatic merge from submit-queue (batch tested with PRs 60888, 61225). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Mark reconstructed volumes as reported InUse
When a newly started kubelet finds a directory where a volume should be,
it can be fairly confident that the volume was mounted by previous kubelet
and therefore the volume must have been in node.status.volumesInUse.
Therefore we can mark reconstructed volumes as already reported so
subsequent reconcile() can fix the directory and put the mounted volume
into actual state of world.
Fixes: #60645
**Release note**:
```release-note
NONE
```
/sig storage
/sig node
cc: @gnufied @jingxu97
Automatic merge from submit-queue (batch tested with PRs 61118, 60579). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Increase loging verbosity for deleting stateful set pods
We should always log reasons for deleting StatefulSet Pods.
@jdumars - what's the current process for putting such changes into the release? It's literally 0-risk change that helps with debugging.
cc @ttz21
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 61111, 61069). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Use pod UID as cache key instead of namespace/name
UID uniquely identifies pods across lifecycles, while namespace/name
could be 2 different pods across lifecycles. This could result in
tricky scheduler bugs.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes#60966
**Special notes for your reviewer**: @bsalamat
**Release note**:
```release-note
Fix a bug in scheduler cache by using Pod UID as the cache key instead of namespace/name
```
UID uniquely identifies pods across lifecycles, while namespace/name
could be 2 different pods across lifecycles. This could result in
tricky scheduler bugs.
Fixes#60966
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Added missing error check that could cause kubelet to crash
**What this PR does / why we need it**:
Adds missing error check. An error can happen due to a race condition when watched files change, or become inaccessible. This can happen if a file was added to the driver directory then quickly removed, in which case the callback will be called with non-nil `err` and nil `info`, which is not checked, causing kubelet to crash.
**Which issue(s) this PR fixes**:
Fixes#60861
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Add missing container-runtime "remote" option
**What this PR does / why we need it**:
Added the "remote" option to the auto-generated documentation for the
`--container-runtime` flag.
The kubelet flag `--container-runtime` lists the possible values as part of the auto-generated documentation but is missing the "remote" possibility.
**Which issue(s) this PR fixes** :
Fixes#60992
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Set readOnly for CSI mounter
**What this PR does / why we need it**:
Currently the `csiMountMgr .readOnly` field is never set, we should set it to `Spec.ReadOnly`.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes#61008
**Special notes for your reviewer**:
Currently, most of the volume plugins use a `getVolumeSourceFromSpec` method to fetch `VolumeSource` and `ReadOnly` from `volume.Spec`. If the volume is an inline volume, `ReadOnly` is fetched from `Spec.Volume.<SpecificVolumeSource>.ReadOnly`, and if the volume is a `PersistentVolume`, `ReadOnly` is set to `Spec.Readonly`, which comes from `PersistentVolumeClaimVolumeSource.ReadOnly`.
However, as CSI volume plugin is only supported in `PersistentVolume`, so we can just set `ReadOnly` to `Spec.ReadOnly`.
**Release note**:
```release-note
NONE
```
/sig storage
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Code cleanup: group consts togather
**What this PR does / why we need it**:
This is a code cleanup, which groups all consts togather.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Update documentation for azure-shared-securityrule
**What this PR does / why we need it**:
Azure augmented rules for NSGs has been GA https://azure.microsoft.com/en-us/updates/agumented-rules-ga-nsg/. This PR updates documentation for "service.beta.kubernetes.io/azure-shared-securityrule" to reflect this.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
fix show-all option description
**What this PR does / why we need it**:
The default value of kubectl show-all option has been changed from false to true, but its description didn't change accordingly. This patch fix it.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Fixes the races around devicemanager Allocate() and endpoint deletion.
There is a race in predicateAdmitHandler Admit() that getNodeAnyWayFunc()
could get Node with non-zero deviceplugin resource allocatable for a
non-existing endpoint. That race can happen when a device plugin fails,
but is more likely when kubelet restarts as with the current registration
model, there is a time gap between kubelet restart and device plugin
re-registration. During this time window, even though devicemanager could
have removed the resource initially during GetCapacity() call, Kubelet
may overwrite the device plugin resource capacity/allocatable with the
old value when node update from the API server comes in later. This
could cause a pod to be started without proper device runtime config set.
To solve this problem, introduce endpointStopGracePeriod. When a device
plugin fails, don't immediately remove the endpoint but set stopTime in
its endpoint. During kubelet restart, create endpoints with stopTime set
for any checkpointed registered resource. The endpoint is considered to be
in stopGracePeriod if its stoptime is set. This allows us to track what
resources should be handled by devicemanager during the time gap.
When an endpoint's stopGracePeriod expires, we remove the endpoint and
its resource. This allows the resource to be exported through other channels
(e.g., by directly updating node status through API server) if there is such
use case. Currently endpointStopGracePeriod is set as 5 minutes.
Given that an endpoint is no longer immediately removed upon disconnection,
mark all its devices unhealthy so that we can signal the resource allocatable
change to the scheduler to avoid scheduling more pods to the node.
When a device plugin endpoint is in stopGracePeriod, pods requesting the
corresponding resource will fail admission handler.
Tested:
Ran GPUDevicePlugin e2e_node test 100 times and all passed now.
**What this PR does / why we need it**:
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes https://github.com/kubernetes/kubernetes/issues/60176
**Special notes for your reviewer**:
**Release note**:
```release-note
Fixes the races around devicemanager Allocate() and endpoint deletion.
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Task 2: Schedule DaemonSet Pods by default scheduler.
Signed-off-by: Da K. Ma <klaus1982.cn@gmail.com>
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
part of #59194https://github.com/kubernetes/features/issues/548
**Release note**:
```release-note
When ScheduleDaemonSetPods is enabled, the DaemonSet controller will delegate Pods scheduling to default scheduler.
```
There is a race in predicateAdmitHandler Admit() that getNodeAnyWayFunc()
could get Node with non-zero deviceplugin resource allocatable for a
non-existing endpoint. That race can happen when a device plugin fails,
but is more likely when kubelet restarts as with the current registration
model, there is a time gap between kubelet restart and device plugin
re-registration. During this time window, even though devicemanager could
have removed the resource initially during GetCapacity() call, Kubelet
may overwrite the device plugin resource capacity/allocatable with the
old value when node update from the API server comes in later. This
could cause a pod to be started without proper device runtime config set.
To solve this problem, introduce endpointStopGracePeriod. When a device
plugin fails, don't immediately remove the endpoint but set stopTime in
its endpoint. During kubelet restart, create endpoints with stopTime set
for any checkpointed registered resource. The endpoint is considered to be
in stopGracePeriod if its stoptime is set. This allows us to track what
resources should be handled by devicemanager during the time gap.
When an endpoint's stopGracePeriod expires, we remove the endpoint and
its resource. This allows the resource to be exported through other channels
(e.g., by directly updating node status through API server) if there is such
use case. Currently endpointStopGracePeriod is set as 5 minutes.
Given that an endpoint is no longer immediately removed upon disconnection,
mark all its devices unhealthy so that we can signal the resource allocatable
change to the scheduler to avoid scheduling more pods to the node.
When a device plugin endpoint is in stopGracePeriod, pods requesting the
corresponding resource will fail admission handler.
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Change regional PD cloud provider references to use the beta API
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes#59988
**Special notes for your reviewer**: Depends on a version of the GCP Go beta compute client that is not yet available. Also need to rebase with #60337 once it's merged.
/hold
/cc @abgworrall
/assign @saad-ali
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
update Mount propagation version in comment
**What this PR does / why we need it**:
Mount propagation feature was moved to beta in PR [#59252](https://github.com/kubernetes/kubernetes/pull/59252), so update the comment.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes#60657
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```