When a PodAffinityTerm uses TopologyKey=kubernetes.io/hostname, we can
avoid searching the entire cluster for a match by only listing pods on
the given node.
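A minimal sketch of the shortcut described above, assuming each node has a unique `kubernetes.io/hostname` label value. The `Pod` and `podIndex` types and the `candidates` method are hypothetical stand-ins for illustration, not the scheduler's actual types.

```go
package main

import "fmt"

// hostnameTopologyKey is the well-known per-node label key.
const hostnameTopologyKey = "kubernetes.io/hostname"

// Pod is a simplified, hypothetical stand-in for *v1.Pod.
type Pod struct {
	Name     string
	NodeName string
}

// podIndex is a hypothetical cache that groups pods by node name.
type podIndex struct {
	byNode map[string][]Pod
}

// candidates returns the pods that have to be examined for an affinity term
// with the given topology key when evaluated against nodeName.
func (idx *podIndex) candidates(topologyKey, nodeName string) []Pod {
	if topologyKey == hostnameTopologyKey {
		// Assumption: hostnames are unique per node, so only pods on this
		// node can share its topology domain.
		return idx.byNode[nodeName]
	}
	// Any other key may span several nodes, so scan every pod.
	var all []Pod
	for _, pods := range idx.byNode {
		all = append(all, pods...)
	}
	return all
}

func main() {
	idx := &podIndex{byNode: map[string][]Pod{
		"node-a": {{Name: "web-1", NodeName: "node-a"}},
		"node-b": {{Name: "web-2", NodeName: "node-b"}, {Name: "db-1", NodeName: "node-b"}},
	}}
	fmt.Println(len(idx.candidates(hostnameTopologyKey, "node-b")))
	fmt.Println(len(idx.candidates("example.com/zone", "node-b")))
}
```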
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Avoid string concatenation when comparing pods.
**What this PR does / why we need it**:
Pod comparison in (*NodeInfo).Filter was using GetPodFullName before
comparing pod names. This is a concatenation of pod name and pod
namespace, and it is significantly faster to compare name & namespace
instead.
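The difference can be illustrated with a small sketch. The `Pod` type is a simplified stand-in, and `fullName` only imitates a name-plus-namespace concatenation (the `_` separator is an assumption), so treat this as an illustration rather than the scheduler's code.

```go
package main

import "fmt"

// Pod is a simplified, hypothetical stand-in for *v1.Pod.
type Pod struct {
	Name      string
	Namespace string
}

// fullName imitates the concatenation approach: every call builds a new
// temporary string before any comparison can happen.
func fullName(p *Pod) string {
	return p.Name + "_" + p.Namespace
}

// samePodByFullName compares two pods via their concatenated names.
func samePodByFullName(a, b *Pod) bool {
	return fullName(a) == fullName(b)
}

// samePod compares the two fields directly and builds no temporary strings.
func samePod(a, b *Pod) bool {
	return a.Name == b.Name && a.Namespace == b.Namespace
}

func main() {
	a := &Pod{Name: "web-1", Namespace: "default"}
	b := &Pod{Name: "web-1", Namespace: "kube-system"}
	fmt.Println(samePodByFullName(a, a), samePod(a, a))
	fmt.Println(samePodByFullName(a, b), samePod(a, b))
}
```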
This is a set of 3 PRs targeting affinity predicate performance. (#57476, #57477, #57478) The key takeaway is approximately 2x speedup in the large affinity benchmark.
The unexpected increase in BenchmarkScheduling/1000Nodes/1000Pods seems to be an outlier, and did not recur on subsequent runs. The benchmarks have a moderate amount of variance to them, and I did not run them enough times to measure mean and standard deviation.
| test | b.N | master | #57476 | #57477 | #57478 | combined |
| ---- | --- | ------ | ------ | ---------- | ---------- | -------- |
| BenchmarkScheduling/100Nodes/0Pods | 100 | 39629010 ns/op | 36898566 ns/op (-6.89%) | 38461530 ns/op (-2.95%) | 36214136 ns/op (-8.62%) | 43090781 ns/op (+8.74%) |
| BenchmarkScheduling/100Nodes/1000Pods | 100 | 85489577 ns/op | 69538016 ns/op (-18.66%) | 70104254 ns/op (-18.00%) | 75015585 ns/op (-12.25%) | 80986960 ns/op (-5.27%) |
| BenchmarkScheduling/1000Nodes/0Pods | 100 | 219356660 ns/op | 200149051 ns/op (-8.76%) | 192867469 ns/op (-12.08%) | 196896770 ns/op (-10.24%) | 212563662 ns/op (-3.10%) |
| BenchmarkScheduling/1000Nodes/1000Pods | 100 | 380368238 ns/op | 381786369 ns/op (+0.37%) | 387224973 ns/op (+1.80%) | 417974358 ns/op (+9.89%) | 411140230 ns/op (+8.09%) |
| BenchmarkSchedulingAntiAffinity/500Nodes/250Pods | 250 | 124399176 ns/op | 97568988 ns/op (-21.57%) | 112027363 ns/op (-9.95%) | 129134326 ns/op (+3.81%) | 98607941 ns/op (-20.73%) |
| BenchmarkSchedulingAntiAffinity/500Nodes/5000Pods | 250 | 491677096 ns/op | 441562422 ns/op (-10.19%) | 278127757 ns/op (-43.43%) | 447355609 ns/op (-9.01%) | 226310721 ns/op (-53.97%) |
Combined performance contains all three patches.
Percentages are relative to master.
Methodology:
I ran the tests on each branch with this command.
```
make test-integration WHAT="./test/integration/scheduler_perf" KUBE_TEST_ARGS="-run=xxxx -bench=."
```
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #
The three PRs in this set should collectively fix #54189.
**Special notes for your reviewer**:
**Release note**:
```release-note
Improve scheduler performance of MatchInterPodAffinity predicate.
```
The method (*schedulerCache).FilteredList builds an array of *v1.Pod
that contains every pod in the cluster except for those filtered out by
a predicate. Today, it starts with a nil slice and appends to it.
Based on current usage, FilteredList is expected to return every pod in
the cluster or omit some pods from a single node. This change reserves
array capacity equal to the total number of pods in the cluster.
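The preallocation idea can be sketched as follows. The `Pod` type, the per-node map, and the `filteredList` function are hypothetical simplifications of the cache, used here only to show the `make` call with a reserved capacity.

```go
package main

import "fmt"

// Pod is a simplified, hypothetical stand-in for *v1.Pod.
type Pod struct {
	Name     string
	NodeName string
}

// filteredList returns every pod accepted by keep. It reserves capacity for
// all pods up front, on the assumption that the filter rejects few of them,
// so append never has to grow the backing array.
func filteredList(nodes map[string][]*Pod, keep func(*Pod) bool) []*Pod {
	total := 0
	for _, pods := range nodes {
		total += len(pods)
	}
	result := make([]*Pod, 0, total)
	for _, pods := range nodes {
		for _, p := range pods {
			if keep(p) {
				result = append(result, p)
			}
		}
	}
	return result
}

func main() {
	nodes := map[string][]*Pod{
		"node-a": {{Name: "web-1", NodeName: "node-a"}, {Name: "web-2", NodeName: "node-a"}},
		"node-b": {{Name: "db-1", NodeName: "node-b"}},
	}
	// Omit the pods of a single node, as in the usage described above.
	pods := filteredList(nodes, func(p *Pod) bool { return p.NodeName != "node-b" })
	fmt.Println(len(pods), cap(pods))
}
```

The trade-off is that the returned slice may hold more capacity than it needs when the filter rejects many pods.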
Automatic merge from submit-queue (batch tested with PRs 57257, 55442).
Merge 3 resource allocation priority functions
**What this PR does / why we need it**: These 3 priority functions are closely related and share much of the same logic, so this PR merges them.
**Release note**:
```release-note
None
```
Automatic merge from submit-queue (batch tested with PRs 56375, 56872, 57053, 57165, 57218).
remove extra level check of glog
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 56250, 56809, 56812, 56792, 56724).
Fix typo
Signed-off-by: Li Yi <denverdino@gmail.com>
**What this PR does / why we need it**:
Fix the typo in /plugin/pkg/scheduler/algorithmprovider/defaults.go
Automatic merge from submit-queue (batch tested with PRs 56480, 56675, 56624, 56648, 56658).
fix scheduling queue unit test
This change makes sure the Pop() test finishes completely.
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 56579, 55236, 56512, 56549, 56538).
Heap is not thread safe in scheduling queue
/cc @bsalamat
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 56413, 56322, 56490, 56460, 56487).
Put process of getting pod controller reference into metadata
**What this PR does / why we need it**:
We should extract common processing and data into metadata, as the other map priority functions do, so that we avoid fetching the same required data repeatedly for every node in the map phase.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #
None
**Special notes for your reviewer**:
**Release note**:
```release-note
None
```
Automatic merge from submit-queue (batch tested with PRs 57211, 56150, 56368, 56271, 55957).
Skip pods that refer to PVCs that are being deleted
**What this PR does / why we need it**:
A new check was added to `Schedule()` to make sure that a pod being scheduled refers only to existing PVCs that are not being deleted.
In 1.9 we plan to add a new feature that uses finalizers on PVC to protect PVCs that are used by a running pod from being deleted. This finalizer will be removed when all pods that use a PVC are finished or deleted. See https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/postpone-pvc-deletion-if-used-in-a-pod.md for details.
I needed to pass `pvcLister` to `GenericScheduler`.
UX:
```
$ kubectl describe pod
...
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 5s (x4 over 8s) default-scheduler persistentvolumeclaim "myclaim" is being deleted
Warning FailedScheduling 1s (x2 over 1s) default-scheduler persistentvolumeclaim "myclaim" not found
```
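The PVC check described above can be sketched roughly like this. The `PVC` type, the map-based `pvcLister`, and the `checkPodPVCs` and `markDeleting` helpers are hypothetical simplifications; only the two error messages are taken from the output shown above.

```go
package main

import (
	"fmt"
	"time"
)

// PVC is a simplified, hypothetical stand-in for v1.PersistentVolumeClaim.
type PVC struct {
	Name              string
	DeletionTimestamp *time.Time // non-nil once deletion has been requested
}

// pvcLister is a hypothetical lookup keyed by "namespace/name".
type pvcLister map[string]*PVC

// markDeleting simulates a delete request on a claim protected by a finalizer.
func markDeleting(pvc *PVC) {
	now := time.Now()
	pvc.DeletionTimestamp = &now
}

// checkPodPVCs returns an error if any claim referenced by the pod is missing
// or is already being deleted, in which case the pod should not be scheduled.
func checkPodPVCs(lister pvcLister, namespace string, claimNames []string) error {
	for _, name := range claimNames {
		pvc, ok := lister[namespace+"/"+name]
		if !ok {
			return fmt.Errorf("persistentvolumeclaim %q not found", name)
		}
		if pvc.DeletionTimestamp != nil {
			return fmt.Errorf("persistentvolumeclaim %q is being deleted", name)
		}
	}
	return nil
}

func main() {
	claim := &PVC{Name: "myclaim"}
	lister := pvcLister{"default/myclaim": claim}
	fmt.Println(checkPodPVCs(lister, "default", []string{"myclaim"}))
	markDeleting(claim)
	fmt.Println(checkPodPVCs(lister, "default", []string{"myclaim"}))
}
```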
**Release note**:
```release-note
Scheduler skips pods that use a PVC that either does not exist or is being deleted.
```
/sig scheduling
/kind feature
Automatic merge from submit-queue (batch tested with PRs 57211, 56150, 56368, 56271, 55957).
Declare variables at the front.
**What this PR does / why we need it**:
Declare variables at the front.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
```
Automatic merge from submit-queue (batch tested with PRs 56308, 54304, 56364, 56388, 55853).
httptest server should be closed now that the Close issue has been fixed
**What this PR does / why we need it**:
Per https://github.com/kubernetes/kubernetes/issues/19254, the issue seems to have been fixed for a long time, and `server.Close` is no longer a problem in the Go versions currently in use, so it's time to uncomment the `server.Close()` call.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #
None
**Special notes for your reviewer**:
**Release note**:
```release-note
None
```
Automatic merge from submit-queue (batch tested with PRs 56308, 54304, 56364, 56388, 55853).
clean up failure domain from InterPodAffinityPriority
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 54410, 56184, 56199, 56191, 56231).
remove useless const
Trivial fix.
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue.
delete a node from its cache if it gets node not found error
**What this PR does / why we need it**:
delete a node from its cache if it gets node not found error
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes # https://github.com/kubernetes/kubernetes/issues/56261
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 56688, 56577).
Add pvc as part of equivalence hash
**What this PR does / why we need it**:
We should add PVCs to the equivalence hash so that `StatefulSet` and `Operator` pods always run the volume predicate, while `ReplicaSet` pods can still reuse cached results.
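One way to picture this is a hash computed over the owning controller and the pod's claim names. The sketch below is an assumption-laden illustration: `equivalenceHash`, its inputs, and the choice of FNV-1a are hypothetical and not taken from the scheduler's equivalence cache.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// equivalenceHash is a hypothetical hash over a pod's controller UID and the
// names of the PVCs it references. Pods that share a controller and the same
// claims hash equally; pods with distinct claims do not.
func equivalenceHash(controllerUID string, claimNames []string) uint64 {
	// Sort a copy so that the order of volumes in the pod spec is irrelevant.
	names := append([]string(nil), claimNames...)
	sort.Strings(names)

	h := fnv.New64a()
	h.Write([]byte(controllerUID))
	for _, n := range names {
		h.Write([]byte{0}) // separator between fields
		h.Write([]byte(n))
	}
	return h.Sum64()
}

func main() {
	// ReplicaSet pods share the same claims and can share cached results.
	rs1 := equivalenceHash("rs-1", []string{"shared"})
	rs2 := equivalenceHash("rs-1", []string{"shared"})
	fmt.Println(rs1 == rs2)

	// StatefulSet pods each have their own claim and hash differently.
	sts0 := equivalenceHash("sts-1", []string{"data-0"})
	sts1 := equivalenceHash("sts-1", []string{"data-1"})
	fmt.Println(sts0 == sts1)
}
```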
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #56265
**Special notes for your reviewer**:
**Release note**:
```release-note
Add pvc as part of equivalence hash
```
resource limits are satisfied by the input node's allocatable resources or not.
If yes, the node is assigned a score of 1, otherwise the node's score is not changed.
Automatic merge from submit-queue (batch tested with PRs 55952, 49112, 55450, 56178, 56151).
Add PodDisruptionBudget support in pod preemption
**What this PR does / why we need it**:
This PR adds the logic to make scheduler preemption aware of PodDisruptionBudget. Preemption tries to avoid preempting pods whose PDBs are violated by preemption. If preemption does not find any other pods to preempt, it will preempt pods despite violating their PDBs.
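The preference described above can be sketched as choosing, among candidate nodes, the one whose victims violate the fewest PDBs, while still returning a node when every option violates one. All types and functions here (`Pod`, `PDB`, `countViolations`, `pickNode`) are hypothetical simplifications and do not reproduce the scheduler's preemption code.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a simplified, hypothetical stand-in for *v1.Pod.
type Pod struct {
	Name   string
	Labels map[string]string
}

// PDB is a simplified, hypothetical PodDisruptionBudget.
type PDB struct {
	Selector           map[string]string // a pod is covered if all pairs match
	DisruptionsAllowed int
}

// covers reports whether the budget applies to the pod.
func covers(b PDB, p Pod) bool {
	if len(b.Selector) == 0 {
		return false
	}
	for k, v := range b.Selector {
		if p.Labels[k] != v {
			return false
		}
	}
	return true
}

// countViolations counts the victims whose eviction would exceed a budget.
func countViolations(victims []Pod, pdbs []PDB) int {
	remaining := make([]int, len(pdbs))
	for i, b := range pdbs {
		remaining[i] = b.DisruptionsAllowed
	}
	violations := 0
	for _, v := range victims {
		violated := false
		for i, b := range pdbs {
			if !covers(b, v) {
				continue
			}
			if remaining[i] > 0 {
				remaining[i]--
			} else {
				violated = true
			}
		}
		if violated {
			violations++
		}
	}
	return violations
}

// pickNode prefers the candidate with the fewest PDB violations. It still
// returns a node when every candidate violates a budget.
func pickNode(candidates map[string][]Pod, pdbs []PDB) string {
	names := make([]string, 0, len(candidates))
	for name := range candidates {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic tie-breaking for the sketch
	best, bestViolations := "", -1
	for _, name := range names {
		v := countViolations(candidates[name], pdbs)
		if bestViolations < 0 || v < bestViolations {
			best, bestViolations = name, v
		}
	}
	return best
}

func main() {
	pdbs := []PDB{{Selector: map[string]string{"app": "db"}, DisruptionsAllowed: 1}}
	candidates := map[string][]Pod{
		"node-a": {
			{Name: "db-0", Labels: map[string]string{"app": "db"}},
			{Name: "db-1", Labels: map[string]string{"app": "db"}},
		},
		"node-b": {{Name: "web-0", Labels: map[string]string{"app": "web"}}},
	}
	fmt.Println(pickNode(candidates, pdbs))
}
```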
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #53913
**Special notes for your reviewer**:
**Release note**:
```release-note
Add PodDisruptionBudget support during pod preemption
```
ref/ #47604
/sig scheduling
Automatic merge from submit-queue.
Fix typo
**What this PR does / why we need it**:
Fix typo
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
None
```
Automatic merge from submit-queue (batch tested with PRs 51321, 55969, 55039, 56183, 55976).
Topology aware volume scheduler and PV controller changes
**What this PR does / why we need it**:
Scheduler and PV controller changes to support volume topology aware scheduling, as specified in kubernetes/community#1168
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #54435
**Special notes for your reviewer**:
* I've split the PR into logical commits to make it easier to review
* The remaining TODOs I plan to address next release unless you think it needs to be done now
**Release note**:
```release-note
Adds alpha support for volume scheduling, which allows the scheduler to make PersistentVolume binding decisions while respecting the Pod's scheduling requirements. Dynamic provisioning is not supported with this feature yet.
Action required for existing users of the LocalPersistentVolumes alpha feature:
* The VolumeScheduling feature gate also has to be enabled on kube-scheduler and kube-controller-manager.
* The NoVolumeNodeConflict predicate has been removed. For non-default schedulers, update your scheduler policy.
* The CheckVolumeBinding predicate has to be enabled in non-default schedulers.
```
@kubernetes/sig-storage-pr-reviews @kubernetes/sig-scheduling-pr-reviews
Automatic merge from submit-queue (batch tested with PRs 55103, 56036, 56186).
Removed opaque integer resources (deprecated in v1.8)
**What this PR does / why we need it**:
* Remove opaque integer resources (OIR) support from the code base. This feature was deprecated in v1.8 and replaced by Extended Resources (ER).
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #55102
**Release note**:
```release-note
Remove opaque integer resources (OIR) support (deprecated in v1.8).
```
Automatic merge from submit-queue (batch tested with PRs 56128, 56004, 56083, 55833, 56042).
Suppress the warning when a pod in binding cannot be expired
**What this PR does / why we need it**:
I have a scheduler extender, which implements the `Bind` call and takes several minutes to respond to that call. The scheduler log was full of the following error.
```
W1120 10:23:09.691188 99720 cache.go:442] Couldn't expire cache for pod default/xxx. Binding is still in progress.
```
The TTL for a pod to be expired in the scheduler cache is 30 seconds. But it's also possible that the binding (which is done asynchronously) can take longer than 30 seconds.
2cbb07a439/plugin/pkg/scheduler/factory/factory.go (L143)
The goroutine that checks whether a pod has expired is triggered every second.
2cbb07a439/plugin/pkg/scheduler/schedulercache/cache.go (L33)
So, it will print the following warning every second until the pod expires.
2cbb07a439/plugin/pkg/scheduler/schedulercache/cache.go (L442-L443)
I think it's valid for the binding to take more than one second, so we should downgrade this to an info message to avoid polluting the scheduler log.
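The interaction between the TTL, the periodic check, and an in-progress binding can be sketched as below. The `cache` and `assumedPod` types and the `cleanup` method are hypothetical, and time is modelled as integer seconds to keep the sketch deterministic; it is not the scheduler cache's actual code.

```go
package main

import (
	"fmt"
	"sort"
)

// ttlSeconds mirrors the 30-second TTL mentioned above.
const ttlSeconds = 30

// assumedPod is a hypothetical record for a pod the scheduler has assumed.
type assumedPod struct {
	deadline        int64 // seconds after which the entry may expire
	bindingFinished bool  // false while the asynchronous bind is still running
}

// cache is a simplified, hypothetical stand-in for the scheduler cache.
type cache struct {
	assumed map[string]*assumedPod
}

// cleanup is what a once-per-second check might do. It removes expired entries
// and reports which keys were removed and which were skipped because their
// binding is still in progress.
func (c *cache) cleanup(now int64) (expired, stillBinding []string) {
	for key, p := range c.assumed {
		if !p.bindingFinished {
			// Expected while a slow Bind call is pending, so this deserves an
			// info-level message at most, not a warning on every tick.
			stillBinding = append(stillBinding, key)
			continue
		}
		if now > p.deadline {
			delete(c.assumed, key)
			expired = append(expired, key)
		}
	}
	sort.Strings(expired)
	sort.Strings(stillBinding)
	return expired, stillBinding
}

func main() {
	c := &cache{assumed: map[string]*assumedPod{
		"default/fast": {deadline: 100 + ttlSeconds, bindingFinished: true},
		"default/slow": {deadline: 100 + ttlSeconds, bindingFinished: false},
	}}
	expired, stillBinding := c.cleanup(200)
	fmt.Println(expired, stillBinding)
}
```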
**Release note**:
```release-note
None
```
/sig scheduling
/assign @bsalamat
/cc @vishh