This change enriches the testcase of `TestPriorityQueue_MoveAllToActiveOrBackoffQueue`
to cover the case where pods in the unschedulableQ can be moved to the activeQ
after backing off.
Signed-off-by: Dave Chen <dave.chen@arm.com>
Currently, the interpodaffinity plugin processes all nodes whenever the incoming
pod has affinity. In fact, it only needs to look at all nodes when the incoming pod
has preferred affinity, so this change reduces the number of nodes that need to be
processed.
This is a performance optimization that reduces the overhead of inter-pod affinity PreFilter calculations.
It essentially eliminates that overhead when no pods in the cluster use required pod anti-affinity, and offered a 20% improvement on 5k-node clusters in the preferred anti-affinity benchmarks.
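A rough sketch of the idea (the names below are illustrative, not the plugin's actual fields): the expensive all-nodes pass is only needed when it can actually matter.

```go
// Sketch only: illustrative names, not the plugin's actual fields.
// PreFilter needs the all-nodes pass only when the incoming pod carries
// affinity/anti-affinity terms of its own, or when some existing pod
// uses required anti-affinity that could reject the incoming pod.
func needAllNodesPass(incomingHasAffinityTerms bool, podsWithRequiredAntiAffinity int) bool {
	return incomingHasAffinityTerms || podsWithRequiredAntiAffinity > 0
}
```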
This commit transforms the next() function of the scheduler node
tree into a listNodes() function that directly returns a list of
nodes, going through the zones in a round-robin fashion. This
removes the flawed logic of the next() function.
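A minimal sketch of the new traversal, assuming a tree that maps zone names to node names (field and method names are illustrative):

```go
// nodeTree is a simplified stand-in for the scheduler's zone tree.
type nodeTree struct {
	tree     map[string][]string // zone name -> names of nodes in that zone
	zones    []string            // zone names in a stable order
	numNodes int
}

// listNodes returns all nodes, taking one node from each zone per round
// until every node has been emitted exactly once. Unlike a stateful
// next(), there is no cursor that can be left in an inconsistent state.
func (nt *nodeTree) listNodes() []string {
	nodes := make([]string, 0, nt.numNodes)
	for round := 0; len(nodes) < nt.numNodes; round++ {
		for _, zone := range nt.zones {
			if zoneNodes := nt.tree[zone]; round < len(zoneNodes) {
				nodes = append(nodes, zoneNodes[round])
			}
		}
	}
	return nodes
}
```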
The apiserver is expected to send pod deletion events, which might arrive at a later time. However, sometimes a node can be recreated without its pods being deleted.
Partial revert of https://github.com/kubernetes/kubernetes/pull/86964
Signed-off-by: Aldo Culquicondor <acondor@google.com>
Change-Id: I51f683e5f05689b711c81ebff34e7118b5337571
Only wait for finished binding or error, but not both
Signed-off-by: Aldo Culquicondor <acondor@google.com>
Change-Id: I13d16e6c7c45c6527591aa05cc79fc5e96d47a68
When checking the incoming pod's anti-affinity rules, there is a chance to
return early when no anti-affinity term is matched anywhere in the
whole cluster.
The lack of this validation on incoming pods causes unpredictable cluster outcomes
when affinity is later calculated against existing pods (see #92714). This fix
addresses the main entry point where these problems should be caught.
Unfortunately, it is difficult to add this validation directly to the API server,
because it may break migrations of existing pods that fail the check. This
is a compromise to address the current issue.
- Move "ForgetPod" after "RunReservePluginsUnreserve", so that the cache would hold the pod to
avoid it's being retried simutaneously until Unreserve is completed.
- Move "assume" ahead of "RunReservePluginsReserve". This is based on the fact that "ForgetPod" is
the last step of failure path, so "assume" should be reversly treated as the first step. The
current failure path is like this:
assume -> reserve -> unreserve -> forgetPod -> recordingFailure
- Make subtests of TestReservePluginUnreserve stateless
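A sketch of the resulting order; the helpers are stand-ins for the real scheduler calls, and only the sequence matters:

```go
// Stand-ins for the real scheduler cache and framework calls.
func assumePod(pod, nodeName string) error                { return nil }
func runReservePluginsReserve(pod, nodeName string) error { return nil }
func runReservePluginsUnreserve(pod, nodeName string)     {}
func forgetPod(pod string)                                {}

// reserveForScheduling shows the success path and, on failure, the
// reverse unwind: assume is first, so ForgetPod must be last.
func reserveForScheduling(pod, nodeName string) error {
	if err := assumePod(pod, nodeName); err != nil {
		return err
	}
	if err := runReservePluginsReserve(pod, nodeName); err != nil {
		// Unwind in reverse: undo the plugin reservations first, and only
		// then forget the pod, so the cache keeps holding it and the pod
		// cannot be retried concurrently while Unreserve is running.
		runReservePluginsUnreserve(pod, nodeName)
		forgetPod(pod)
		return err // the caller records the failure as the final step
	}
	return nil
}
```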
When creating fake data for the nodeTree unit tests, 'append' is invoked
on the shared fake data set. That makes subsequent unit tests run with
unexpected fake data.
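The failure mode reduces to a well-known Go slice pitfall; here is a self-contained reproduction with invented fixture names:

```go
package main

import "fmt"

func main() {
	// A shared fixture with spare capacity, as the common fake data had.
	commonNodes := make([]string, 2, 4)
	commonNodes[0], commonNodes[1] = "node-1", "node-2"

	a := append(commonNodes, "zone-a-extra") // writes into the shared backing array
	b := append(commonNodes, "zone-b-extra") // overwrites the element a just gained

	fmt.Println(a[2], b[2]) // both print "zone-b-extra"
}
```

Rebuilding (or deep-copying) the fixture in each test avoids the aliasing.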
When nodes are added in multiple zones at once, the nodeTree next
function does not return a correct list of nodes but repeats some of them.
This commit resets the index before starting to call next() to
prevent this issue.
Special thanks to igraecao for the help in finding the bug
Co-authored-by: igraecao <matvej.yolli@outlook.com>
A recent change on master renamed the node name from "machine" to "node"
but didn't update all the affected code, which makes some of the test cases
invalid.
Signed-off-by: Dave Chen <dave.chen@arm.com>
The implementation consists of
- identifying all places where VolumeSource.PersistentVolumeClaim has
a special meaning and then ensuring that the same code path is taken
for an ephemeral volume, with the ownership check
- adding a controller that produces the PVCs for each embedded
VolumeSource.EphemeralVolume
- relaxing the PVC protection controller such that it removes
the finalizer even before the pod is deleted (only
if the GenericEphemeralVolume feature is enabled): this is
needed to break a cycle where foreground deletion of the pod
blocks on removing the PVC, which waits for deletion of the pod
The controller was derived from the endpointslices controller.
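A minimal sketch of what the controller produces per embedded volume source, assuming the PVC is owned by the pod so that garbage collection ties their lifetimes together (the naming scheme and spec plumbing are assumptions here):

```go
import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newEphemeralPVC derives a PVC for one ephemeral volume of a pod.
func newEphemeralPVC(pod *v1.Pod, volumeName string, spec v1.PersistentVolumeClaimSpec) *v1.PersistentVolumeClaim {
	isController := true
	return &v1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{
			Name:      pod.Name + "-" + volumeName, // assumed naming scheme
			Namespace: pod.Namespace,
			// Owned by the pod, so deleting the pod eventually deletes the PVC.
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "v1",
				Kind:       "Pod",
				Name:       pod.Name,
				UID:        pod.UID,
				Controller: &isController,
			}},
		},
		Spec: spec,
	}
}
```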
And give ownership to pkg/scheduler/framework/plugins/volumebinding
Signed-off-by: Aldo Culquicondor <acondor@google.com>
Change-Id: I4bd89b1745a2be0e458601056ab905bdd6692195
If no potential victims could be found, there is no need to evaluate the node
again, since its state didn't change.
It's safe to return and thus prevent scheduling from running the filter plugins
again.
NOTE:
A node that is filtered out by the filter plugins could pass them if
there is a change on that node, e.g. pods terminating on that node.
Previously, this could be caught either by the normal `schedule` path or by `preempt` (pods
are terminated when the preemption logic tries to find the nodes and re-evaluates
the filter plugins.)
Actually, this shouldn't be handled by preemption: given that the `schedule`
routine is always running (the interval is "zero"), letting `schedule`
take care of it relieves `preempt` of work irrelevant to preemption.
For the above reason, a couple of test cases, as well as the logic that checks
the existence of victim pods, are removed, since this can never happen after the change.
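In sketch form (types reduced to strings), the early return looks like this:

```go
// selectVictimsOnNode gives up immediately when there is nothing to
// evict: with no victims removed, the node's state is unchanged, so the
// filter plugins would fail exactly as they did during scheduling.
func selectVictimsOnNode(potentialVictims []string) (victims []string, fits bool) {
	if len(potentialVictims) == 0 {
		return nil, false
	}
	// ... re-run the filter plugins with the victims removed ...
	return potentialVictims, true
}
```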
Signed-off-by: Dave Chen <dave.chen@arm.com>
This uses the information provided by a CSI driver deployment for
checking whether a node has access to enough storage to create the
currently unbound volumes, if the CSI driver opts into that checking
with CSIDriver.Spec.VolumeCapacity != false.
This resolves a TODO from commit 95b530366a.
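Reduced to its core, and ignoring how capacity is actually discovered, the check is a per-volume comparison; this is a sketch, not the volume binder's real accounting:

```go
import "k8s.io/apimachinery/pkg/api/resource"

// nodeHasCapacity assumes capacity is reported per topology segment and
// checks each requested size individually; the real accounting is more
// involved (e.g. it cannot simply sum the requests).
func nodeHasCapacity(requested []resource.Quantity, available resource.Quantity) bool {
	for _, size := range requested {
		if available.Cmp(size) < 0 {
			return false
		}
	}
	return true
}
```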
A node whose labels don't contain the required topologyKeys in `Constraints`
cannot be made schedulable by preempting the pods on that node.
One use case that could easily reproduce the issue is:
- set `alwaysCheckAllPredicates` to true.
- one node contains all the required topologyKeys but fails other predicates,
such as 'taint'.
- another node doesn't hold all the required topologyKeys, and thus returns the
`Unschedulable` status code.
- the scheduler will try to preempt the lower-priority pods on the above node.
Signed-off-by: Dave Chen <dave.chen@arm.com>
Using NodeWrapper in the integration tests gives more flexibility when
creating nodes. For instance, tests can create nodes with labels or
with a specific set of resources.
Also, NodeWrapper initialises a node with a capacity of 32 pods, which
can be overridden by the caller. This makes sure that a node is usable
as soon as it is created.
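Illustrative usage, assuming the builder-style methods commonly found in pkg/scheduler/testing (exact signatures may differ):

```go
import (
	v1 "k8s.io/api/core/v1"
	st "k8s.io/kubernetes/pkg/scheduler/testing"
)

// makeZonedNode builds a labeled node with an explicit pod capacity.
func makeZonedNode(name, zone string) *v1.Node {
	return st.MakeNode().
		Name(name).
		Label("topology.kubernetes.io/zone", zone).
		Capacity(map[v1.ResourceName]string{v1.ResourcePods: "32"}).
		Obj()
}
```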
If a reserve plugin's Reserve method returns an error, there could be
previously allocated resources from successfully completed reserve
plugins that must be unallocated by the corresponding Unreserve
operation. Since Unreserve operations are idempotent, this patch runs
the Unreserve operation of ALL reserve plugins when a Reserve operation
fails.
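A sketch of the pattern, with the framework types elided:

```go
// reserver is a stand-in for the framework's reserve plugin interface.
type reserver interface {
	Reserve() error
	Unreserve()
}

// runReserve rolls back ALL plugins on the first failure, not just the
// ones that already ran; idempotent Unreserve makes the extra calls safe.
func runReserve(plugins []reserver) error {
	for _, pl := range plugins {
		if err := pl.Reserve(); err != nil {
			for _, p := range plugins {
				p.Unreserve()
			}
			return err
		}
	}
	return nil
}
```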
Refactor genericScheduler and signature of preemption funcs
- remove podNominator from genericScheduler
- simplify signature of preemption functions
Make Preempt() private
Previously, separate interfaces were defined for Reserve and Unreserve
plugins. However, in nearly all cases, a plugin that allocates a
resource using Reserve will likely want to register itself for Unreserve
as well in order to free the allocated resource at the end of a failed
scheduling/binding cycle. Having separate plugins for Reserve and
Unreserve also adds unnecessary config toil. To that end, this patch
aims to merge the two plugins into a single interface called a
ReservePlugin that requires implementing both the Reserve and Unreserve
methods.
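The merged interface looks roughly like this (abbreviated; Plugin, CycleState, and Status are the framework's own types, and exact signatures may vary by release):

```go
type ReservePlugin interface {
	Plugin
	// Reserve is called when the scheduler assumes the pod on a node,
	// giving the plugin a chance to allocate associated resources.
	Reserve(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) *Status
	// Unreserve is called at the end of a failed scheduling or binding
	// cycle and must be idempotent.
	Unreserve(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string)
}
```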
and e2e_scheduling_duration_seconds
Also add a result label to e2e_scheduling_duration_seconds. Previously, the metric was only updated for successful attempts.
Signed-off-by: Aldo Culquicondor <acondor@google.com>
- Add a defaultpreemption PostFilter plugin
- Make g.Preempt() stateless
- make g.getLowerPriorityNominatedPods() stateless
- make g.processPreemptionWithExtenders() stateless
`DefaultPodTopologySpread` needn't score when `TopologySpreadConstraints`
are specified.
`PreScore` needn't run in this case either, which cuts off the cost of `PreScore`
where possible.
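The guard amounts to a one-line check (a sketch; the wiring into PreScore is elided, and the helper name is invented):

```go
import v1 "k8s.io/api/core/v1"

// skipDefaultSpreading reports whether DefaultPodTopologySpread should
// step aside: explicit constraints are handled by PodTopologySpread.
func skipDefaultSpreading(pod *v1.Pod) bool {
	return len(pod.Spec.TopologySpreadConstraints) > 0
}
```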
Signed-off-by: Dave Chen <dave.chen@arm.com>
This makes it easier to catch the issue at compile time, and it also aligns
with other plugins, e.g. the "InterPodAffinity" plugin.
Signed-off-by: Dave Chen <dave.chen@arm.com>
- replace error with NodeToStatusMap in Preempt() signature
- eliminate podPreemptor interface and expose its functions statelessly
- move logic in scheduler.go#preempt to generic_scheduler.go#Preempt()