Currently, the InterPodAffinity plugin processes all nodes whenever the incoming
pod has affinity. In fact, it only needs to consider all nodes when the incoming
pod has preferred affinity. This change reduces the number of nodes that need to
be processed.
This is a performance optimization that reduces the overhead of inter-pod affinity PreFilter calculations. It essentially
eliminates that overhead when no pods in the cluster use required pod anti-affinity. This offered a 20% improvement on 5k clusters for preferred anti-affinity benchmarks.
When checking the incoming pod's anti-affinity rules, there is a chance to
return early when there are no matching anti-affinity terms in the
whole cluster.
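
The early return described above can be sketched roughly as follows. This is
a minimal illustration, not the plugin's actual code: `topologyPair`,
`satisfiesAntiAffinity` and the plain-map inputs are hypothetical names, and
the counts map is assumed to hold, per topology pair, how many existing pods
match the incoming pod's anti-affinity terms.

```go
package main

import "fmt"

// topologyPair is a hypothetical (topologyKey, value) pair, e.g. ("zone", "a").
type topologyPair struct {
	key   string
	value string
}

// satisfiesAntiAffinity reports whether a node passes the incoming pod's
// anti-affinity terms. counts is assumed to be precomputed for the whole
// cluster; termTopologyKeys are the topology keys of the incoming pod's terms.
func satisfiesAntiAffinity(counts map[topologyPair]int, termTopologyKeys []string, nodeLabels map[string]string) bool {
	// Early return: nothing in the cluster matched any term, so no node
	// can violate the incoming pod's anti-affinity rules.
	if len(counts) == 0 {
		return true
	}
	for _, key := range termTopologyKeys {
		value, ok := nodeLabels[key]
		if !ok {
			continue // the node is not part of this topology
		}
		if counts[topologyPair{key: key, value: value}] > 0 {
			return false
		}
	}
	return true
}

func main() {
	counts := map[topologyPair]int{{key: "zone", value: "a"}: 1}
	nodeInA := map[string]string{"zone": "a"}
	fmt.Println(satisfiesAntiAffinity(nil, []string{"zone"}, nodeInA))
	fmt.Println(satisfiesAntiAffinity(counts, []string{"zone"}, nodeInA))
}
```

With an empty map the per-term loop is skipped entirely, which is where the
saving comes from when the function is called once per node.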
The lack of this validation on incoming pods causes unpredictable cluster outcomes
when later calculating affinity results against existing pods (see #92714). This fix
quickly addresses the main source where these problems should be caught.
It is unfortunately difficult to add this validation directly to the API server due
to the fact that it may break migrations with existing pods that fail this check. This
is a compromise to address the current issue.
The latest change on master renamed the node name from "machine" to "node"
but didn't update all the affected code, which made some of the test cases
invalid.
Signed-off-by: Dave Chen <dave.chen@arm.com>
The implementation consists of
- identifying all places where VolumeSource.PersistentVolumeClaim has
a special meaning and then ensuring that the same code path is taken
for an ephemeral volume, with the ownership check
- adding a controller that produces the PVCs for each embedded
VolumeSource.EphemeralVolume
- relaxing the PVC protection controller such that it removes
the finalizer already before the pod is deleted (only
if the GenericEphemeralVolume feature is enabled): this is
needed to break a cycle where foreground deletion of the pod
blocks on removing the PVC, which waits for deletion of the pod
The controller was derived from the endpointslices controller.
And give ownership to pkg/scheduler/framework/plugins/volumebinding
Signed-off-by: Aldo Culquicondor <acondor@google.com>
Change-Id: I4bd89b1745a2be0e458601056ab905bdd6692195
If no potential victims could be found, there is no need to evaluate the node
again, since its state didn't change.
It's safe to return and thus prevent scheduling from running the filter plugins
again.
NOTE:
A node that is filtered out by the filter plugins could pass them later if
something changes on that node, e.g. pods terminating on that node.
Previously, this could be caught either by the normal `schedule` or by `preempt`
(pods are terminated while the preemption logic is finding nodes and
re-evaluating the filter plugins).
However, preemption shouldn't have to handle this. Since the `schedule` routine
is always running when the interval is "zero", letting `schedule` handle it
frees `preempt` from work that is unrelated to preemption.
For this reason, a couple of test cases and the logic that checks for the
existence of victim pods are removed, since that case can never happen after
the change.
Signed-off-by: Dave Chen <dave.chen@arm.com>
This uses the information provided by a CSI driver deployment for
checking whether a node has access to enough storage to create the
currently unbound volumes, if the CSI driver opts into that checking
with CSIDriver.Spec.VolumeCapacity != false.
This resolves a TODO from commit 95b530366a.
The case where a node's labels don't contain the required topologyKeys in
`Constraints` cannot be resolved by preempting the pods on that node.
One scenario that easily reproduces the issue:
- set `alwaysCheckAllPredicates` to true.
- one node contains all the required topologyKeys but fails predicates
such as 'taint'.
- another node doesn't hold all the required topologyKeys, and thus returns the
`Unschedulable` status code.
- the scheduler will try to preempt the lower-priority pods on the above node.
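
A rough sketch of the idea, assuming the fix is to report such nodes with an
"unresolvable" status so that preemption skips them. The `Code` values,
`filterNode` and `preemptionCandidates` below are simplified, hypothetical
stand-ins and not the framework's real types:

```go
package main

import (
	"fmt"
	"sort"
)

// Code is a simplified, hypothetical status code.
type Code int

const (
	Success Code = iota
	Unschedulable
	UnschedulableAndUnresolvable
)

// filterNode marks a node as unresolvable when it lacks any required
// topology key: evicting pods cannot add labels to a node.
func filterNode(nodeLabels map[string]string, topologyKeys []string) Code {
	for _, key := range topologyKeys {
		if _, ok := nodeLabels[key]; !ok {
			return UnschedulableAndUnresolvable
		}
	}
	return Success
}

// preemptionCandidates keeps only nodes whose failure might be fixed by
// removing pods, i.e. plain Unschedulable nodes.
func preemptionCandidates(statuses map[string]Code) []string {
	var names []string
	for name, code := range statuses {
		if code == Unschedulable {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	required := []string{"zone", "hostname"}
	statuses := map[string]Code{
		// node-a has all keys but is assumed to fail another predicate (e.g. taint).
		"node-a": Unschedulable,
		// node-b lacks the "zone" label.
		"node-b": filterNode(map[string]string{"hostname": "node-b"}, required),
	}
	fmt.Println(preemptionCandidates(statuses)) // prints [node-a]
}
```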
Signed-off-by: Dave Chen <dave.chen@arm.com>
Refactor genericScheduler and signature of preemption funcs
- remove podNominator from genericScheduler
- simplify signature of preemption functions
Make Preempt() private
Previously, separate interfaces were defined for Reserve and Unreserve
plugins. However, in nearly all cases, a plugin that allocates a
resource using Reserve will likely want to register itself for Unreserve
as well in order to free the allocated resource at the end of a failed
scheduling/binding cycle. Having separate plugins for Reserve and
Unreserve also adds unnecessary config toil. To that end, this patch
aims to merge the two plugins into a single interface called a
ReservePlugin that requires implementing both the Reserve and Unreserve
methods.
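
As an illustration of the merged interface, here is a simplified sketch. The
method signatures are deliberately reduced to plain strings and `error` and
are assumptions of this example; `slotReserver`, `failingReserver` and
`runReserve` are hypothetical names. The sketch also assumes that Unreserve
is idempotent and is called on every plugin, in reverse order, after a
failure.

```go
package main

import (
	"errors"
	"fmt"
)

// ReservePlugin is a simplified version of the merged interface: one plugin
// implements both methods.
type ReservePlugin interface {
	Reserve(pod, node string) error
	Unreserve(pod, node string)
}

// slotReserver is a hypothetical plugin that records a reservation per pod.
type slotReserver struct {
	reserved map[string]string // pod -> node
}

func (s *slotReserver) Reserve(pod, node string) error {
	s.reserved[pod] = node
	return nil
}

// Unreserve is idempotent: deleting a missing key is a no-op.
func (s *slotReserver) Unreserve(pod, node string) {
	delete(s.reserved, pod)
}

// failingReserver is a hypothetical plugin whose Reserve always fails.
type failingReserver struct{}

func (failingReserver) Reserve(pod, node string) error { return errors.New("no capacity") }
func (failingReserver) Unreserve(pod, node string)     {}

// runReserve calls Reserve on each plugin and rolls everything back through
// Unreserve if any plugin fails.
func runReserve(plugins []ReservePlugin, pod, node string) error {
	for _, p := range plugins {
		if err := p.Reserve(pod, node); err != nil {
			for i := len(plugins) - 1; i >= 0; i-- {
				plugins[i].Unreserve(pod, node)
			}
			return fmt.Errorf("reserving %s on %s: %w", pod, node, err)
		}
	}
	return nil
}

func main() {
	slots := &slotReserver{reserved: map[string]string{}}
	err := runReserve([]ReservePlugin{slots, failingReserver{}}, "pod-a", "node-1")
	fmt.Println(err != nil, len(slots.reserved))
}
```

Because both methods live on one type, a plugin cannot be registered for
Reserve while forgetting the matching cleanup.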
- Add a defaultpreemption PostFilter plugin
- Make g.Preempt() stateless
- make g.getLowerPriorityNominatedPods() stateless
- make g.processPreemptionWithExtenders() stateless
`DefaultPodTopologySpread` needn't score when `TopologySpreadConstraints`
is specified.
`PreScore` needn't run in that case either, which avoids the cost of `PreScore`
where possible.
Signed-off-by: Dave Chen <dave.chen@arm.com>
This makes it easier to catch the issue at compile time, and it also
aligns with other plugins, e.g. the "InterPodAffinity" plugin.
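
One common Go idiom for this kind of compile-time check is a blank-identifier
interface assertion. Whether this is exactly what the change adds is an
assumption; `PreScorePlugin` and `examplePlugin` below are hypothetical,
simplified names:

```go
package main

import "fmt"

// PreScorePlugin is a simplified, hypothetical plugin interface.
type PreScorePlugin interface {
	Name() string
	PreScore(pod string, nodes []string) error
}

// examplePlugin is a hypothetical plugin implementation.
type examplePlugin struct{}

// Compile-time assertion: the build fails here if *examplePlugin stops
// implementing PreScorePlugin, e.g. after an interface method changes.
var _ PreScorePlugin = &examplePlugin{}

func (*examplePlugin) Name() string { return "Example" }

func (*examplePlugin) PreScore(pod string, nodes []string) error { return nil }

func main() {
	var p PreScorePlugin = &examplePlugin{}
	fmt.Println(p.Name())
}
```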
Signed-off-by: Dave Chen <dave.chen@arm.com>
This new approach results in better spreading for a small number of pods, while still giving meaning to the maxSkew parameter.
Signed-off-by: Aldo Culquicondor <acondor@google.com>
The assumption is that 90% of images on Docker Hub fall into the range (23~1000)MB.
This assumption applies to individual container images rather than to pods.
A pod might hold multiple container images, so it's better to scale the assumption
by the number of container images.
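
The scaling can be illustrated with a small sketch. The constant and function
names are hypothetical, the 0..100 score range is assumed, and this sketch
assumes only the upper bound is multiplied by the number of containers:

```go
package main

import "fmt"

const (
	mb int64 = 1024 * 1024
	// Assumed per-image bounds from the (23~1000)MB range above.
	minThreshold          int64 = 23 * mb
	maxContainerThreshold int64 = 1000 * mb
	// maxScore is the assumed upper end of the score range.
	maxScore int64 = 100
)

// imageScore maps the total size of images already present on a node to a
// score in [0, maxScore]. The upper bound grows with the number of
// containers (assumption of this sketch).
func imageScore(sumSizes int64, numContainers int) int64 {
	maxThreshold := maxContainerThreshold * int64(numContainers)
	if sumSizes < minThreshold {
		sumSizes = minThreshold
	} else if sumSizes > maxThreshold {
		sumSizes = maxThreshold
	}
	return maxScore * (sumSizes - minThreshold) / (maxThreshold - minThreshold)
}

func main() {
	// 1000MB already on the node saturates the score for a one-container pod,
	// but only reaches about half of the range for a two-container pod.
	fmt.Println(imageScore(1000*mb, 1), imageScore(1000*mb, 2))
}
```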
Signed-off-by: Dave Chen <dave.chen@arm.com>
`BuildArgs` is not used anywhere, and the `args` can be obtained directly from
the instance without defining a method for that.
Signed-off-by: Dave Chen <dave.chen@arm.com>
This change also makes it possible to score resources beyond "cpu"
and "memory", which are currently listed in "defaultRequestedRatioResources".
Signed-off-by: Dave Chen <dave.chen@arm.com>