The apiserver is expected to send pod deletion events, but they might arrive at a different time. However, sometimes a node can be recreated without its pods being deleted.
Partial revert of https://github.com/kubernetes/kubernetes/pull/86964
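A minimal, hypothetical sketch of what this implies for the scheduler's bookkeeping (the podCache type and its methods are illustrative only, not the actual scheduler cache code): pods stay associated with a node name even after the node is removed, so a node recreated under the same name still accounts for them until the pod deletion events eventually arrive.

```go
package cache

// podCache is a hypothetical, much simplified stand-in for the scheduler
// cache: pods are tracked by node name and deliberately survive node
// removal, so a node recreated under the same name still "sees" them
// until their deletion events from the apiserver finally arrive.
type podCache struct {
	nodes      map[string]bool
	podsByNode map[string]map[string]bool // node name -> pod UIDs
}

// RemoveNode forgets the node but keeps its pods; their deletion events
// may arrive later, or the node may be recreated without the pods ever
// being deleted.
func (c *podCache) RemoveNode(name string) {
	delete(c.nodes, name)
}

// RemovePod is driven by the apiserver's pod deletion events and is the
// only place where pods actually leave the cache.
func (c *podCache) RemovePod(nodeName, podUID string) {
	if pods := c.podsByNode[nodeName]; pods != nil {
		delete(pods, podUID)
	}
}
```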
Signed-off-by: Aldo Culquicondor <acondor@google.com>
Change-Id: I51f683e5f05689b711c81ebff34e7118b5337571
The lack of this validation on incoming pods causes unpredictable cluster outcomes
when later calculating affinity results against existing pods (see #92714). This fix
quickly addresses the main place where these problems should be caught.
Unfortunately, it is difficult to add this validation directly to the API server,
because it may break migrations of clusters with existing pods that fail this
check. This change is a compromise to address the current issue.
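As a rough illustration of the kind of check meant here (the validateAffinityTerms helper below is hypothetical, not the code added by this change), a pod's affinity terms can be rejected up front when a label selector does not parse or a topology key is empty:

```go
package validation

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// validateAffinityTerms is a hypothetical helper: it rejects pods whose
// affinity terms would later make affinity calculations fail against
// existing pods, e.g. because a label selector does not parse or a
// topology key is empty.
func validateAffinityTerms(terms []v1.PodAffinityTerm) error {
	for i, term := range terms {
		if term.TopologyKey == "" {
			return fmt.Errorf("affinity term %d: empty topologyKey", i)
		}
		if _, err := metav1.LabelSelectorAsSelector(term.LabelSelector); err != nil {
			return fmt.Errorf("affinity term %d: invalid label selector: %v", i, err)
		}
	}
	return nil
}
```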
A recent change on master renamed the node name from "machine" to "node"
but did not update all of the affected code, which left some of the test cases
invalid.
Signed-off-by: Dave Chen <dave.chen@arm.com>
The implementation consists of
- identifying all places where VolumeSource.PersistentVolumeClaim has
a special meaning and then ensuring that the same code path is taken
for an ephemeral volume, with the ownership check
- adding a controller that produces the PVCs for each embedded
VolumeSource.EphemeralVolume (sketched below)
- relaxing the PVC protection controller such that it removes
the finalizer before the pod is deleted (only
if the GenericEphemeralVolume feature is enabled): this is
needed to break a cycle where foreground deletion of the pod
blocks on removing the PVC, which waits for deletion of the pod
The controller was derived from the endpointslices controller.
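A minimal sketch of what the controller produces for each embedded ephemeral volume (the newEphemeralPVC helper and the exact naming are illustrative, not the real controller code): a PVC named after the pod and the volume, owned by the pod, which is also what the ownership check mentioned above verifies.

```go
package ephemeral

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newEphemeralPVC is a hypothetical helper showing the essence of the
// controller: for an ephemeral volume it produces a PVC named
// "<pod name>-<volume name>" that is owned (and thus garbage-collected)
// by the pod, which is what the ownership check elsewhere relies on.
func newEphemeralPVC(pod *v1.Pod, volumeName string, spec v1.PersistentVolumeClaimSpec) *v1.PersistentVolumeClaim {
	isController := true
	return &v1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{
			Name:      pod.Name + "-" + volumeName,
			Namespace: pod.Namespace,
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "v1",
				Kind:       "Pod",
				Name:       pod.Name,
				UID:        pod.UID,
				Controller: &isController,
			}},
		},
		Spec: spec,
	}
}
```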
And give ownership to pkg/scheduler/framework/plugins/volumebinding
Signed-off-by: Aldo Culquicondor <acondor@google.com>
Change-Id: I4bd89b1745a2be0e458601056ab905bdd6692195
If no potential victims could be found, there is no need to evaluate the node
again, since its state didn't change.
It's safe to return and thus prevent scheduling from running the filter plugins
again.
NOTE:
A node that is filtered out by the filter plugins could pass them later if
something changes on that node, e.g. pods terminating on that node.
Previously, this could be caught either by the normal `schedule` path or by
`preempt` (pods may terminate while the preemption logic is finding candidate
nodes and re-evaluating the filter plugins).
This case shouldn't be handled by preemption: since the `schedule` routine
is effectively always running (its interval is "zero"), letting `schedule`
take care of it frees `preempt` from work that is irrelevant to preemption.
For the same reason, a couple of test cases, as well as the logic that checks
for the existence of victim pods, are removed, since that situation can no
longer happen after this change.
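A simplified, self-contained sketch of the early return (the function name and signature are illustrative, not the actual preemption code):

```go
package preemption

type podInfo struct {
	name     string
	priority int32
}

// selectVictimsOnNode is a hypothetical, simplified stand-in for the
// preemption helper: it collects lower-priority pods as potential
// victims and returns early when there are none, because removing
// nothing cannot change the node's state or the filter results.
func selectVictimsOnNode(podPriority int32, podsOnNode []podInfo) ([]podInfo, bool) {
	var potentialVictims []podInfo
	for _, p := range podsOnNode {
		if p.priority < podPriority {
			potentialVictims = append(potentialVictims, p)
		}
	}
	if len(potentialVictims) == 0 {
		// No victims: the node's state would not change, so there is no
		// need to run the filter plugins against it again.
		return nil, false
	}
	// ... otherwise remove victims one by one and re-run the filters ...
	return potentialVictims, true
}
```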
Signed-off-by: Dave Chen <dave.chen@arm.com>
This uses the information provided by a CSI driver deployment to check
whether a node has access to enough storage to create the
currently unbound volumes, if the CSI driver opts into that check
with CSIDriver.Spec.VolumeCapacity != false.
This resolves a TODO from commit 95b530366a.
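A rough sketch of the capacity comparison this enables, with a hypothetical helper and the simplifying assumption that the capacity reported for a node's topology segment must cover the sum of the still-unbound volume requests:

```go
package capacity

import (
	"k8s.io/apimachinery/pkg/api/resource"
)

// nodeHasCapacity is a hypothetical helper: it answers whether the
// capacity reported for the topology segment a node belongs to is
// large enough for all of the still-unbound volumes of a pod.
func nodeHasCapacity(reported resource.Quantity, unboundRequests []resource.Quantity) bool {
	remaining := reported.DeepCopy()
	for _, req := range unboundRequests {
		if remaining.Cmp(req) < 0 {
			return false
		}
		remaining.Sub(req)
	}
	return true
}
```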
A node whose labels don't contain the required topologyKeys in `Constraints`
cannot be made schedulable by preempting the pods on that node.
One use case that easily reproduces the issue:
- set `alwaysCheckAllPredicates` to true.
- one node contains all the required topologyKeys but fails other predicates,
such as 'taint'.
- another node doesn't have all the required topologyKeys, and thus returns an
`Unschedulable` status code.
- the scheduler will then try to preempt the lower-priority pods on that node.
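A self-contained sketch of the distinction, using simplified local types instead of the real framework API: a node whose labels lack a required topology key is reported as unresolvable, so the preemption logic skips it.

```go
package topologyspread

type status int

const (
	schedulable status = iota
	unschedulable
	unschedulableAndUnresolvable
)

// checkTopologyKeys is a simplified stand-in for the plugin's filter
// logic: a node that lacks one of the required topology keys cannot
// satisfy the constraint even if every pod on it were preempted, so it
// is marked unresolvable and skipped by preemption.
func checkTopologyKeys(nodeLabels map[string]string, requiredKeys []string) status {
	for _, key := range requiredKeys {
		if _, ok := nodeLabels[key]; !ok {
			return unschedulableAndUnresolvable
		}
	}
	return schedulable
}
```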
Signed-off-by: Dave Chen <dave.chen@arm.com>
If a reserve plugin's Reserve method returns an error, there could be
resources previously allocated by successfully completed reserve
plugins that must be released by the corresponding Unreserve
operation. Since Unreserve operations are idempotent, this patch runs
the Unreserve operation of ALL reserve plugins when a Reserve operation
fails.
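A self-contained sketch of this behavior, using a hypothetical plugin interface rather than the real scheduler framework types:

```go
package reserve

// reservePlugin is a hypothetical stand-in for the framework's reserve
// plugin interface.
type reservePlugin interface {
	Reserve() error
	// Unreserve must be idempotent: calling it for a plugin whose
	// Reserve never ran (or already failed) is a no-op.
	Unreserve()
}

// runReservePlugins runs Reserve on every plugin; on the first failure
// it runs Unreserve on ALL plugins (not just the ones that succeeded)
// and returns the error.
func runReservePlugins(plugins []reservePlugin) error {
	for _, p := range plugins {
		if err := p.Reserve(); err != nil {
			for _, q := range plugins {
				q.Unreserve()
			}
			return err
		}
	}
	return nil
}
```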