Currently, in interpodaffinty plugin, it only processes all nodes when the incoming
pod with affinity. Actually, it only cares about all nodes when the incoming pod
with preferred affinity. Then it will reduces the number of nodes need to be
processed.
This commit transforms the next() function of the scheduler node
tree into a listNodes() function that directly returns a list of
nodes, going through the zones in a round-robin fashion. This
removes the flawed logic of the next() function.
The apiserver is expected to send pod deletion events that might arrive at a different time. However, sometimes a node could be recreated without its pods being deleted.
Partial revert of https://github.com/kubernetes/kubernetes/pull/86964
Signed-off-by: Aldo Culquicondor <acondor@google.com>
Change-Id: I51f683e5f05689b711c81ebff34e7118b5337571
Only wait for finished binding or error, but not both
Signed-off-by: Aldo Culquicondor <acondor@google.com>
Change-Id: I13d16e6c7c45c6527591aa05cc79fc5e96d47a68
When check the incoming pod's anti-affinity rules, there is change to
return early when there is no any matched anti-affinity terms in the
whole cluster.
The lack of this validation on incoming pods causes unpredictable cluster outcomes
when later calculating affinity results against existing pods (see #92714). This fix
quickly addresses the main source where these problems should be caught.
It is unfortunately difficult to add this validation directly to the API server due
to the fact that it may break migrations with existing pods that fail this check. This
is a compromise to address the current issue.
- Move "ForgetPod" after "RunReservePluginsUnreserve", so that the cache would hold the pod to
avoid it's being retried simutaneously until Unreserve is completed.
- Move "assume" ahead of "RunReservePluginsReserve". This is based on the fact that "ForgetPod" is
the last step of failure path, so "assume" should be reversly treated as the first step. The
current failure path is like this:
assume -> reserve -> unreserve -> forgetPod -> recordingFailure
- Make subtests of TestReservePluginUnreserve stateless
When create fake data for the nodeTree unittests, The 'append' is invoked
on the common fake data set. That makes the unittests is running with unexpected
fake data after that.
When nodes are added in multiple zones at once, the nodeTree next
function does not return a correct list of nodes but repeats some
This commit resets the index before starting to call next() to
prevent this issue
Special thanks to igraecao for the help in finding the bug
Co-authored-by: igraecao <matvej.yolli@outlook.com>