addUnschedulablePodBackToBackoffQ happened to put the pod into the backoff
queue because
- the pod was not popped earlier and thus not in flight
- the PodInfo had UnschedulablePlugins set
- determineSchedulingHintForInFlightPod treats "UnschedulablePlugins set but
pod not in flight" as an internal error and falls back to backoff
Relying on such special-case handling is fragile. A better way to force
backoff is to record a concurrent event: isPodWorthRequeuing then calls the
queueHintReturnQueueAfterBackoff function and the pod goes to the backoff
queue.
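A minimal sketch of that decision, with simplified stand-in types instead of
the real kube-scheduler structures (the event name below is just an example):

    package main

    import "fmt"

    // QueueingHint and the helpers below are simplified stand-ins, not the
    // real kube-scheduler types.
    type QueueingHint int

    const (
        QueueSkip QueueingHint = iota
        QueueAfterBackoff
    )

    type podInfo struct {
        name             string
        concurrentEvents []string // events recorded while the pod was in flight
    }

    // decideRequeue mirrors the intent of isPodWorthRequeuing in this
    // scenario: a concurrent event observed while the pod was in flight
    // forces the backoff queue, without relying on the "internal error"
    // fallback path.
    func decideRequeue(p *podInfo) QueueingHint {
        if len(p.concurrentEvents) > 0 {
            return QueueAfterBackoff
        }
        return QueueSkip
    }

    func main() {
        p := &podInfo{name: "test-pod"}
        p.concurrentEvents = append(p.concurrentEvents, "AssignedPodDelete")
        fmt.Println(decideRequeue(p) == QueueAfterBackoff) // true
    }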
This fixes a test flake:
[sig-node] DRA [Feature:DynamicResourceAllocation] multiple nodes reallocation [It] works
/nvme/gopath/src/k8s.io/kubernetes/test/e2e/dra/dra.go:552
[FAILED] number of deallocations
Expected
<int64>: 2
to equal
<int64>: 1
In [It] at: /nvme/gopath/src/k8s.io/kubernetes/test/e2e/dra/dra.go:651 @ 09/05/23 14:01:54.652
This can be reproduced locally with
stress -p 10 go test ./test/e2e -args -ginkgo.focus=DynamicResourceAllocation.*reallocation.works -ginkgo.no-color -v=4 -ginkgo.v
Log output showed that the sequence of events leading to this was:
- claim gets allocated because of selected node
- a different node has to be used, so PostFilter sets
claim.status.deallocationRequested
- the driver deallocates
- before the scheduler can react and select a different node,
the driver allocates *again* for the original node
- the scheduler asks for deallocation again
- the driver deallocates again (causing the test failure)
- eventually the pod runs
The fix is to disable further allocations first by removing the selected node
and only then to start deallocation.
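Roughly, the new ordering looks like this; the types are simplified stand-ins
for PodSchedulingContext and ResourceClaim, not the real resource.k8s.io
client structs:

    package main

    import "fmt"

    // Simplified stand-ins for the DRA objects involved.
    type podSchedulingContext struct {
        selectedNode string
    }

    type resourceClaim struct {
        allocated             bool
        deallocationRequested bool
    }

    // requestReallocation sketches the ordering fix: first drop the selected
    // node so the driver stops allocating for it, then ask for deallocation.
    // The old order let the driver allocate again for the original node right
    // after it had deallocated, which triggered a second deallocation.
    func requestReallocation(sc *podSchedulingContext, claim *resourceClaim) {
        sc.selectedNode = "" // step 1: disable further allocations
        if claim.allocated {
            claim.deallocationRequested = true // step 2: trigger deallocation
        }
    }

    func main() {
        sc := &podSchedulingContext{selectedNode: "node-a"}
        claim := &resourceClaim{allocated: true}
        requestReallocation(sc, claim)
        fmt.Println(sc.selectedNode == "", claim.deallocationRequested) // true true
    }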
When some plugin was registered as "unschedulable" in a previous scheduling
attempt, the pod kept that attribute forever. When that plugin later failed
with an error that requires backoff, the pod was incorrectly moved to the
"unschedulable" queue, where it got stuck until the periodic flushing because
there was no event that the plugin was waiting for.
Here's an example where that happened:
framework.go:1280: E0831 20:03:47.184243] Reserve/DynamicResources: Plugin failed err="Operation cannot be fulfilled on podschedulingcontexts.resource.k8s.io \"test-dragxd5c\": the object has been modified; please apply your changes to the latest version and try again" node="scheduler-perf-dra-7l2v2" plugin="DynamicResources" pod="test/test-dragxd5c"
schedule_one.go:1001: E0831 20:03:47.184345] Error scheduling pod; retrying err="running Reserve plugin \"DynamicResources\": Operation cannot be fulfilled on podschedulingcontexts.resource.k8s.io \"test-dragxd5c\": the object has been modified; please apply your changes to the latest version and try again" pod="test/test-dragxd5c"
...
scheduling_queue.go:745: I0831 20:03:47.198968] Pod moved to an internal scheduling queue pod="test/test-dragxd5c" event="ScheduleAttemptFailure" queue="Unschedulable" schedulingCycle=9576 hint="QueueSkip"
Pop still needs the information about unschedulable plugins to update the
UnschedulableReason metric. It can reset that information before returning the
PodInfo for the next scheduling attempt.
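A sketch of that Pop behavior, with simplified types rather than the actual
PriorityQueue implementation:

    package main

    import "fmt"

    // Simplified stand-ins, not the real PriorityQueue and QueuedPodInfo.
    type queuedPodInfo struct {
        podName              string
        unschedulablePlugins map[string]struct{}
    }

    type queue struct {
        items []*queuedPodInfo
    }

    // pop hands out the next pod for scheduling. It records which plugins
    // made the pod unschedulable in the previous attempt (for the metric) and
    // then resets that set, so a later error-based retry is not mistaken for
    // "still waiting for an event from those plugins".
    func (q *queue) pop(recordMetric func(map[string]struct{})) *queuedPodInfo {
        if len(q.items) == 0 {
            return nil
        }
        pInfo := q.items[0]
        q.items = q.items[1:]
        recordMetric(pInfo.unschedulablePlugins)
        pInfo.unschedulablePlugins = nil // start the next attempt clean
        return pInfo
    }

    func main() {
        q := &queue{items: []*queuedPodInfo{{
            podName:              "test-dragxd5c",
            unschedulablePlugins: map[string]struct{}{"DynamicResources": {}},
        }}}
        p := q.pop(func(plugins map[string]struct{}) {
            fmt.Println("unschedulable plugins in previous attempt:", len(plugins))
        })
        fmt.Println(p.unschedulablePlugins == nil) // true
    }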
Instead of modifying the PodSchedulingContext and then creating or updating
it, the required changes (selected node, potential nodes) are now tracked,
and the actual input for an API call is constructed at the end, if (and only
if) it is needed.
This makes the code easier to read and change. In particular, replacing the
Update call with Patch or Apply becomes easy.
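A sketch of the tracking pattern, using illustrative types instead of the
real PodSchedulingContext API:

    package main

    import "fmt"

    // Illustrative types; the real code tracks changes to a
    // PodSchedulingContext object from the resource.k8s.io API.
    type schedulingCtx struct {
        selectedNode   string
        potentialNodes []string
    }

    // podSchedulingState records the desired changes during one scheduling
    // cycle. A nil pointer means "no change requested" for that field.
    type podSchedulingState struct {
        selectedNode   *string
        potentialNodes *[]string
    }

    // publish builds the input for an API call only if something changed, so
    // the write (Update today, Patch or Apply later) happens in one place.
    func (s *podSchedulingState) publish(current *schedulingCtx, update func(*schedulingCtx)) {
        if s.selectedNode == nil && s.potentialNodes == nil {
            return // nothing changed, no API call
        }
        desired := *current // copy, then apply the tracked changes
        if s.selectedNode != nil {
            desired.selectedNode = *s.selectedNode
        }
        if s.potentialNodes != nil {
            desired.potentialNodes = *s.potentialNodes
        }
        update(&desired)
    }

    func main() {
        state := &podSchedulingState{}
        node := "node-a"
        state.selectedNode = &node
        state.publish(&schedulingCtx{}, func(sc *schedulingCtx) {
            fmt.Println("update call with selected node:", sc.selectedNode)
        })
    }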
When filtering fails because a ResourceClass is missing, we can treat the pod
as "unschedulable" as long as we then also register a cluster event that wakes
up the pod. This is more efficient than periodically retrying.
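Conceptually, the registration looks like this; the types are simplified
stand-ins for the framework's cluster events, roughly what the plugin's
EventsToRegister returns:

    package main

    import "fmt"

    // Simplified stand-in for a cluster event registration.
    type clusterEvent struct {
        resource string
        action   string
    }

    // eventsToRegister sketches why "unschedulable because the ResourceClass
    // is missing" is safe: creating (or updating) a ResourceClass is an event
    // that requeues the pod, so no periodic retry is needed.
    func eventsToRegister() []clusterEvent {
        return []clusterEvent{
            {resource: "ResourceClass", action: "Add|Update"},
            {resource: "ResourceClaim", action: "Add|Update"},
        }
    }

    func main() {
        for _, ev := range eventsToRegister() {
            fmt.Printf("requeue pods on %s %s\n", ev.resource, ev.action)
        }
    }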
The problematic scenario was having one pod in flight, one event in the list,
and then detecting a concurrent event for a second pod after the first pod is
done. The new test case covers that.
To make it work without assumptions about the implementation, the QueuedPodInfo
returned by Pop must be the one passed to AddUnschedulableIfNotPresent
after (potentially) populating UnschedulablePlugins. This is done via callback
functions which bind to the same shared variable.
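A condensed sketch of that callback pattern; the queue calls are faked and
the names are placeholders:

    package main

    import "fmt"

    // Simplified stand-in for the scheduler's QueuedPodInfo.
    type queuedPodInfo struct {
        podName              string
        unschedulablePlugins map[string]struct{}
    }

    // The test builds callbacks that close over the same variable, so
    // whatever Pop returned is exactly the object handed back to
    // AddUnschedulableIfNotPresent, without assumptions about how the queue
    // constructed it.
    func main() {
        var pInfo *queuedPodInfo

        pop := func() {
            // In the real test this is the queue's Pop(); faked here.
            pInfo = &queuedPodInfo{podName: "pod-a"}
        }
        readd := func() {
            // Populate the plugins and pass the very same object back.
            pInfo.unschedulablePlugins = map[string]struct{}{"fake-plugin": {}}
            fmt.Println("re-adding", pInfo.podName, "plugins:", len(pInfo.unschedulablePlugins))
        }

        pop()
        readd()
    }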
The previous approach was based on the assumption that an in-flight pod can
use the head of the received event list as a marker for identifying all
events that occur while the pod is in flight. That assumption is incorrect:
when that existing element gets removed from the list because all pods that
were in flight when it was received are done, the marker's Next method
returns nil, and code that should have seen several concurrent events (if
there were any) missed all of them.
As a result, a pod with concurrent events could incorrectly get moved to the
unschedulable queue, where it could get stuck until the next periodic purging
after 5 minutes if there was no other event for it.
The approach of maintaining a single list of concurrent events can be fixed
by inserting each in-flight pod into the list and using that element to
identify "more recent" events for the pod.
Because of a misplaced `append` (it should have been inside the if clause,
not after it), a handler from a previous loop iteration was added again. This
was harmless because the resulting slice was only used for waiting for cache
sync, but it should be fixed anyway.
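A generic illustration of the bug shape and the fix (not the actual handler
registration code):

    package main

    import "fmt"

    type handler struct{ name string }

    func main() {
        resources := []string{"claims", "classes", "pods"}
        needsHandler := map[string]bool{"claims": true, "classes": true}

        // Buggy shape: the append sits after the if block, so for "pods" the
        // handler created in the previous iteration gets appended again.
        var buggy []handler
        var h handler
        for _, r := range resources {
            if needsHandler[r] {
                h = handler{name: r}
            }
            buggy = append(buggy, h) // should be inside the if clause
        }

        // Fixed shape: append only handlers created in this iteration.
        var fixed []handler
        for _, r := range resources {
            if needsHandler[r] {
                fixed = append(fixed, handler{name: r})
            }
        }

        fmt.Println(len(buggy), len(fixed)) // 3 2
    }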
PVCs and containers shared the same ResourceRequirements struct to define
their APIs. When resource claims were added, that struct got extended, which
accidentally also changed the PVC API. To avoid such a mistake from happening
again, PVC now uses its own VolumeResourceRequirements struct.
The `Claims` field gets removed because the risk of breaking anyone is low:
theoretically, YAML files which have a claims field for volumes now get
rejected when validated against the OpenAPI schema. Such files have never
made sense and should be fixed.
Code that uses the struct definitions needs to be updated.
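An abbreviated sketch of the resulting split, not the full Kubernetes type
definitions:

    package main

    import "fmt"

    type resourceList map[string]string

    type resourceClaimRef struct{ name string }

    // resourceRequirements stays as the container API, including Claims.
    type resourceRequirements struct {
        limits   resourceList
        requests resourceList
        claims   []resourceClaimRef
    }

    // volumeResourceRequirements is the PVC-only variant: no Claims field,
    // so future container-side extensions no longer leak into the PVC API.
    type volumeResourceRequirements struct {
        limits   resourceList
        requests resourceList
    }

    func main() {
        pvc := volumeResourceRequirements{requests: resourceList{"storage": "1Gi"}}
        ctr := resourceRequirements{claims: []resourceClaimRef{{name: "gpu"}}}
        fmt.Println(len(pvc.requests), len(ctr.claims))
    }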