Some unit tests currently fail on Windows for several reasons:
- time.Now() is less precise on Windows, so two consecutive calls may return
  the same timestamp.
- Different "File not found" error messages on Windows.
- The default Container Runtime URL scheme on Windows is npipe, not unix.
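The three issues above can usually be handled in a platform-neutral way. The following is only a minimal sketch: the helper names isNotFound, defaultRuntimeScheme and notBefore are hypothetical and are not taken from the affected tests.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"runtime"
	"time"
)

// isNotFound (hypothetical helper) checks for a missing file through the
// error value instead of comparing the OS-specific error message.
func isNotFound(err error) bool {
	return errors.Is(err, fs.ErrNotExist)
}

// defaultRuntimeScheme (hypothetical helper) returns the URL scheme a test
// should expect for the given GOOS value.
func defaultRuntimeScheme(goos string) string {
	if goos == "windows" {
		return "npipe"
	}
	return "unix"
}

// notBefore (hypothetical helper) reports whether b is at or after a. Unlike
// a strict b.After(a), it tolerates two identical timestamps.
func notBefore(a, b time.Time) bool {
	return !b.Before(a)
}

func main() {
	dir, err := os.MkdirTemp("", "example")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// The directory was just created, so this file cannot exist.
	_, err = os.Stat(filepath.Join(dir, "missing"))
	fmt.Println("not found:", isNotFound(err))

	fmt.Println("scheme:", defaultRuntimeScheme(runtime.GOOS))

	t1 := time.Now()
	t2 := time.Now()
	fmt.Println("t2 not before t1:", notBefore(t1, t2))
}
```

A test that asserts t2.After(t1) can fail when both calls return the same timestamp, whereas the non-strict comparison holds either way.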
addUnschedulablePodBackToBackoffQ happened to put the pod into the backoff
queue because:
- the pod was not popped earlier and thus was not in flight
- the PodInfo had UnschedulablePlugins set
- determineSchedulingHintForInFlightPod has code for "if UnschedulablePlugins
  is set and pod not in flight -> internal error, use backoff"

Relying on such special-case code is fragile. A better way to force backoff is
to record a concurrent event. isPodWorthRequeuing then calls the
queueHintReturnQueueAfterBackoff function and the pod goes to the backoff
queue.
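The idea of forcing backoff through a concurrent event can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the types event and queueingHint and the function pickQueue are hypothetical, and queueHintReturnQueueAfterBackoff is reduced to a stand-in with a simplified signature. None of this is the scheduler's real code.

```go
package main

import "fmt"

// queueingHint is a simplified stand-in for the scheduler's queueing hint.
type queueingHint int

const (
	hintSkip queueingHint = iota
	hintQueue
)

// event is a hypothetical, reduced cluster event.
type event struct{ name string }

// hintFn decides whether an event makes a pod worth requeuing.
type hintFn func(ev event) queueingHint

// queueHintReturnQueueAfterBackoff is a stand-in with a simplified signature:
// it asks for a requeue for every event it sees.
func queueHintReturnQueueAfterBackoff(event) queueingHint {
	return hintQueue
}

// pickQueue (hypothetical) models the decision for a pod that failed a
// scheduling attempt: if any event recorded during the attempt makes the pod
// worth requeuing, it goes to the backoff queue, otherwise to the
// unschedulable queue.
func pickQueue(concurrentEvents []event, hint hintFn) string {
	for _, ev := range concurrentEvents {
		if hint(ev) == hintQueue {
			return "backoff"
		}
	}
	return "unschedulable"
}

func main() {
	// No concurrent event: nothing makes the pod worth requeuing.
	fmt.Println(pickQueue(nil, queueHintReturnQueueAfterBackoff))

	// Recording one concurrent event is enough to force backoff.
	events := []event{{name: "NodeAdd"}}
	fmt.Println(pickQueue(events, queueHintReturnQueueAfterBackoff))
}
```

With this approach the test no longer depends on the internal-error fallback for pods that are not in flight.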
Once a plugin was registered as "unschedulable" for a pod in a previous
scheduling attempt, it kept that attribute for the pod forever. When that plugin
later failed with an error that requires backoff, the pod was incorrectly moved
to the "unschedulable" queue, where it got stuck until the periodic flushing
because there was no event that the plugin was waiting for.
Here's an example where that happened:

    framework.go:1280: E0831 20:03:47.184243] Reserve/DynamicResources: Plugin failed err="Operation cannot be fulfilled on podschedulingcontexts.resource.k8s.io \"test-dragxd5c\": the object has been modified; please apply your changes to the latest version and try again" node="scheduler-perf-dra-7l2v2" plugin="DynamicResources" pod="test/test-dragxd5c"
    schedule_one.go:1001: E0831 20:03:47.184345] Error scheduling pod; retrying err="running Reserve plugin \"DynamicResources\": Operation cannot be fulfilled on podschedulingcontexts.resource.k8s.io \"test-dragxd5c\": the object has been modified; please apply your changes to the latest version and try again" pod="test/test-dragxd5c"
    ...
    scheduling_queue.go:745: I0831 20:03:47.198968] Pod moved to an internal scheduling queue pod="test/test-dragxd5c" event="ScheduleAttemptFailure" queue="Unschedulable" schedulingCycle=9576 hint="QueueSkip"

Pop still needs the information about unschedulable plugins to update the
UnschedulableReason metric. It can reset that information before returning the
PodInfo for the next scheduling attempt.
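The reset in Pop can be sketched with a toy queue. This is a minimal illustration, not the scheduler's implementation: podInfo, toyQueue and metricUpdates are hypothetical names, and how the UnschedulableReason metric is actually updated is not modelled beyond recording which plugins were seen.

```go
package main

import "fmt"

// podInfo is a hypothetical, reduced version of the queued pod information.
type podInfo struct {
	name                 string
	unschedulablePlugins map[string]struct{}
}

// toyQueue is a hypothetical FIFO queue. metricUpdates only records the
// plugin names for which the metric would be updated.
type toyQueue struct {
	pods          []*podInfo
	metricUpdates []string
}

// pop returns the next pod. It first uses the unschedulable plugins from the
// previous attempt for the metric update, then clears them so that the next
// scheduling attempt starts without stale plugin information.
func (q *toyQueue) pop() *podInfo {
	if len(q.pods) == 0 {
		return nil
	}
	p := q.pods[0]
	q.pods = q.pods[1:]

	for plugin := range p.unschedulablePlugins {
		q.metricUpdates = append(q.metricUpdates, plugin)
	}
	p.unschedulablePlugins = nil // reset before the next attempt

	return p
}

func main() {
	q := &toyQueue{pods: []*podInfo{{
		name:                 "test/test-dragxd5c",
		unschedulablePlugins: map[string]struct{}{"DynamicResources": {}},
	}}}

	p := q.pop()
	fmt.Println("metric updates:", q.metricUpdates)
	fmt.Println("plugins left on pod:", len(p.unschedulablePlugins))
}
```

Because the set is cleared in pop, a later failure that requires backoff is judged only on what happened in that attempt.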