When a plugin was registered as "unschedulable" in a previous scheduling
attempt, the pod kept that attribute forever. When the same plugin later
failed with an error that requires backoff, the pod was incorrectly moved to the
"unschedulable" queue, where it got stuck until the periodic flush because
there was no event that the plugin was waiting for.
Here's an example where that happened:
framework.go:1280: E0831 20:03:47.184243] Reserve/DynamicResources: Plugin failed err="Operation cannot be fulfilled on podschedulingcontexts.resource.k8s.io \"test-dragxd5c\": the object has been modified; please apply your changes to the latest version and try again" node="scheduler-perf-dra-7l2v2" plugin="DynamicResources" pod="test/test-dragxd5c"
schedule_one.go:1001: E0831 20:03:47.184345] Error scheduling pod; retrying err="running Reserve plugin \"DynamicResources\": Operation cannot be fulfilled on podschedulingcontexts.resource.k8s.io \"test-dragxd5c\": the object has been modified; please apply your changes to the latest version and try again" pod="test/test-dragxd5c"
...
scheduling_queue.go:745: I0831 20:03:47.198968] Pod moved to an internal scheduling queue pod="test/test-dragxd5c" event="ScheduleAttemptFailure" queue="Unschedulable" schedulingCycle=9576 hint="QueueSkip"
Pop still needs the information about unschedulable plugins to update the
UnschedulableReason metric. It can reset that information before returning the
PodInfo for the next scheduling attempt.
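A minimal sketch of that ordering, assuming a much simplified queue: the types `podInfo` and `queue` and the gauge-like map `unschedulableReason` are hypothetical stand-ins, not the scheduler's actual data structures, and the real metric handling may differ.

```go
package main

// podInfo is a hypothetical stand-in for the scheduler's queued pod info.
type podInfo struct {
	name string
	// unschedulablePlugins holds the plugins that rejected the pod in the
	// previous scheduling attempt.
	unschedulablePlugins map[string]struct{}
}

// queue is a hypothetical, simplified scheduling queue.
type queue struct {
	active []*podInfo
	// unschedulableReason models the metric as a per-plugin gauge
	// (an assumption made for this sketch).
	unschedulableReason map[string]int
}

// pop returns the next pod, or nil if the queue is empty. It first uses the
// plugins recorded by the previous attempt to update the metric and only
// then clears them, so that stale entries cannot influence where the pod
// gets queued after the next attempt.
func (q *queue) pop() *podInfo {
	if len(q.active) == 0 {
		return nil
	}
	p := q.active[0]
	q.active = q.active[1:]
	for plugin := range p.unschedulablePlugins {
		q.unschedulableReason[plugin]--
	}
	p.unschedulablePlugins = nil
	return p
}
```

With this ordering, a later failure that requires backoff starts from an empty plugin set instead of inheriting the one from an older attempt.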
The previous approach was based on the assumption that an in-flight pod can use
the head of the received event list as marker for identifying all events that
occur while the pod is in flight. That assumption is incorrect: when that
existing element gets removed from the list because all pods that were
in-flight when it was received are done, that marker's Next method returns nil
and the code which should have seen several concurrent events (if there were
any) missed all of those.
As a result, a pod with concurrent events could incorrectly get moved to the
unschedulable queue where it could get stuck until the next periodic purging
after 5 minutes if there was no other event for it.
The approach with maintaining a single list of concurrent events can be fixed
by inserting each in-flight pod into the list and using that element to
identify "more recent" events for the pod.
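The single-list approach can be sketched as follows. This is an illustration under assumptions, not the scheduler's code: `tracker`, `podMarker`, `event` and all method names are hypothetical, and locking is omitted.

```go
package main

import "container/list"

// podMarker marks the position at which a pod became in-flight.
type podMarker struct{ pod string }

// event is a cluster event received while at least one pod was in flight.
type event struct{ label string }

// tracker keeps pod markers and events in one list, in arrival order.
type tracker struct {
	entries  *list.List
	inFlight map[string]*list.Element // pod name -> its own marker
}

func newTracker() *tracker {
	return &tracker{entries: list.New(), inFlight: map[string]*list.Element{}}
}

// start inserts a marker for the pod at the tail of the list.
func (t *tracker) start(pod string) {
	t.inFlight[pod] = t.entries.PushBack(podMarker{pod: pod})
}

// record appends an event, but only if some pod is in flight.
func (t *tracker) record(label string) {
	if len(t.inFlight) > 0 {
		t.entries.PushBack(event{label: label})
	}
}

// eventsSince returns all events received after the pod's own marker.
// The marker stays in the list until done is called, so walking forward
// from it cannot miss events.
func (t *tracker) eventsSince(pod string) []string {
	marker, ok := t.inFlight[pod]
	if !ok {
		return nil
	}
	var labels []string
	for e := marker.Next(); e != nil; e = e.Next() {
		if ev, isEvent := e.Value.(event); isEvent {
			labels = append(labels, ev.label)
		}
	}
	return labels
}

// done removes the pod's marker and drops leading events that are older
// than every remaining in-flight pod.
func (t *tracker) done(pod string) {
	marker, ok := t.inFlight[pod]
	if !ok {
		return
	}
	t.entries.Remove(marker)
	delete(t.inFlight, pod)
	for e := t.entries.Front(); e != nil; {
		if _, isMarker := e.Value.(podMarker); isMarker {
			break
		}
		next := e.Next() // capture before Remove invalidates e
		t.entries.Remove(e)
		e = next
	}
}
```

Because each pod walks forward from its own element, removing older entries for other pods no longer affects which events it sees.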
* feature(scheduling_queue): track events per Pod
* fix typos
* record events in one slice and make each in-flight Pod refer to it
* fix: use Pop() in test before AddUnschedulableIfNotPresent to register in-flight Pods
* eliminate MakeNextPodFuncs
* call Done inside the scheduling queue
* fix comment
* implement done() not to require lock in it
* fix UTs
* improve the receivedEvents implementation based on suggestions
* call DonePod when we don't call AddUnschedulableIfNotPresent
* fix UT
* use queuehint to filter out events for in-flight Pods
* fix based on suggestion from aldo
* fix based on suggestion from Wei
* rename lastEventBefore → previousEvent
* fix based on suggestion
* address comments from aldo
* fix based on the suggestion from Abdullah
* gate in-flight Pods logic by the SchedulingQueueHints feature gate
Generating the name avoids all potential name collisions. It's not clear how
much of a problem that was because users can avoid them and the deterministic
names for generic ephemeral volumes have not led to reports from users. But
using generated names is not too hard either.
What makes it relatively easy is that the new pod.status.resourceClaimStatus
map stores the generated name for kubelet and node authorizer, i.e. the
information in the pod is sufficient to determine the name of the
ResourceClaim.
The resource claim controller becomes a bit more complex and now needs
permission to modify the pod status. The new failure scenario of "ResourceClaim
created, updating pod status fails" is handled with the help of a new special
"resource.kubernetes.io/pod-claim-name" annotation that, together with the owner
reference, identifies exactly which pod claim a ResourceClaim was generated for,
so updating the pod status can be retried for existing ResourceClaims.
The transition from deterministic names is handled with a special case for that
recovery code path: a ResourceClaim with no annotation and a name that follows
the Kubernetes <= 1.27 naming pattern is assumed to be generated for that pod
claim and gets added to the pod status.
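The recovery check described above might look roughly like this. The `claim` type and `generatedFor` function are hypothetical, the owner reference is reduced to a pod UID, and the legacy pattern `<pod name>-<pod claim name>` is an assumption about the Kubernetes <= 1.27 naming scheme.

```go
package main

// podClaimNameAnnotation is the annotation named in the text above.
const podClaimNameAnnotation = "resource.kubernetes.io/pod-claim-name"

// claim is a hypothetical, reduced view of a ResourceClaim.
type claim struct {
	name        string
	ownerPodUID string // stand-in for the owner reference
	annotations map[string]string
}

// generatedFor reports whether c was generated for the given pod claim of
// the given pod.
func generatedFor(c claim, podName, podUID, podClaimName string) bool {
	// The owner reference must always point to the pod.
	if c.ownerPodUID != podUID {
		return false
	}
	// New-style claims carry the annotation with the pod claim name.
	if v, ok := c.annotations[podClaimNameAnnotation]; ok {
		return v == podClaimName
	}
	// No annotation: accept only the assumed legacy deterministic name.
	return c.name == podName+"-"+podClaimName
}
```

A controller that finds a matching claim can then retry the pod status update instead of creating a second ResourceClaim.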
There's no immediate need for it, but in case it becomes relevant later,
the name of the generated ResourceClaim may also be left unset to record that
no claim was needed. Components processing such a pod can skip whatever they
would normally do for the claim. To ensure that they do, and to also cover the
other cases properly ("no known field is set", "must check ownership"),
resourceclaim.Name gets extended.
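One way such an extended lookup could cover these cases is sketched below. The types, the function `claimNameFor` and both error values are hypothetical and only loosely modeled on the description; this is not the actual signature of resourceclaim.Name.

```go
package main

import "errors"

// podClaim is a hypothetical, reduced pod-level claim entry.
type podClaim struct {
	name         string
	claimName    *string // references an existing ResourceClaim directly
	templateName *string // a ResourceClaim gets generated from a template
}

// pod is a hypothetical, reduced pod. claimStatus maps a pod claim name to
// the generated ResourceClaim name; a nil value records "no claim needed".
type pod struct {
	claimStatus map[string]*string
}

var (
	errNoKnownField = errors.New("pod claim has no known field set")
	errNotCreated   = errors.New("ResourceClaim has not been generated yet")
)

// claimNameFor returns the ResourceClaim name for a pod claim.
// A nil name without error means no claim is needed and callers can skip
// the pod claim. mustCheckOwner is true for generated claims, where the
// caller has to verify that the claim really belongs to the pod.
func claimNameFor(p *pod, pc *podClaim) (name *string, mustCheckOwner bool, err error) {
	switch {
	case pc.claimName != nil:
		return pc.claimName, false, nil
	case pc.templateName != nil:
		generated, ok := p.claimStatus[pc.name]
		if !ok {
			return nil, false, errNotCreated
		}
		return generated, true, nil
	default:
		return nil, false, errNoKnownField
	}
}
```

Returning the three results together forces each caller to decide explicitly how to handle a missing name, an ownership check and an unknown claim source.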
1. Scheduler bug-fix + scheduler-focussed E2E tests
2. Add cgroup v2 support for in-place pod resize
3. Enable full E2E pod resize test for containerd>=1.6.9 and EventedPLEG related changes.
Co-Authored-By: Vinay Kulkarni <vskibum@gmail.com>