* feature(scheduling_queue): track events per Pod
* fix typos
* record events in one slice and make each in-flight Pod refer to it
* fix: use Pop() in test before AddUnschedulableIfNotPresent to register in-flight Pods
* eliminate MakeNextPodFuncs
* call Done inside the scheduling queue
* fix comment
* implement done() so that it does not require the lock
* fix UTs
* improve the receivedEvents implementation based on suggestions
* call DonePod when we don't call AddUnschedulableIfNotPresent
* fix UT
* use queuehint to filter out events for in-flight Pods
* fix based on suggestion from aldo
* fix based on suggestion from Wei
* rename lastEventBefore → previousEvent
* fix based on suggestion
* address comments from aldo
* fix based on the suggestion from Abdullah
* gate in-flight Pods logic by the SchedulingQueueHints feature gate
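The idea of recording events in one slice that every in-flight Pod refers to can be sketched roughly as follows. This is a simplified illustration, not the scheduling queue's actual code: `eventLog` and its methods `pop`, `record`, `eventsSince` and `done` are hypothetical names, and events are reduced to plain strings.

```go
package main

import "fmt"

// eventLog is a hypothetical, simplified stand-in for the queue's bookkeeping.
// Cluster events go into one shared slice; each in-flight Pod stores only the
// index at which it started instead of keeping its own copy of the events.
type eventLog struct {
	events   []string       // cluster events, oldest first
	inFlight map[string]int // pod name -> len(events) when the Pod was popped
}

func newEventLog() *eventLog {
	return &eventLog{inFlight: map[string]int{}}
}

// pop marks a Pod as in-flight and remembers where in the log it started.
func (l *eventLog) pop(pod string) { l.inFlight[pod] = len(l.events) }

// record appends an event, but only while at least one Pod is in flight.
func (l *eventLog) record(event string) {
	if len(l.inFlight) > 0 {
		l.events = append(l.events, event)
	}
}

// eventsSince returns the events that arrived while the Pod was in flight.
func (l *eventLog) eventsSince(pod string) []string {
	start, ok := l.inFlight[pod]
	if !ok {
		return nil
	}
	return l.events[start:]
}

// done forgets the Pod and drops the log once no Pod refers to it anymore.
func (l *eventLog) done(pod string) {
	delete(l.inFlight, pod)
	if len(l.inFlight) == 0 {
		l.events = nil
	}
}

func main() {
	l := newEventLog()
	l.pop("pod-a")
	l.record("NodeAdd")
	l.pop("pod-b")
	l.record("PodDelete")
	fmt.Println(l.eventsSince("pod-a"), l.eventsSince("pod-b"))
	l.done("pod-a")
	l.done("pod-b")
}
```

In this sketch a Pod that fails scheduling would be checked against `eventsSince` before being requeued, and `done` has to be called on every path that does not requeue the Pod, otherwise the log never shrinks.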
This is a combination of two related enhancements:
- By implementing a PreEnqueue check, the initial pod scheduling
attempt for a pod with a claim template gets avoided when the claim
does not exist yet.
- By implementing cluster event checks, only those pods get
scheduled for which something changed, and they get scheduled
immediately without delay.
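A rough sketch of the two checks, under simplifying assumptions: the types `pod` and `claimStore` and the functions `preEnqueue` and `isRelevantEvent` are hypothetical, and claims are identified by a made-up "<pod name>/<pod claim name>" key rather than by real API objects.

```go
package main

import "fmt"

// pod is a hypothetical, simplified Pod; the real scheduler framework differs.
type pod struct {
	name           string
	claimTemplates []string // names of pod claims backed by a claim template
}

// claimStore stands in for a lister of existing ResourceClaims, keyed by
// "<pod name>/<pod claim name>" (an assumption made for this sketch).
type claimStore map[string]bool

// preEnqueue reports whether the Pod may enter the active queue. It refuses
// a Pod whose generated claims do not all exist yet, which avoids a first
// scheduling attempt that could only fail.
func preEnqueue(p pod, claims claimStore) error {
	for _, c := range p.claimTemplates {
		if !claims[p.name+"/"+c] {
			return fmt.Errorf("waiting for ResourceClaim for pod claim %q", c)
		}
	}
	return nil
}

// isRelevantEvent is the cluster event side: a newly created claim only
// matters to the Pod it was generated for.
func isRelevantEvent(p pod, createdClaimKey string) bool {
	for _, c := range p.claimTemplates {
		if p.name+"/"+c == createdClaimKey {
			return true
		}
	}
	return false
}

func main() {
	p := pod{name: "web", claimTemplates: []string{"gpu"}}
	claims := claimStore{}
	fmt.Println(preEnqueue(p, claims) != nil) // claim missing: Pod stays out of the queue
	claims["web/gpu"] = true
	fmt.Println(isRelevantEvent(p, "web/gpu"), preEnqueue(p, claims) == nil)
}
```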
Informer callbacks must be prepared to get cache.DeletedFinalStateUnknown as
the deleted object. They can use that as a hint that some information may have
been missed, but typically they just retrieve the stored object inside it.
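A minimal sketch of such a delete callback helper. To keep it free of external dependencies, `deletedFinalStateUnknown` is a local type that mirrors the shape of client-go's `cache.DeletedFinalStateUnknown` (fields `Key` and `Obj`); `resourceClaim` and `claimFromDelete` are hypothetical names.

```go
package main

import "fmt"

// deletedFinalStateUnknown mirrors the shape of client-go's
// cache.DeletedFinalStateUnknown so that this sketch compiles on its own.
type deletedFinalStateUnknown struct {
	Key string
	Obj interface{}
}

// resourceClaim is a hypothetical stand-in for the real API type.
type resourceClaim struct{ Name string }

// claimFromDelete unwraps the object passed to a delete callback.
// stale reports whether a tombstone was seen, i.e. whether some updates
// may have been missed; ok reports whether a claim could be extracted.
func claimFromDelete(obj interface{}) (claim *resourceClaim, stale bool, ok bool) {
	if tombstone, isTombstone := obj.(deletedFinalStateUnknown); isTombstone {
		// Typical handling: continue with the stored object inside it.
		obj, stale = tombstone.Obj, true
	}
	claim, ok = obj.(*resourceClaim)
	return claim, stale, ok
}

func main() {
	wrapped := deletedFinalStateUnknown{Key: "default/claim-a", Obj: &resourceClaim{Name: "claim-a"}}
	claim, stale, ok := claimFromDelete(wrapped)
	fmt.Println(claim.Name, stale, ok)
}
```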
Generating the name avoids all potential name collisions. It's not clear how
much of a problem that was because users can avoid them and the deterministic
names for generic ephemeral volumes have not led to reports from users. But
using generated names is not too hard either.
What makes it relatively easy is that the new pod.status.resourceClaimStatus
map stores the generated name for kubelet and node authorizer, i.e. the
information in the pod is sufficient to determine the name of the
ResourceClaim.
The resource claim controller becomes a bit more complex and now needs
permission to modify the pod status. The new failure scenario of "ResourceClaim
created, updating pod status fails" is handled with the help of a new special
"resource.kubernetes.io/pod-claim-name" annotation that together with the owner
reference identifies exactly what a ResourceClaim was generated for, so
updating the pod status can be retried for existing ResourceClaims.
The transition from deterministic names is handled with a special case for that
recovery code path: a ResourceClaim with no annotation and a name that follows
the Kubernetes <= 1.27 naming pattern is assumed to be generated for that pod
claim and gets added to the pod status.
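The recovery path could look roughly like the sketch below. It assumes that ownership of the ResourceClaim by the pod has already been verified, and it assumes that the deterministic name was "<pod name>-<pod claim name>"; that pattern, like the helper names `legacyName` and `podClaimFor`, is an assumption of this sketch and not taken from the text above.

```go
package main

import "fmt"

const podClaimNameAnnotation = "resource.kubernetes.io/pod-claim-name"

// resourceClaim is a hypothetical, simplified stand-in for the API type.
type resourceClaim struct {
	Name        string
	Annotations map[string]string
}

// legacyName encodes an ASSUMED deterministic naming pattern
// ("<pod name>-<pod claim name>"); it is here only for illustration.
func legacyName(podName, podClaimName string) string {
	return podName + "-" + podClaimName
}

// podClaimFor returns the pod claim that an owned ResourceClaim was generated
// for. The annotation wins; a claim without annotation is matched against the
// legacy names of all pod claims.
func podClaimFor(podName string, podClaimNames []string, claim resourceClaim) (string, bool) {
	if name, ok := claim.Annotations[podClaimNameAnnotation]; ok {
		return name, true
	}
	for _, n := range podClaimNames {
		if claim.Name == legacyName(podName, n) {
			return n, true
		}
	}
	return "", false
}

func main() {
	generated := resourceClaim{
		Name:        "web-gpu-x7k2p",
		Annotations: map[string]string{podClaimNameAnnotation: "gpu"},
	}
	legacy := resourceClaim{Name: "web-gpu"}
	fmt.Println(podClaimFor("web", []string{"gpu"}, generated))
	fmt.Println(podClaimFor("web", []string{"gpu"}, legacy))
}
```

Once the pod claim is identified, the controller can retry writing the claim name into the pod status.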
There's no immediate need for it, but in case it becomes relevant,
the name of the generated ResourceClaim may also be left unset to record that
no claim was needed. Components processing such a pod can skip whatever they
normally would do for the claim. To ensure that they do and also cover other
cases properly ("no known field is set", "must check ownership"),
resourceclaim.Name gets extended.
This marks the pods with restartable init containers as
`UnschedulableAndUnresolvable` if the feature gate is disabled to avoid
the inconsistency in resource calculation between the scheduler and the
older kubelet.
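As a rough illustration of that check, with hypothetical local types (`container`, `status`) and helper names (`hasRestartableInit`, `checkPod`) instead of the real scheduler plugin code:

```go
package main

import "fmt"

// container is a hypothetical, simplified init container.
type container struct {
	Name          string
	RestartPolicy *string // "Always" marks a restartable init container
}

type status string

const (
	success                      status = "Success"
	unschedulableAndUnresolvable status = "UnschedulableAndUnresolvable"
)

// hasRestartableInit reports whether any init container restarts on its own.
func hasRestartableInit(inits []container) bool {
	for _, c := range inits {
		if c.RestartPolicy != nil && *c.RestartPolicy == "Always" {
			return true
		}
	}
	return false
}

// checkPod rejects Pods with restartable init containers while the feature
// gate is off, because resource calculation would otherwise be inconsistent.
func checkPod(inits []container, gateEnabled bool) status {
	if !gateEnabled && hasRestartableInit(inits) {
		return unschedulableAndUnresolvable
	}
	return success
}

func main() {
	always := "Always"
	inits := []container{{Name: "log-shipper", RestartPolicy: &always}}
	fmt.Println(checkPod(inits, false), checkPod(inits, true))
}
```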
We only added the failed plugins, but this does not actually work unless
we build the status from a fitError, because the failed plugins are only
copied to podInfo if the error is a fitError.
Signed-off-by: kerthcet <kerthcet@gmail.com>
event is not passed to QueueingHintFn, but a comment about it still exists.
event is unnecessary in QueueingHintFn because QueueingHintFn is used in
ClusterEventWithHint and ClusterEventWithHint already has a ClusterEvent.
Signed-off-by: Shingo Omura <everpeace@gmail.com>
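The relationship between the hint function and its event can be sketched as follows. The types are simplified local stand-ins: the string-based `clusterEvent`, the two hint values and the `hintsFor` helper are assumptions of this sketch, not the framework's definitions.

```go
package main

import "fmt"

// Simplified stand-ins for the framework types.
type clusterEvent string
type queueingHint string

const (
	queueSkip queueingHint = "skip"
	queue     queueingHint = "queue"
)

// queueingHintFn takes no event parameter: the event it reacts to is implied
// by the clusterEventWithHint entry that the function is registered in.
type queueingHintFn func(pod string, oldObj, newObj interface{}) queueingHint

type clusterEventWithHint struct {
	Event          clusterEvent
	QueueingHintFn queueingHintFn
}

// hintsFor calls only the hint functions registered for the given event.
func hintsFor(registered []clusterEventWithHint, ev clusterEvent, pod string, oldObj, newObj interface{}) []queueingHint {
	var hints []queueingHint
	for _, r := range registered {
		if r.Event == ev {
			hints = append(hints, r.QueueingHintFn(pod, oldObj, newObj))
		}
	}
	return hints
}

func main() {
	registered := []clusterEventWithHint{{
		Event: "Node/Add",
		QueueingHintFn: func(pod string, oldObj, newObj interface{}) queueingHint {
			return queue
		},
	}}
	fmt.Println(hintsFor(registered, "Node/Add", "web", nil, "node-1"))
}
```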