Blocking API calls during a scheduling cycle like the DRA plugin is doing slow
down overall scheduling, i.e. also affecting pods which don't use DRA.
It is easy to move the blocking calls into a goroutine while the scheduling
cycle ends with "pod unschedulable". The hard part is handling an error when
those API calls then fail in the background. There is a solution for that
(see https://github.com/kubernetes/kubernetes/pull/120963), but it's complex.
Instead, publishing the modified PodSchedulingContext can also be done
later. In the more common case of a pod which is ready for binding except for
its claims, that'll be in PreBind, which runs in a separate goroutine already.
In the less common case that a pod cannot be scheduled, that'll be in
Unreserve which is still blocking.
This moves adding a pod to ReservedFor out of the main scheduling cycle into
PreBind. There it is done concurrently in different goroutines. For claims
which were specifically allocated for a pod (the most common case), that
usually makes no difference because the claim is already reserved.
It starts to matter when that pod then cannot be scheduled for other reasons,
because then the claim gets unreserved to allow deallocating it. It also
matters for claims that are created separately and then get used multiple times
by different pods.
Because multiple pods might get added to the same claim rapidly independently
from each other, it makes sense to do all claim status updates via patching:
then it is no longer necessary to have an up-to-date copy of the claim because
the patch operation will succeed if (and only if) the patched claim is valid.
Server-side-apply cannot be used for this because a client always has to send
the full list of all entries that it wants to be set, i.e. it cannot add one
entry unless it knows the full list.
When dealing with unschedulable pods, the intent was to deallocate only claims
which are allocated and use delayed allocation. That if check wasn't handled
correctly, causing also claims with immediate allocation to be considered as
candidates.
Found during code reading, probably has never occurred in practice yet.
During scheduler_perf testing, roughly 10% of the PodSchedulingContext update
operations failed with a conflict error. Using SSA would avoid that, but
performance measurements showed that this causes a considerable
slowdown (primarily because of the slower encoding with JSON instead of
protobuf, but also because server-side processing is more expensive).
Therefore a normal update is tried first and SSA only gets used when there has
been a conflict. Using SSA in that case instead of giving up outright is better
because it avoids another scheduling attempt.
This fixes a test flake:
[sig-node] DRA [Feature:DynamicResourceAllocation] multiple nodes reallocation [It] works
/nvme/gopath/src/k8s.io/kubernetes/test/e2e/dra/dra.go:552
[FAILED] number of deallocations
Expected
<int64>: 2
to equal
<int64>: 1
In [It] at: /nvme/gopath/src/k8s.io/kubernetes/test/e2e/dra/dra.go:651 @ 09/05/23 14:01:54.652
This can be reproduced locally with
stress -p 10 go test ./test/e2e -args -ginkgo.focus=DynamicResourceAllocation.*reallocation.works -ginkgo.no-color -v=4 -ginkgo.v
Log output showed that the sequence of events leading to this was:
- claim gets allocated because of selected node
- a different node has to be used, so PostFilter sets
claim.status.deallocationRequested
- the driver deallocates
- before the scheduler can react and select a different node,
the driver allocates *again* for the original node
- the scheduler asks for deallocation again
- the driver deallocates again (causing the test failure)
- eventually the pod runs
The fix is to disable allocations first by removing the selected node and then
starting to deallocate.
Instead of modifying the PodSchedulingContext and then creating or updating it,
now the required changes (selected node, potential nodes) are tracked and the
actual input for an API call is created if (and only if) needed at the end.
This makes the code easier to read and change. In particular, replacing the
Update call with Patch or Apply is easy.
When filtering fails because a ResourceClass is missing, we can treat the pod
as "unschedulable" as long as we then also register a cluster event that wakes
up the pod. This is more efficient than periodically retrying.
This is a combination of two related enhancements:
- By implementing a PreEnqueue check, the initial pod scheduling
attempt for a pod with a claim template gets avoided when the claim
does not exist yet.
- By implementing cluster event checks, only those pods get
scheduled for which something changed, and they get scheduled
immediately without delay.
Generating the name avoids all potential name collisions. It's not clear how
much of a problem that was because users can avoid them and the deterministic
names for generic ephemeral volumes have not led to reports from users. But
using generated names is not too hard either.
What makes it relatively easy is that the new pod.status.resourceClaimStatus
map stores the generated name for kubelet and node authorizer, i.e. the
information in the pod is sufficient to determine the name of the
ResourceClaim.
The resource claim controller becomes a bit more complex and now needs
permission to modify the pod status. The new failure scenario of "ResourceClaim
created, updating pod status fails" is handled with the help of a new special
"resource.kubernetes.io/pod-claim-name" annotation that together with the owner
reference identifies exactly for what a ResourceClaim was generated, so
updating the pod status can be retried for existing ResourceClaims.
The transition from deterministic names is handled with a special case for that
recovery code path: a ResourceClaim with no annotation and a name that follows
the Kubernetes <= 1.27 naming pattern is assumed to be generated for that pod
claim and gets added to the pod status.
There's no immediate need for it, but just in case that it may become relevant,
the name of the generated ResourceClaim may also be left unset to record that
no claim was needed. Components processing such a pod can skip whatever they
normally would do for the claim. To ensure that they do and also cover other
cases properly ("no known field is set", "must check ownership"),
resourceclaim.Name gets extended.
The name "PodScheduling" was unusual because in contrast to most other names,
it was impossible to put an article in front of it. Now PodSchedulingContext is
used instead.