In reality, the kubelet plugin of a DRA driver is meant to be deployed as a
daemonset with a service account that limits its
permissions. https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#additional-metadata-in-pod-bound-tokens
ensures that the node name is bound to the pod, which then can be used
in a validating admission policy (VAP) to ensure that the operations are
limited to the node.
In E2E testing, we emulate that via impersonation. This ensures that the plugin
does not accidentally depend on additional permissions.
This is the second and final step towards making kubelet independent of the
resource.k8s.io API versioning because it now doesn't need to copy structs
defined by that API from the driver to the API server.
This is a first step towards making kubelet independent of the resource.k8s.io
API versioning because it now doesn't need to copy structs defined by that API
from the driver to the API server. The next step is removing the other
direction (reading ResourceClaim status and passing the resource handle to
drivers).
The drivers must get deployed so that they have their own connection to the API
server. Securing at least the writes via a validating admission policy should
be possible.
As before, the kubelet removes all ResourceSlices for its node at startup, then
DRA drivers recreate them if (and only if) they start up again. This ensures
that there are no orphaned ResourceSlices when a driver gets removed while the
kubelet was down.
While at it, logging gets cleaned up and updated to use structured, contextual
logging as much as possible. gRPC requests and streams now use a shared,
per-process request ID and streams also get logged.
This makes the API nicer:
resourceClaims:
- name: with-template
resourceClaimTemplateName: test-inline-claim-template
- name: with-claim
resourceClaimName: test-shared-claim
Previously, this was:
resourceClaims:
- name: with-template
source:
resourceClaimTemplateName: test-inline-claim-template
- name: with-claim
source:
resourceClaimName: test-shared-claim
A more long-term benefit is that other, future alternatives
might not make sense under the "source" umbrella.
This is a breaking change. It's justified because DRA is still
alpha and will have several other API breaks in 1.31.
Dropping the error that is returned by allocateOne hides the reason *why*
allocation failed. Including the UID is "too much information" for an error
message (usually the user doesn't care about the exact identity, just the name)
and the claim name can and will be added by the caller.
Before:
controller.go:373: E0625 16:04:12.140953] test-driver.cdi.k8s.io/resource controller: processing failed err="claim test-dramq9jv-resource-h72pg: failed allocating claim 8551afba-3c9a-4a8a-8633-6fad6c4b9e42" key="schedulingCtx:test/test-dramq9jv"
event.go:377: I0625 16:04:12.141031] test-driver.cdi.k8s.io/resource controller: Event(v1.ObjectReference{Kind:"PodSchedulingContext", Namespace:"test", Name:"test-dra65gfw", UID:"6be9ba57-31da-4fef-b61d-b0468d71afcf", APIVersion:"resource.k8s.io/v1alpha3", ResourceVersion:"197", FieldPath:""}): type: 'Warning' reason: 'Failed' claim test-dra65gfw-resource-zpzrj: failed allocating claim f98a32e1-ab7d-4b34-a258-6d8224aa9006
After:
controller.go:373: E0625 16:02:54.248059] test-driver.cdi.k8s.io/resource controller: processing failed err="claim test-dram98ll-resource-nvsbj: device selectors are not supported" key="schedulingCtx:test/test-dram98ll"
event.go:377: I0625 16:02:54.248163] test-driver.cdi.k8s.io/resource controller: Event(v1.ObjectReference{Kind:"PodSchedulingContext", Namespace:"test", Name:"test-dratpt77", UID:"24010402-b026-4fe4-a535-e1dab69db8c0", APIVersion:"resource.k8s.io/v1alpha3", ResourceVersion:"298", FieldPath:""}): type: 'Warning' reason: 'Failed' claim test-dratpt77-resource-vlgrv: device selectors are not supported
When using structured parameters, the instance name must match and not be in
use already.
NodeUnprepareResources must be called with the same handle are
NodePrepareResources.
While currently those objects only get published by the kubelet for node-local
resources, this could change once we also support network-attached
resources. Dropping the "Node" prefix enables such a future extension.
The NodeName in ResourceSlice and StructuredResourceHandle then becomes
optional. The kubelet still needs to provide one and it must match its own node
name, otherwise it doesn't have permission to access ResourceSlice objects.
The information is received from the DRA driver plugin through a new gRPC
streaming interface. This is backwards compatible with old DRA driver kubelet
plugins, their gRPC server will return "not implemented" and that can be
handled by kubelet. Therefore no API break is needed.
However, DRA drivers need to be updated because the Go API changed. They can
return
status.New(codes.Unimplemented, "no node resource support").Err()
if they don't support the new ListAndWatchResources method and
structured parameters.
The controller in kubelet then synchronizes this information from the driver
with NodeResourceSlice objects, creating, updating and deleting them as needed.
If the resource handle has data from a structured parameter model, then we need
to pass that to the DRA driver kubelet plugin. Because Kubernetes uses
gogo/protobuf, we cannot use "optional" for that new optional field and have to
resort to "repeated" with a single repetition if present.
This is a new, backwards-compatible field.
That extending the resource.k8s.io changes the checksum of a kubelet checkpoint
is unfortunate. Updating the test cases is a stop-gap measure, the actual
solution will have to be something else before beta.
When failing inside the `ginkgo.By` callback function, skipping intermediate
stack frames didn't work properly because `ginkgo.By` itself and other internal
code is also on the stack.
To fix this, the code which can fail now runs outside of such a
callback. That's not a big loss, the only advantage of the callback was getting
timing statistics from Ginkgo which weren't used in practice.
This moves adding a pod to ReservedFor out of the main scheduling cycle into
PreBind. There it is done concurrently in different goroutines. For claims
which were specifically allocated for a pod (the most common case), that
usually makes no difference because the claim is already reserved.
It starts to matter when that pod then cannot be scheduled for other reasons,
because then the claim gets unreserved to allow deallocating it. It also
matters for claims that are created separately and then get used multiple times
by different pods.
Because multiple pods might get added to the same claim rapidly independently
from each other, it makes sense to do all claim status updates via patching:
then it is no longer necessary to have an up-to-date copy of the claim because
the patch operation will succeed if (and only if) the patched claim is valid.
Server-side-apply cannot be used for this because a client always has to send
the full list of all entries that it wants to be set, i.e. it cannot add one
entry unless it knows the full list.
This changes the text registration so that tags for which the framework has a
dedicated API (features, feature gates, slow, serial, etc.) those APIs are
used.
Arbitrary, custom tags are still left in place for now.
Several enhancements:
- `--resource-config` is now listed under `controller` options instead of
`leader election`: merely a cosmetic change
- The driver name can be configured as part of the resource config. The
command line flag overrides the config, but only when set explicitly.
This makes it possible to pre-define complete driver setups where the
name is associated with certain resource availability. This will be
used for testing cluster autoscaling.
- The set of nodes where resources are available can optionally be specified
via node labels. This will be used for testing cluster autoscaling.
When filtering fails because a ResourceClass is missing, we can treat the pod
as "unschedulable" as long as we then also register a cluster event that wakes
up the pod. This is more efficient than periodically retrying.
Analyzing the CPU profile of
go test -timeout=0 -count=5 -cpuprofile profile.out -bench=BenchmarkPerfScheduling/.*Claim.* -benchtime=1ns -run=xxx ./test/integration/scheduler_perf
showed that a significant amount of time was spent iterating over allocated
claims to determine how many were allocated per node. That "naive" approach was
taken to avoid maintaining a redundant data structure, but now that performance
measurements show that this comes at a cost, it's not "premature optimization"
anymore to introduce such a second field.
The average scheduling throughput in
SchedulingWithResourceClaimTemplate/2000pods_100nodes increases from 16.4
pods/s to 19.2 pods/s.
When someone decides that a Pod should definitely run on a specific node, they
can create the Pod with spec.nodeName already set. Some custom scheduler might
do that. Then kubelet starts to check the pod and (if DRA is enabled) will
refuse to run it, either because the claims are still waiting for the first
consumer or the pod wasn't added to reservedFor. Both are things the scheduler
normally does.
Also, if a pod got scheduled while the DRA feature was off in the
kube-scheduler, a pod can reach the same state.
The resource claim controller can handle these two cases by taking over for the
kube-scheduler when nodeName is set. Triggering an allocation is simpler than
in the scheduler because all it takes is creating the right
PodSchedulingContext with spec.selectedNode set. There's no need to list nodes
because that choice was already made, permanently. Adding the pod to
reservedFor also isn't hard.
What's currently missing is triggering de-allocation of claims to re-allocate
them for the desired node. This is not important for claims that get created
for the pod from a template and then only get used once, but it might be
worthwhile to add de-allocation in the future.
The recommendation and default in the controller helper code is to set
ReservedFor to the pod which triggered delayed allocation. However, this
is neither required nor enforced. Therefore we should also test the fallback
path were kube-scheduler itself adds the pod to ReservedFor.
Combining all prepare/unprepare operations for a pod enables plugins to
optimize the execution. Plugins can continue to use the v1beta2 API for now,
but should switch. The new API is designed so that plugins which want to work
on each claim one-by-one can do so and then report errors for each claim
separately, i.e. partial success is supported.
The main problem probably was that
https://github.com/kubernetes/kubernetes/pull/118862 moved creating the first
pod before setting up the callback which blocks allocating one claim for that
pod. This is racy because allocations happen in the background.
The test also was unnecessarily complex and hard to read:
- The intended effect can be achieved with three instead of four claims.
- It wasn't clear which claim has "external-claim-other" as name.
Using the claim variable avoids that.
Generating the name avoids all potential name collisions. It's not clear how
much of a problem that was because users can avoid them and the deterministic
names for generic ephemeral volumes have not led to reports from users. But
using generated names is not too hard either.
What makes it relatively easy is that the new pod.status.resourceClaimStatus
map stores the generated name for kubelet and node authorizer, i.e. the
information in the pod is sufficient to determine the name of the
ResourceClaim.
The resource claim controller becomes a bit more complex and now needs
permission to modify the pod status. The new failure scenario of "ResourceClaim
created, updating pod status fails" is handled with the help of a new special
"resource.kubernetes.io/pod-claim-name" annotation that together with the owner
reference identifies exactly for what a ResourceClaim was generated, so
updating the pod status can be retried for existing ResourceClaims.
The transition from deterministic names is handled with a special case for that
recovery code path: a ResourceClaim with no annotation and a name that follows
the Kubernetes <= 1.27 naming pattern is assumed to be generated for that pod
claim and gets added to the pod status.
There's no immediate need for it, but just in case that it may become relevant,
the name of the generated ResourceClaim may also be left unset to record that
no claim was needed. Components processing such a pod can skip whatever they
normally would do for the claim. To ensure that they do and also cover other
cases properly ("no known field is set", "must check ownership"),
resourceclaim.Name gets extended.