Dynamic resource allocation is similar to storage in the sense that users
create ResourceClaim objects to request resources, just as they create
PersistentVolumeClaim objects for storage. The actual resource usage is only
known when claims get allocated, but some limits can already be enforced at
admission time:
- "count/resourceclaims.resource.k8s.io" limits the number of ResourceClaim objects in
a namespace; this is a generic feature that is already supported also without
this commit.
- "resourceclaims" is *not* an alias - use "count/resourceclaims.resource.k8s.io"
instead.
- <device-class-name>.deviceclass.resource.k8s.io/devices limits the number of
ResourceClaim objects in a namespace such that the number of devices
requested through those objects with that class does not exceed the limit.
A single request may cause the allocation of multiple devices. For exact
counts, the quota limit is based on the sum of those exact counts. For requests
asking for "all" matching devices, the maximum number of allocated devices per
claim is used as a worst-case upper bound.
Requests asking for "admin access" contribute to the quota.
DRA quota: remove admin mode exception
Split running the Wardle aggregated API into preparation and
running phase. This allows reusing the prepared options and
makes it possible for us to introduce additional hooks into
the server authorization flow.
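Roughly the shape of the split, with made-up names (the actual test helpers
and options differ):

    package wardletest

    import "context"

    // PreparedWardle holds the completed options and the constructed server so
    // that callers can add hooks (e.g. into authorization) before starting it.
    // Everything in this sketch is illustrative only.
    type PreparedWardle struct {
        run func(ctx context.Context) error
    }

    // PrepareWardle does all the option and config work that previously
    // happened inside a single "run" helper.
    func PrepareWardle(ctx context.Context) (*PreparedWardle, error) {
        // ... build options, complete the config, construct the server ...
        return &PreparedWardle{
            run: func(ctx context.Context) error {
                // ... start the aggregated API server and block until done ...
                <-ctx.Done()
                return nil
            },
        }, nil
    }

    // Run starts the previously prepared server.
    func (p *PreparedWardle) Run(ctx context.Context) error {
        return p.run(ctx)
    }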
Some of the E2E node tests were flaky. Their timeout apparently was chosen
under the assumption that kubelet would retry immediately after a failed gRPC
call, with a factor of 2 as safety margin. But according to
0449cef8fd,
kubelet has a different, higher retry period of 90 seconds, which was exactly
the test timeout. The test timeout has to be higher than that.
As the tests don't use the gRPC call timeout anymore, it can be made
private. While at it, the name and documentation get updated.
This fixes the message (node name and "cluster-scoped" were switched) and
simplifies the VAP:
- a single matchCondition short-circuits evaluation completely unless the
request comes from a user we care about
- variables that extract the userNodeName and objectNodeName once
(using optionals to gracefully turn missing claims and fields into empty strings)
- leaves only very small, concise validations
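A hedged sketch of the resulting shape, expressed with the Go
admissionregistration/v1 types; the CEL expressions, names, and message below
are illustrative, not the exact contents of the policy:

    package example

    import (
        admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // Illustrative only: one matchCondition gates everything, variables extract
    // the two node names once, and the validations stay tiny.
    var policy = &admissionregistrationv1.ValidatingAdmissionPolicy{
        ObjectMeta: metav1.ObjectMeta{Name: "example-node-policy"},
        Spec: admissionregistrationv1.ValidatingAdmissionPolicySpec{
            MatchConditions: []admissionregistrationv1.MatchCondition{{
                Name:       "only-nodes",
                Expression: `request.userInfo.username.startsWith("system:node:")`,
            }},
            Variables: []admissionregistrationv1.Variable{{
                Name:       "userNodeName",
                Expression: `request.userInfo.extra[?"authentication.kubernetes.io/node-name"].orValue([""])[0]`,
            }, {
                Name:       "objectNodeName",
                Expression: `object.?spec.nodeName.orValue("")`,
            }},
            Validations: []admissionregistrationv1.Validation{{
                Expression: `variables.userNodeName == variables.objectNodeName`,
                Message:    "may only modify objects belonging to the node the kubelet runs on",
            }},
        },
    }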
Co-authored-by: Jordan Liggitt <liggitt@google.com>
In the API, the effect of the feature gate is that alpha fields get dropped on
create. They get preserved during updates if already set. The
PodSchedulingContext registration is *not* restricted by the feature gate.
This enables deleting stale PodSchedulingContext objects after disabling
the feature gate.
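This follows the usual strategy pattern for gated fields; the sketch below is
an approximation (the field, gate, and helper names mirror the v1alpha3 types
but are not guaranteed to match the real strategy code exactly):

    package resourceclaim

    import (
        resourceapi "k8s.io/api/resource/v1alpha3"
        utilfeature "k8s.io/apiserver/pkg/util/feature"
        "k8s.io/kubernetes/pkg/features"
    )

    // dropDisabledFields clears gated fields on create and preserves them
    // during updates if the old object already had them set.
    func dropDisabledFields(newSpec, oldSpec *resourceapi.ResourceClaimSpec) {
        if utilfeature.DefaultFeatureGate.Enabled(features.DRAControlPlaneController) {
            return // feature enabled: keep everything
        }
        if oldSpec != nil && oldSpec.Controller != "" {
            return // field already in use before the update: preserve it
        }
        newSpec.Controller = "" // drop on create (or when previously unset)
    }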
The scheduler checks the new feature gate before setting up an informer for
PodSchedulingContext objects and when deciding whether it can schedule a
pod. If any claim depends on a control plane controller, the scheduler bails
out, leading to:
Status: Pending
...
Warning FailedScheduling 73s default-scheduler 0/1 nodes are available: resourceclaim depends on disabled DRAControlPlaneController feature. no new claims to deallocate, preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
The rest of the changes prepare for testing the new feature separately from
"structured parameters". The goal is to have base "dra" jobs which just enable
and test those, then "classic-dra" jobs which add DRAControlPlaneController.
The structured parameter allocation logic was written from scratch in
staging/src/k8s.io/dynamic-resource-allocation/structured where it might be
useful for out-of-tree components.
Besides the new features (amount, admin access) and the new API, it now supports
backtracking when the initial device selection doesn't lead to a complete
allocation of all claims.
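The core backtracking idea can be illustrated with a simplified, self-contained
sketch; it ignores CEL selectors, constraints, and "all" requests and is not
the actual allocator:

    package main

    import "fmt"

    // request asks for a fixed number of devices of one class.
    type request struct {
        class string
        count int
    }

    // allocate tries to satisfy all requests from the pool of free devices,
    // backtracking when an earlier pick makes a later request unsatisfiable.
    func allocate(requests []request, free map[string][]string, result map[int][]string) bool {
        if len(requests) == 0 {
            return true // everything satisfied
        }
        req := requests[0]
        candidates := free[req.class]
        // Try each contiguous window of candidates (a simplification of trying
        // all combinations) and recurse; undo the pick if the rest fails.
        for shift := 0; shift+req.count <= len(candidates); shift++ {
            picked := candidates[shift : shift+req.count]
            remaining := append([]string{}, candidates[:shift]...)
            remaining = append(remaining, candidates[shift+req.count:]...)
            free[req.class] = remaining
            result[len(result)] = picked
            if allocate(requests[1:], free, result) {
                return true
            }
            delete(result, len(result)-1) // backtrack
            free[req.class] = candidates
        }
        return false
    }

    func main() {
        free := map[string][]string{"gpu": {"gpu-0", "gpu-1", "gpu-2"}}
        result := map[int][]string{}
        fmt.Println(allocate([]request{{"gpu", 2}, {"gpu", 1}}, free, result), result)
    }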
Co-authored-by: Ed Bartosh <eduard.bartosh@intel.com>
Co-authored-by: John Belamaric <jbelamaric@google.com>
The advantages of using a validation admission policy (VAP) are that no changes
are needed in Kubernetes and that admins have full flexibility over whether and
how they want to control which users are allowed to use "admin access" in their
requests.
The downside is that without admins taking action, the feature is enabled
out-of-the-box in a cluster. Documentation for DRA will have to make it very
clear that something needs to be done in multi-tenant clusters.
The test/e2e/testing-manifests/dra/admin-access-policy.yaml shows how to do
this. The corresponding E2E test ensures that it actually works as intended.
For some reason, adding the namespace to the message expression leads to a
type-check error, so it's currently commented out.
This is a complete revamp of the original API. Some of the key
differences:
- refocused on structured parameters and allocating devices
- support for constraints across devices
- support for allocating "all" or a fixed amount
of similar devices in a single request
- no class for ResourceClaims, instead individual
device requests are associated with a mandatory
DeviceClass
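Roughly what a claim looks like under the revamped API; the field names below
approximate the v1alpha3 Go types and may not match them exactly:

    package example

    import (
        resourceapi "k8s.io/api/resource/v1alpha3"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // Illustrative only: one request for exactly two devices of a mandatory
    // DeviceClass; there is no class at the claim level anymore.
    var claim = &resourceapi.ResourceClaim{
        ObjectMeta: metav1.ObjectMeta{Name: "gpu-claim", Namespace: "team-a"},
        Spec: resourceapi.ResourceClaimSpec{
            Devices: resourceapi.DeviceClaim{
                Requests: []resourceapi.DeviceRequest{{
                    Name:            "gpus",
                    DeviceClassName: "gpu-class",
                    AllocationMode:  resourceapi.DeviceAllocationModeExactCount,
                    Count:           2,
                }},
            },
        },
    }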
For the sake of simplicity, optional basic types (ints, strings) where the zero
value is the default are represented as plain values in the API types. This
makes Go code simpler because consumers don't have to check for nil and
producers can set values directly. The effect is that in protobuf, these fields
always get encoded because `opt` only has an effect for pointers.
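For example (simplified, not the real type definitions), the difference between
the two representations looks like this:

    package example

    // With a plain value the zero value doubles as "unset": consumers never
    // have to nil-check and producers can assign directly, but protobuf always
    // encodes the field.
    type DeviceRequestValue struct {
        // Count of 0 means "use the default".
        Count int64 `json:"count,omitempty" protobuf:"varint,1,opt,name=count"`
    }

    // With a pointer the field is only encoded when non-nil, but every reader
    // needs a nil check and writers need a temporary variable or helper to
    // take an address.
    type DeviceRequestPointer struct {
        Count *int64 `json:"count,omitempty" protobuf:"varint,1,opt,name=count"`
    }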
The roundtrip test data for v1.29.0 and v1.30.0 changes because of the new
"request" field. This is considered acceptable because the entire `claims`
field in the pod spec is still alpha.
The implementation is complete enough to bring up the apiserver.
Adapting other components follows.
The process count is expected to always be >= 1 for pods in the test.
Let's check that it's >= 1, so we can catch issues if the process count is
not reported.
Signed-off-by: David Porter <david@porter.me>
Signed-off-by: Paco Xu <paco.xu@daocloud.io>
Enable LocalStorageCapacityIsolationFSQuotaMonitoring
only when hostUsers in PodSpec is set to false.
Modify unit tests and e2e tests to verify this behavior.
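A hedged sketch of the kind of check this implies; the helper name and its
placement are made up:

    package example

    import (
        v1 "k8s.io/api/core/v1"
        utilfeature "k8s.io/apiserver/pkg/util/feature"
        "k8s.io/kubernetes/pkg/features"
    )

    // quotaMonitoringEnabledForPod reports whether FSQuota-based local storage
    // monitoring may be used for the given pod: the feature gate must be on
    // and the pod must run in a user namespace (hostUsers explicitly false).
    func quotaMonitoringEnabledForPod(pod *v1.Pod) bool {
        if !utilfeature.DefaultFeatureGate.Enabled(features.LocalStorageCapacityIsolationFSQuotaMonitoring) {
            return false
        }
        return pod.Spec.HostUsers != nil && !*pod.Spec.HostUsers
    }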
Signed-off-by: PannagaRamamanohara <pbhojara@redhat.com>
This test is flaky because the pod is not yet READY when it gets deleted at the
end of the test. This fix ensures that the pod is READY before continuing with
the rest of the test.
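One generic way to wait for readiness before the deletion step, sketched with
plain client-go and apimachinery helpers rather than whatever e2e framework
helper the test actually uses:

    package example

    import (
        "context"
        "time"

        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
    )

    // waitForPodReady polls until the pod reports the Ready condition or the
    // timeout expires.
    func waitForPodReady(ctx context.Context, c kubernetes.Interface, namespace, name string) error {
        return wait.PollUntilContextTimeout(ctx, 2*time.Second, 2*time.Minute, true,
            func(ctx context.Context) (bool, error) {
                pod, err := c.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
                if err != nil {
                    return false, err
                }
                for _, cond := range pod.Status.Conditions {
                    if cond.Type == v1.PodReady && cond.Status == v1.ConditionTrue {
                        return true, nil
                    }
                }
                return false, nil
            })
    }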
As agreed in https://github.com/kubernetes/enhancements/pull/4709, immediate
allocation is one of those features which can be removed because it makes no
sense for structured parameters and the justification for classic DRA is weak.
This is in preparation for revamping the resource.k8s.io completely. Because
there will be no support for transitioning from v1alpha2 to v1alpha3, the
roundtrip test data for that API in 1.29 and 1.30 gets removed.
Repeating the version in the import name of the API packages is not really
required. It was done for a while to support simpler grepping for usage of
alpha APIs, but there are better ways for that now. So during this transition,
"resourceapi" gets used instead of "resourcev1alpha3" and the version gets
dropped from informer and lister imports. The advantage is that the next bump
to v1beta1 will affect fewer source code lines.
Only source code where the version really matters (like API registration)
retains the versioned import.
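In practice that means most files import the package like this; only
version-sensitive code such as API registration keeps a versioned alias:

    package example

    // Before: versioned alias everywhere, so every usage mentions the version.
    //
    //     import resourcev1alpha3 "k8s.io/api/resource/v1alpha3"
    //
    // After: version-agnostic alias; a future bump to v1beta1 only touches the
    // import line.
    import resourceapi "k8s.io/api/resource/v1alpha3"

    var claim = &resourceapi.ResourceClaim{}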