Commit Graph

83 Commits

Patrick Ohly
77341f7595 DRA: remove support for v1alpha2 kubelet API
The v1alpha2 API is several releases old. No current drivers should still
depend on it.
2024-04-19 18:27:05 +02:00
Kubernetes Prow Robot
7ebb64d176 Merge pull request #124235 from bitoku/dra-e2e
Use WaitForPodCondition instead of sleep
2024-04-18 03:24:42 -07:00
Kubernetes Prow Robot
d2ce87eb94 Merge pull request #123938 from pohly/dra-structured-parameters-tests
DRA: test for structured parameters
2024-04-18 02:10:08 -07:00
Ayato Tokubi
c52160eb3c Use WaitForPodCondition instead of sleep
Signed-off-by: Ayato Tokubi <atokubi@redhat.com>
2024-04-13 00:01:11 +00:00
Ed Bartosh
26881132bd kubelet: assign Node as an owner for the ResourceSlice
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
2024-03-15 09:46:13 +02:00
Patrick Ohly
cf8fffae72 dra e2e: sanity check resource handle
When using structured parameters, the instance name must match and not be in
use already.

NodeUnprepareResources must be called with the same handle as
NodePrepareResources.
2024-03-14 20:42:31 +01:00
Patrick Ohly
f149d6d8f9 dra e2e: watch claims and validate them
Logging claims helps with debugging test failures. Checking the finalizer
catches unexpected behavior.
2024-03-14 20:42:31 +01:00
Patrick Ohly
a0add8d2c7 dra api: NodeResourceModel -> ResourceModel
When renaming NodeResourceSlice to ResourceSlice, the embedded
[Node]ResourceModel also should have been renamed.
2024-03-14 18:07:36 +01:00
Patrick Ohly
7f5566ac6f dra e2e: enable more tests for usage with structured parameters
This finishes the reshuffling of test scenarios so that all of the scenarios
which make sense with structured parameters are also executed with them.
2024-03-07 22:26:20 +01:00
Patrick Ohly
2c6246c906 dra e2e: move ResourceSlice test
This test should run with multiple nodes; it's more realistic that way.
2024-03-07 22:23:03 +01:00
Patrick Ohly
0b6a0d686a dra api: rename NodeResourceSlice -> ResourceSlice
While currently those objects only get published by the kubelet for node-local
resources, this could change once we also support network-attached
resources. Dropping the "Node" prefix enables such a future extension.

The NodeName in ResourceSlice and StructuredResourceHandle then becomes
optional. The kubelet still needs to provide one and it must match its own node
name, otherwise it doesn't have permission to access ResourceSlice objects.
2024-03-07 22:22:55 +01:00
Patrick Ohly
234dc1f63d dra e2e: run more test scenarios with structured parameters 2024-03-07 22:22:13 +01:00
Patrick Ohly
d59676a545 dra kubelet: publish NodeResourceSlices
The information is received from the DRA driver plugin through a new gRPC
streaming interface. This is backwards compatible with old DRA driver kubelet
plugins: their gRPC server will return "not implemented", and the kubelet can
handle that. Therefore no API break is needed.

However, DRA drivers need to be updated because the Go API changed. They can
return
    status.New(codes.Unimplemented, "no node resource support").Err()
if they don't support the new ListAndWatchResources method and
structured parameters.

The controller in kubelet then synchronizes this information from the driver
with NodeResourceSlice objects, creating, updating and deleting them as needed.
2024-03-07 22:22:13 +01:00
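
For illustration, a hedged sketch of that Unimplemented return in a driver's gRPC server. The method name comes from the commit message above; the receiver, request, and stream types are placeholders, not the exact generated kubelet plugin API:

    import (
        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    // ListAndWatchResources is the new streaming method. A driver without
    // structured parameter support simply reports it as not implemented, which
    // the kubelet can handle (placeholder types, not the generated stubs).
    func (d *exampleDriver) ListAndWatchResources(req *ListAndWatchResourcesRequest, stream listAndWatchResourcesStream) error {
        return status.New(codes.Unimplemented, "no node resource support").Err()
    }
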
Patrick Ohly
5e40afca06 dra testing: add tests for structured parameters
The test driver now supports a ConfigMap (as before) and the named resources
structured parameter model. It doesn't have any instance attributes.
2024-03-07 22:22:13 +01:00
Patrick Ohly
6f1ddfcd2e kubelet: support structured parameters for preparing resources
If the resource handle has data from a structured parameter model, then we need
to pass that to the DRA driver kubelet plugin. Because Kubernetes uses
gogo/protobuf, we cannot use "optional" for that new optional field and have to
resort to "repeated" with a single repetition if present.

This is a new, backwards-compatible field.

That extending the resource.k8s.io API changes the checksum of a kubelet
checkpoint is unfortunate. Updating the test cases is a stop-gap measure; the
actual solution will have to be something else before beta.
2024-03-07 22:22:13 +01:00
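
As a rough sketch of the "repeated with a single repetition" workaround on the Go side (field and type names here are invented for illustration, not the actual generated structs):

    // StructuredData is a hypothetical repeated field that holds at most one
    // entry; treating it as optional means checking its length before use.
    var structured *StructuredResourceHandle
    if len(handle.StructuredData) > 0 {
        structured = handle.StructuredData[0]
    }
    if structured != nil {
        // pass the structured parameter data on to the DRA driver kubelet plugin
    }
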
Tim Hockin
81ba0f3b44 Make golang::setup-env turn on workspaces
Both GO111MODULE and GOWORK default to on, so this just unsets them.  We
could set them to explicit values but this seems equivalent and cleaner.
2024-02-29 22:07:42 -08:00
Patrick Ohly
cb3180950e dra e2e: fix stack unwinding in helper function
When failing inside the `ginkgo.By` callback function, skipping intermediate
stack frames didn't work properly because `ginkgo.By` itself and other internal
code is also on the stack.

To fix this, the code which can fail now runs outside of such a
callback. That's not a big loss; the only advantage of the callback was getting
timing statistics from Ginkgo, which weren't used in practice.
2024-02-19 17:11:04 +01:00
Patrick Ohly
5d1509126f dra: patch ReservedFor during PreBind
This moves adding a pod to ReservedFor out of the main scheduling cycle into
PreBind. There it is done concurrently in different goroutines. For claims
which were specifically allocated for a pod (the most common case), that
usually makes no difference because the claim is already reserved.

It starts to matter when that pod then cannot be scheduled for other reasons,
because then the claim gets unreserved to allow deallocating it. It also
matters for claims that are created separately and then get used multiple times
by different pods.

Because multiple pods might get added to the same claim rapidly and
independently of each other, it makes sense to do all claim status updates via patching:
then it is no longer necessary to have an up-to-date copy of the claim because
the patch operation will succeed if (and only if) the patched claim is valid.

Server-side-apply cannot be used for this because a client always has to send
the full list of all entries that it wants to be set, i.e. it cannot add one
entry unless it knows the full list.
2024-01-26 10:58:03 +01:00
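
A hedged sketch of such a status patch with a generated client. This is illustrative only; the real scheduler plugin may build the patch differently, and a complete version also has to handle a still-empty reservedFor list:

    import (
        "context"
        "fmt"

        v1 "k8s.io/api/core/v1"
        resourcev1alpha2 "k8s.io/api/resource/v1alpha2"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/kubernetes"
    )

    // reserveClaim appends one entry to status.reservedFor via a JSON patch.
    // Unlike server-side apply, the client doesn't need the full list of
    // existing entries, only the one it wants to add.
    func reserveClaim(ctx context.Context, cs kubernetes.Interface, claim *resourcev1alpha2.ResourceClaim, pod *v1.Pod) error {
        patch := []byte(fmt.Sprintf(
            `[{"op":"add","path":"/status/reservedFor/-","value":{"resource":"pods","name":%q,"uid":%q}}]`,
            pod.Name, pod.UID))
        _, err := cs.ResourceV1alpha2().ResourceClaims(claim.Namespace).Patch(
            ctx, claim.Name, types.JSONPatchType, patch, metav1.PatchOptions{}, "status")
        return err
    }
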
Patrick Ohly
4ede571f8b dra e2e: unify per-node resource specification
When using a builder pattern for the actual callback, some common
code can be moved into a single function.
2023-12-21 12:43:28 +01:00
Patrick Ohly
f2cfbf44b1 e2e: use framework labels
This changes the text registration so that, for tags for which the framework
has a dedicated API (features, feature gates, slow, serial, etc.), those APIs
are used.

Arbitrary, custom tags are still left in place for now.
2023-11-01 15:17:34 +01:00
Kubernetes Prow Robot
4294c35fc9 Merge pull request #121297 from calvinballing/spellcheck-markdown
Fix typos in markdown
2023-10-25 13:18:26 +02:00
Kubernetes Prow Robot
7b9d244efd Merge pull request #120965 from bart0sh/PR122-DRA-unexpected-node-shutdown
DRA: e2e: test non-graceful node shutdown
2023-10-20 11:58:47 +02:00
Ed Bartosh
fb9f2f5bc5 DRA: e2e: test non-graceful node shutdown 2023-10-19 22:09:11 +03:00
Jim Hays
911700e64e Fix typos in markdown 2023-10-17 10:55:40 -04:00
Patrick Ohly
36146ad686 e2e dra: enhance test driver
Several enhancements:
- `--resource-config` is now listed under `controller` options instead of
  `leader election`: merely a cosmetic change
- The driver name can be configured as part of the resource config. The
  command line flag overrides the config, but only when set explicitly.
  This makes it possible to pre-define complete driver setups where the
  name is associated with certain resource availability. This will be
  used for testing cluster autoscaling.
- The set of nodes where resources are available can optionally be specified
  via node labels. This will be used for testing cluster autoscaling.
2023-09-25 19:50:33 +02:00
Patrick Ohly
c682d2b8c5 scheduler: add ResourceClass events
When filtering fails because a ResourceClass is missing, we can treat the pod
as "unschedulable" as long as we then also register a cluster event that wakes
up the pod. This is more efficient than periodically retrying.
2023-09-06 11:14:08 +02:00
Kubernetes Prow Robot
e298e92115 Merge pull request #119819 from pohly/dra-performance-test-driver
dra test: enhance performance of test driver controller
2023-08-16 04:32:26 -07:00
Patrick Ohly
0e23840929 dra test: enhance performance of test driver controller
Analyzing the CPU profile of

    go test -timeout=0 -count=5 -cpuprofile profile.out -bench=BenchmarkPerfScheduling/.*Claim.* -benchtime=1ns -run=xxx ./test/integration/scheduler_perf

showed that a significant amount of time was spent iterating over allocated
claims to determine how many were allocated per node. That "naive" approach was
taken to avoid maintaining a redundant data structure, but now that performance
measurements show that this comes at a cost, it's not "premature optimization"
anymore to introduce such a second field.

The average scheduling throughput in
SchedulingWithResourceClaimTemplate/2000pods_100nodes increases from 16.4
pods/s to 19.2 pods/s.
2023-08-08 13:36:35 +02:00
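
Roughly, the idea of such a second field looks like this (hypothetical types for illustration; not the actual test driver controller code):

    import (
        resourcev1alpha2 "k8s.io/api/resource/v1alpha2"
        "k8s.io/apimachinery/pkg/types"
    )

    // allocations tracks allocated claims and, redundantly, a per-node count,
    // so the hot path no longer iterates over all claims just to count them.
    type allocations struct {
        claims           map[types.UID]*resourcev1alpha2.ResourceClaim
        allocatedPerNode map[string]int // node name -> number of allocated claims
    }

    func (a *allocations) add(nodeName string, claim *resourcev1alpha2.ResourceClaim) {
        a.claims[claim.UID] = claim
        a.allocatedPerNode[nodeName]++
    }

    func (a *allocations) remove(nodeName string, claim *resourcev1alpha2.ResourceClaim) {
        delete(a.claims, claim.UID)
        if a.allocatedPerNode[nodeName] > 0 {
            a.allocatedPerNode[nodeName]--
        }
    }
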
carlory
57226fbd27 e2e_dra: stop using deprecated framework.ExpectEqual
Co-authored-by: Thomas Milox <thomasmilox@gmail.com>
2023-07-25 10:03:56 +08:00
Kubernetes Prow Robot
bea27f82d3 Merge pull request #118209 from pohly/dra-pre-scheduled-pods
dra: pre-scheduled pods
2023-07-13 14:43:37 -07:00
Patrick Ohly
80ab8f0542 dra: handle scheduled pods in kube-controller-manager
When someone decides that a Pod should definitely run on a specific node, they
can create the Pod with spec.nodeName already set. Some custom scheduler might
do that. Then kubelet starts to check the pod and (if DRA is enabled) will
refuse to run it, either because the claims are still waiting for the first
consumer or the pod wasn't added to reservedFor. Both are things the scheduler
normally does.

Also, if a pod got scheduled while the DRA feature was off in the
kube-scheduler, a pod can reach the same state.

The resource claim controller can handle these two cases by taking over for the
kube-scheduler when nodeName is set. Triggering an allocation is simpler than
in the scheduler because all it takes is creating the right
PodSchedulingContext with spec.selectedNode set. There's no need to list nodes
because that choice was already made, permanently. Adding the pod to
reservedFor also isn't hard.

What's currently missing is triggering de-allocation of claims to re-allocate
them for the desired node. This is not important for claims that get created
for the pod from a template and then only get used once, but it might be
worthwhile to add de-allocation in the future.
2023-07-13 21:27:11 +02:00
Kubernetes Prow Robot
047d040ce7 Merge pull request #119012 from pohly/dra-batch-node-prepare
kubelet: support batched prepare/unprepare in v1alpha3 DRA plugin API
2023-07-12 10:57:37 -07:00
Patrick Ohly
08d40f53a7 dra: test with and without immediate ReservedFor
The recommendation and default in the controller helper code is to set
ReservedFor to the pod which triggered delayed allocation. However, this
is neither required nor enforced. Therefore we should also test the fallback
path where kube-scheduler itself adds the pod to ReservedFor.
2023-07-12 16:57:17 +02:00
Kubernetes Prow Robot
3cc729fc7f Merge pull request #119195 from pohly/dra-reallocate-flake
dra e2e: fix "reallocation works" flake
2023-07-12 05:55:25 -07:00
Patrick Ohly
d743c50bb9 kubelet: support batched prepare/unprepare in v1alpha3 DRA plugin API
Combining all prepare/unprepare operations for a pod enables plugins to
optimize the execution. Plugins can continue to use the v1alpha2 API for now,
but should switch. The new API is designed so that plugins which want to work
on each claim one-by-one can do so and then report errors for each claim
separately, i.e. partial success is supported.
2023-07-12 14:50:30 +02:00
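
A hedged sketch of the per-claim reporting that enables partial success. Message and field names are assumptions loosely based on the commit message, not the exact v1alpha3 API, and prepareClaim is a hypothetical helper:

    // NodePrepareResources handles all claims of a pod in one call, but records
    // a separate result per claim, so one failure doesn't hide the others.
    func (d *driver) NodePrepareResources(ctx context.Context, req *NodePrepareResourcesRequest) (*NodePrepareResourcesResponse, error) {
        resp := &NodePrepareResourcesResponse{Claims: map[string]*NodePrepareResourceResponse{}}
        for _, claim := range req.Claims {
            cdiDevices, err := d.prepareClaim(ctx, claim) // hypothetical per-claim helper
            if err != nil {
                resp.Claims[claim.Uid] = &NodePrepareResourceResponse{Error: err.Error()}
                continue
            }
            resp.Claims[claim.Uid] = &NodePrepareResourceResponse{CDIDevices: cdiDevices}
        }
        return resp, nil
    }
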
Patrick Ohly
c143a875ed dra e2e: fix "reallocation works" flake
The main problem probably was that
https://github.com/kubernetes/kubernetes/pull/118862 moved creating the first
pod before setting up the callback which blocks allocating one claim for that
pod. This is racy because allocations happen in the background.

The test also was unnecessarily complex and hard to read:
- The intended effect can be achieved with three instead of four claims.
- It wasn't clear which claim has "external-claim-other" as name.
  Using the claim variable avoids that.
2023-07-12 11:20:47 +02:00
Patrick Ohly
ba810871ad dra e2e: check that not generating a ResourceClaim works
This is not something that normally happens, but the API supports it because it
might be needed at some point, so we have to test it.
2023-07-11 14:23:49 +02:00
Patrick Ohly
444d23bd2f dra: generated name for ResourceClaim from template
Generating the name avoids all potential name collisions. It's not clear how
much of a problem that was because users can avoid them and the deterministic
names for generic ephemeral volumes have not led to reports from users. But
using generated names is not too hard either.

What makes it relatively easy is that the new pod.status.resourceClaimStatus
map stores the generated name for kubelet and node authorizer, i.e. the
information in the pod is sufficient to determine the name of the
ResourceClaim.

The resource claim controller becomes a bit more complex and now needs
permission to modify the pod status. The new failure scenario of "ResourceClaim
created, updating pod status fails" is handled with the help of a new special
"resource.kubernetes.io/pod-claim-name" annotation that together with the owner
reference identifies exactly for what a ResourceClaim was generated, so
updating the pod status can be retried for existing ResourceClaims.

The transition from deterministic names is handled with a special case for that
recovery code path: a ResourceClaim with no annotation and a name that follows
the Kubernetes <= 1.27 naming pattern is assumed to be generated for that pod
claim and gets added to the pod status.

There's no immediate need for it, but just in case that it may become relevant,
the name of the generated ResourceClaim may also be left unset to record that
no claim was needed. Components processing such a pod can skip whatever they
normally would do for the claim. To ensure that they do and also cover other
cases properly ("no known field is set", "must check ownership"),
resourceclaim.Name gets extended.
2023-07-11 14:23:48 +02:00
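
For illustration, resolving the generated name from the pod status might look roughly like this. This is a simplified sketch; the commit's extended resourceclaim.Name helper also covers the ownership check and other cases, and the field names follow the published pod status API rather than the commit's wording:

    import v1 "k8s.io/api/core/v1"

    // generatedClaimName looks up the ResourceClaim name recorded in the pod
    // status for one pod claim. A recorded entry with a nil name means that no
    // ResourceClaim was needed for this claim.
    func generatedClaimName(pod *v1.Pod, podClaimName string) (name string, found bool) {
        for _, status := range pod.Status.ResourceClaimStatuses {
            if status.Name == podClaimName {
                if status.ResourceClaimName == nil {
                    return "", false
                }
                return *status.ResourceClaimName, true
            }
        }
        return "", false
    }
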
Kubernetes Prow Robot
d02d8ba635 Merge pull request #118862 from byako/batching-dra-calls
DRA controller: batch resource claims for Allocate
2023-07-06 11:33:03 -07:00
Kubernetes Prow Robot
6f9d1d38d8 Merge pull request #118817 from pohly/dra-delete-claims
DRA: improve handling of completed pods
2023-07-06 10:15:15 -07:00
Alexey Fomenko
b10cc642b5 DRA controller: batch resource claims for Allocate
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
2023-07-06 19:31:45 +03:00
Patrick Ohly
a514f40131 dra resourceclaim controller: delete generated claims when pod is done
When a pod is done but not getting removed for a while, a claim that was
generated for that pod can already be deleted. This then also triggers
deallocation.
2023-07-05 16:10:20 +02:00
Patrick Ohly
e8a0c42212 dra resourceclaim controller: remove reservation for completed pods
When a pod is known to never run (again), the reservation for it also can be
removed. This is relevant in particular for the job controller.
2023-07-05 16:10:20 +02:00
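
A minimal sketch of the "known to never run (again)" condition that gates this cleanup (the controller's real check may consider additional cases):

    import v1 "k8s.io/api/core/v1"

    // podIsDone reports whether the pod has reached a terminal phase and will
    // therefore never consume its reserved claims again.
    func podIsDone(pod *v1.Pod) bool {
        return pod.Status.Phase == v1.PodSucceeded || pod.Status.Phase == v1.PodFailed
    }
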
Patrick Ohly
c903c29c3b e2e: support admissionapi.LevelRestricted in test/e2e/framework/pod
CreatePod and MakePod only accepted an `isPrivileged` boolean, which made it
impossible to write tests using those helpers which work in a default
framework.Framework, because the default there is LevelRestricted.

The simple boolean gets replaced with admissionapi.Level. Passing
LevelRestricted does the same as calling e2epod.MixinRestrictedPodSecurity.

Instead of explicitly passing a constant to these modified helpers, most tests
get updated to pass f.NamespacePodSecurityLevel. This has the advantage
that if that level gets lowered in the future, tests only need to be updated in
one place.

In some cases, helpers taking client+namespace+timeouts parameters get replaced
with passing the Framework instance to get access to
f.NamespacePodSecurityEnforceLevel. These helpers don't need separate
parameters because in practice all they ever used were the values from the
Framework instance.
2023-07-03 16:26:28 +02:00
Kubernetes Prow Robot
ec87834bae Merge pull request #118936 from pohly/dra-deallocate-when-unused
DRA: for delayed allocation, deallocate when no longer used
2023-07-01 12:56:48 -07:00
Patrick Ohly
1b47e6433b dra delayed allocation: deallocate when a pod is done
This releases the underlying resource sooner and ensures that another consumer
can get scheduled without being influenced by a decision that was made for the
previous consumer.

An alternative would have been to have the apiserver trigger the deallocation
whenever it sees the `status.reservedFor` getting reduced to zero. But that
then also triggers deallocation when kube-scheduler removes the last
reservation after a failed scheduling cycle. In that case we want to keep the
claim allocated and let the kube-scheduler decide on a case-by-case basis which
claim should get deallocated.
2023-06-29 09:47:30 +02:00
Patrick Ohly
4a5a242a68 dra e2e: using logging for background activity
ginkgo.By should be used for steps in the test flow. Creating and deleting CDI
files happens in parallel to that. If reported via ginkgo.By, progress reports
look weird because they contain e.g. step "waiting for...." (from the main
test, which is still ongoing) and end with "creating CDI file" (which is
already completed).
2023-06-28 21:48:57 +02:00
Kubernetes Prow Robot
2190775b69 Merge pull request #118280 from stlaz/e2e_psa_labels
Set all PSa labels in tests
2023-06-28 11:14:43 -07:00
Stanislav Laznicka
7f532891c9 e2e tests: set all PSa labels instead of just enforcing 2023-06-21 15:05:13 +02:00
Patrick Ohly
ec70b2ec80 e2e dra: add "kubelet must skip NodePrepareResource if not used by any container"
If (for whatever reason) no container uses a claim, then there's no need to
prepare it.
2023-06-21 10:42:22 +02:00