kubernetes

Author	SHA1	Message	Date
Alvaro Aleman	6d0ac8c561	Use the generic/typed workqueue throughout This change makes us use the generic workqueue throughout the project in order to improve type safety and readability of the code.	2024-05-04 14:33:12 -04:00
Patrick Ohly	77341f7595	DRA: remove support for v1alpha2 kubelet API The v1alpha2 API is several releases old. No current drivers should still depend on it.	2024-04-19 18:27:05 +02:00
Ayato Tokubi	d04f87abde	add nil check for Node(Un)PrepareResources. Signed-off-by: Ayato Tokubi <atokubi@redhat.com>	2024-04-04 23:24:25 +00:00
HirazawaUi	10b6319e64	fix slow dra unit test	2024-03-16 22:21:15 +08:00
Ed Bartosh	26881132bd	kubelet: assign Node as an owner for the ResourceSlice Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>	2024-03-15 09:46:13 +02:00
Patrick Ohly	a0add8d2c7	dra api: NodeResourceModel -> ResourceModel When renaming NodeResourceSlice to ResourceSlice, the embedded [Node]ResourceModel also should have been renamed.	2024-03-14 18:07:36 +01:00
Kevin Klues	fc2134c84c	dra kubelet: fix error log Previously we were returning the error string from 'err' (which is nil), when we should have been returning it from result.Error. Without this it is hard to debug issues with NodeUnprepareResources. Signed-off-by: Kevin Klues <kklues@nvidia.com>	2024-03-11 13:51:29 +00:00
Kevin Klues	13a6dcc21c	dra kubelet: add StructuredResourceModel to UnprepareResources call Signed-off-by: Kevin Klues <kklues@nvidia.com>	2024-03-09 18:08:14 +00:00
Patrick Ohly	0b6a0d686a	dra api: rename NodeResourceSlice -> ResourceSlice While currently those objects only get published by the kubelet for node-local resources, this could change once we also support network-attached resources. Dropping the "Node" prefix enables such a future extension. The NodeName in ResourceSlice and StructuredResourceHandle then becomes optional. The kubelet still needs to provide one and it must match its own node name, otherwise it doesn't have permission to access ResourceSlice objects.	2024-03-07 22:22:55 +01:00
Patrick Ohly	d59676a545	dra kubelet: publish NodeResourceSlices The information is received from the DRA driver plugin through a new gRPC streaming interface. This is backwards compatible with old DRA driver kubelet plugins, their gRPC server will return "not implemented" and that can be handled by kubelet. Therefore no API break is needed. However, DRA drivers need to be updated because the Go API changed. They can return status.New(codes.Unimplemented, "no node resource support").Err() if they don't support the new ListAndWatchResources method and structured parameters. The controller in kubelet then synchronizes this information from the driver with NodeResourceSlice objects, creating, updating and deleting them as needed.	2024-03-07 22:22:13 +01:00
Patrick Ohly	6f1ddfcd2e	kubelet: support structured parameters for preparing resources If the resource handle has data from a structured parameter model, then we need to pass that to the DRA driver kubelet plugin. Because Kubernetes uses gogo/protobuf, we cannot use "optional" for that new optional field and have to resort to "repeated" with a single repetition if present. This is a new, backwards-compatible field. That extending the resource.k8s.io changes the checksum of a kubelet checkpoint is unfortunate. Updating the test cases is a stop-gap measure, the actual solution will have to be something else before beta.	2024-03-07 22:22:13 +01:00
TommyStarK	6f021e99cf	dra: increase timeout in setupFakeDRADriverGRPCServer to prevent tests to flake. Signed-off-by: TommyStarK <thomasmilox@gmail.com>	2024-01-11 09:20:04 +01:00
charles-chenzz	abaf7a800d	increase timeout in fakeDraDriverGrpcServer to fix flake in dra/manager_test	2023-11-07 19:38:27 +08:00
Kubernetes Prow Robot	191abe34b8	Merge pull request #120550 from adrianchiris/fix-dra-node-reboot DRA: call plugins for claims even if exist in cache	2023-10-26 10:26:59 +02:00
adrianc	3738111337	Add unit tests adjust existing tests and add new test flows to cover new DRA manager behaviour Signed-off-by: adrianc <adrianc@nvidia.com>	2023-10-25 13:20:22 +03:00
adrianc	08b942028f	DRA: call plugins for claims even if exist in cache Today, DRA manager does not call plugin NodePrepareResource for claims that it previously successfully handled, that is, if claims are present in cache (checkpoint) even if node rebooted. After node reboots, it is required to call DRA plugin for resource claims so that plugins may prepare them again in case the resources don't persist across a reboot. To achieve that, once kubelet is started, we call DRA plugins for claims once if a pod sandbox is required to be created during PodSync. Signed-off-by: adrianc <adrianc@nvidia.com>	2023-10-25 13:20:16 +03:00
TommyStarK	55e3662b72	dra: refactoring overall flow of prepare/unprepare resources Signed-off-by: TommyStarK <thomasmilox@gmail.com>	2023-10-23 15:11:27 +02:00
Kubernetes Prow Robot	f9f00da6bc	Merge pull request #118761 from TommyStarK/gh_113831 move common logic of highestSupportedVersion to util package	2023-09-18 13:59:25 -07:00
TommyStarK	42356bfbb3	move common logic of highestSupportedVersion to util package Signed-off-by: TommyStarK <thomasmilox@gmail.com>	2023-09-18 21:25:29 +02:00
Kubernetes Prow Robot	82bca6304b	Merge pull request #119464 from TommyStarK/dra/cleanup-manager-unit-tests dra: cleanup manager unit tests	2023-09-18 07:08:43 -07:00
Kubernetes Prow Robot	19deb04a90	Merge pull request #118619 from TommyStarK/gh_113832 dynamic resource allocation: reuse gRPC connection	2023-08-16 09:32:27 -07:00
charles-chenzz	ba9ce3ab08	fix flaky test on dra TestPrepareResources/should_timeout Co-authored-by: TommyStarK <thomasmilox@gmail.com>	2023-08-03 22:37:54 +08:00
TommyStarK	391c1a3ecc	dra: cleanup manager unit tests Signed-off-by: TommyStarK <thomasmilox@gmail.com>	2023-08-02 23:35:45 +02:00
TommyStarK	60a8bca507	dynamic resource allocation: add unit test to check the reuse of the gRPC connection Signed-off-by: TommyStarK <thomasmilox@gmail.com>	2023-07-20 19:22:25 +02:00
TommyStarK	7ffd3063ce	dynamic resource allocation: reuse gRPC connection Signed-off-by: TommyStarK <thomasmilox@gmail.com>	2023-07-19 10:12:52 +02:00
Kevin Klues	0449cef8fd	Increase timeout for DRA kubelet plugin client The 10 second timeout was too low. Given that the retry loop for the kubelet itself is 90s, increasing the timeout to half of this seems reasonable. Ideally we would pull in the variable that sets the retry timeout to 90s and then just set our local timeout to half of that. Unfortunately, this is not exported, so we settle (for now) with just explicitly setting it to 45s. Signed-off-by: Kevin Klues <kklues@nvidia.com>	2023-07-18 22:45:01 +01:00
Ed Bartosh	0ec99fb0b2	Kubelet DRA: fix failing test cases	2023-07-18 19:06:33 +03:00
Ed Bartosh	f6431c6138	DRA: don't query claims from API server When a pod is force-deleted UnprepareResources fails to get a claim from an API server. PrepareResources should cache claim info required by the UnprepareResources so that UnprepareResources would get it from the cache instead of querying API server.	2023-07-18 18:23:10 +03:00
Kubernetes Prow Robot	6d83e22ba4	Merge pull request #118711 from TommyStarK/tom/gh_118436 add unit test for dra/manager.go	2023-07-18 04:17:09 -07:00
charles-chenzz	0372e4b662	add unit test for dra/manager.go. Co-Authored-By: charles-chenzz <Rekles666@gmail.com> Signed-off-by: TommyStarK <thomasmilox@gmail.com>	2023-07-18 12:14:27 +02:00
Kubernetes Prow Robot	bdcf812c95	Merge pull request #118254 from elezar/4009/add-cdi-devices-to-device-plugin Add CDI devices to device plugin API	2023-07-17 05:21:08 -07:00
Kubernetes Prow Robot	047d040ce7	Merge pull request #119012 from pohly/dra-batch-node-prepare kubelet: support batched prepare/unprepare in v1alpha3 DRA plugin API	2023-07-12 10:57:37 -07:00
Kubernetes Prow Robot	be222f38f0	Merge pull request #119058 from TommyStarK/dra-state-checkpoint-unit-test dynamic resource allocation: Improve code coverage of state checkpoint	2023-07-12 07:49:14 -07:00
Patrick Ohly	d743c50bb9	kubelet: support batched prepare/unprepare in v1alpha3 DRA plugin API Combining all prepare/unprepare operations for a pod enables plugins to optimize the execution. Plugins can continue to use the v1alpha2 API for now, but should switch. The new API is designed so that plugins which want to work on each claim one-by-one can do so and then report errors for each claim separately, i.e. partial success is supported.	2023-07-12 14:50:30 +02:00
TommyStarK	f924bf95df	dynamic resource allocation: Improve code coverage of state checkpoint Signed-off-by: TommyStarK <thomasmilox@gmail.com>	2023-07-12 13:27:18 +02:00
Patrick Ohly	444d23bd2f	dra: generated name for ResourceClaim from template Generating the name avoids all potential name collisions. It's not clear how much of a problem that was because users can avoid them and the deterministic names for generic ephemeral volumes have not led to reports from users. But using generated names is not too hard either. What makes it relatively easy is that the new pod.status.resourceClaimStatus map stores the generated name for kubelet and node authorizer, i.e. the information in the pod is sufficient to determine the name of the ResourceClaim. The resource claim controller becomes a bit more complex and now needs permission to modify the pod status. The new failure scenario of "ResourceClaim created, updating pod status fails" is handled with the help of a new special "resource.kubernetes.io/pod-claim-name" annotation that together with the owner reference identifies exactly for what a ResourceClaim was generated, so updating the pod status can be retried for existing ResourceClaims. The transition from deterministic names is handled with a special case for that recovery code path: a ResourceClaim with no annotation and a name that follows the Kubernetes <= 1.27 naming pattern is assumed to be generated for that pod claim and gets added to the pod status. There's no immediate need for it, but just in case that it may become relevant, the name of the generated ResourceClaim may also be left unset to record that no claim was needed. Components processing such a pod can skip whatever they normally would do for the claim. To ensure that they do and also cover other cases properly ("no known field is set", "must check ownership"), resourceclaim.Name gets extended.	2023-07-11 14:23:48 +02:00
Evan Lezar	f0e3c32fe5	Move CDI annotation code to utils package Signed-off-by: Evan Lezar <elezar@nvidia.com>	2023-07-11 11:47:53 +02:00
Kubernetes Prow Robot	7581ae8123	Merge pull request #116739 from moshe010/clone-cdi-devices kubelet dra: lock before getting claimInfo CDIDevices and annotations fields	2023-07-07 06:31:04 -07:00
Patrick Ohly	bde66bfb55	kubelet dra: restore skipping of unused resource claims `1aeec10efb` removed iterating over containers in favor of iterating over pod claims. This had the unintended consequence that NodePrepareResource gets called unnecessarily when no container needs the claim. The more natural behavior is to skip unused resources. This enables (theoretic, at this time) use cases where some DRA driver relies on the controller part to influence scheduling, but then doesn't use CDI with containers.	2023-06-27 16:02:31 +02:00
Patrick Ohly	874daa8b52	kubelet dra: fix checking of second pod which uses a claim When a second pod wanted to use a claim, the obligatory sanity check whether the pod is really allowed to use the claim ("reserved for") was skipped.	2023-06-27 16:01:11 +02:00
Moshe Levi	04ad946e8f	kubelet dra: lock before getting claimInfo CDIDevices and annotations fields Currently claimInfo CDIDevices and annotations are accessed directly without RLock. This can lead to a concurrent read/write error. To avoid it, we take the RLock before getting the CDIDevices and annotations. Signed-off-by: Moshe Levi <moshele@nvidia.com>	2023-05-01 15:09:43 +03:00
Ed Bartosh	1aeec10efb	DRA: get rid of unneeded loops over pod containers	2023-03-15 09:41:30 +02:00
Kubernetes Prow Robot	74123a7341	Merge pull request #116621 from moshe010/dra-lock kubelet dra: add lock to addCDIDevices	2023-03-14 19:27:28 -07:00
Kevin Klues	579295e727	Update kubeletplugin API for DynamicResourceAllocation to v1alpha2 This PR makes the NodePrepareResources() and NodeUnprepareResource() calls of the kubeletplugin API for DynamicResourceAllocation symmetrical. It wasn't clear how one would use the set of CDIDevices passed back in the NodeUnprepareResource() of the v1alpha1 API, and the new API now passes back the full ResourceHandle that was originally passed to the Prepare() call. Passing the ResourceHandle is strictly more informative and a plugin could always (re)derive the set of CDIDevice from it. This is a breaking change, but this release is scheduled to break multiple APIs for DynamicResourceAllocation, so it makes sense to do this now instead of later. Signed-off-by: Kevin Klues <kklues@nvidia.com>	2023-03-14 23:09:44 +00:00
Moshe Levi	ffb07d1e78	kubelet dra: add lock to addCDIDevices Signed-off-by: Moshe Levi <moshele@nvidia.com>	2023-03-15 00:50:45 +02:00
Kevin Klues	74d634a028	Update kubelet support for recent changes to resource.k8s.io/v1alpha2 Signed-off-by: Kevin Klues <kklues@nvidia.com>	2023-03-14 22:34:18 +00:00
Moshe Levi	2a568bcfc8	kubelet podresources: extend List to support Dynamic Resources and implement Get API Signed-off-by: Moshe Levi <moshele@nvidia.com>	2023-03-14 19:33:04 +02:00
Moshe Levi	9c57613912	Add ClassName to checkpoint state and in-memory cache Signed-off-by: Moshe Levi <moshele@nvidia.com>	2023-03-14 19:33:04 +02:00
Patrick Ohly	29941b8d3e	api: resource.k8s.io v1alpha1 -> v1alpha2 For Kubernetes 1.27, we intend to make some breaking API changes: - rename PodScheduling -> PodSchedulingHints (https://github.com/kubernetes/kubernetes/issues/114283) - extend ResourceClaimStatus (https://github.com/kubernetes/enhancements/pull/3802) We need to switch from v1alpha1 to v1alpha2 for that.	2023-03-14 07:52:03 +01:00
Kubernetes Prow Robot	e998b09bc4	Merge pull request #116555 from bart0sh/PR106-dra-plugin-constant DRA: add constant PluginClientTimeout	2023-03-13 17:51:31 -07:00

63 Commits