Commit Graph

1418 Commits

Author SHA1 Message Date
Kevin Klues
f80be2728e kubelet: DRA: change key of claimInfo cache to "namespace/claimname"
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2024-05-03 13:23:29 +00:00
Kevin Klues
639e887631 kubelet: DRA: add a reconcile loop to unprepare claims for deleted pods
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2024-05-03 13:23:29 +00:00
Kevin Klues
a8931c6c25 kubelet: DRA: update locking/checkpoint semantics of the claimInfo cache
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2024-05-03 13:23:27 +00:00
Kubernetes Prow Robot
1fd835ce59 Merge pull request #123398 from ffromani/remove-legacy-checkpoint
node: devicemgr: remove obsolete pre-1.20 checkpoint file support
2024-04-29 14:46:53 -07:00
Marek Siarkowicz
3ee8178768 Cleanup defer from SetFeatureGateDuringTest function call 2024-04-24 20:25:29 +02:00
Patrick Ohly
77341f7595 DRA: remove support for v1alpha2 kubelet API
The v1alpha2 API is several releases old. No current drivers should still
depend on it.
2024-04-19 18:27:05 +02:00
Kubernetes Prow Robot
bbfd2145de Merge pull request #124091 from bitoku/dra-nil-check
kubelet: add nil check for Node(Un)PrepareResources.
2024-04-18 10:46:05 -07:00
Kubernetes Prow Robot
528cff12f6 Merge pull request #120969 from skitt/uber-go-mock
Switch from golang/mock to uber-go/mock
2024-04-17 23:59:24 -07:00
Francesco Romani
181fb0da51 node: devicemgr: remove obsolete pre-1.20 checkpoint file support
In commit 2f426fdba6 we added
compatibility (and tests) to deal with pre-1.20 checkpoint files.
We are now well past the end of support for pre-1.20 kubelets,
so we can get rid of this code.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2024-04-15 14:01:56 +02:00
Ayato Tokubi
d04f87abde add nil check for Node(Un)PrepareResources.
Signed-off-by: Ayato Tokubi <atokubi@redhat.com>
2024-04-04 23:24:25 +00:00
HirazawaUi
10b6319e64 fix slow dra unit test 2024-03-16 22:21:15 +08:00
Ed Bartosh
26881132bd kubelet: assign Node as an owner for the ResourceSlice
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
2024-03-15 09:46:13 +02:00
Patrick Ohly
a0add8d2c7 dra api: NodeResourceModel -> ResourceModel
When renaming NodeResourceSlice to ResourceSlice, the embedded
[Node]ResourceModel also should have been renamed.
2024-03-14 18:07:36 +01:00
Kevin Klues
fc2134c84c dra kubelet: fix error log
Previously we were returning the error string from 'err' (which is nil), when
we should have been returning it from result.Error. Without this it is hard to
debug issues with NodeUnprepareResources.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2024-03-11 13:51:29 +00:00
Kevin Klues
13a6dcc21c dra kubelet: add StructuredResourceModel to UnprepareResources call
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2024-03-09 18:08:14 +00:00
fakecore
4060ee60c1 fix: handle socket file detection on Windows
Update socket file detection logic to use os.Stat as per upstream
Go fix for https://github.com/golang/go/issues/33357. This resolves
the issue where socket files could not be properly identified on
Windows systems.
2024-03-08 18:16:10 +08:00
Patrick Ohly
0b6a0d686a dra api: rename NodeResourceSlice -> ResourceSlice
While currently those objects only get published by the kubelet for node-local
resources, this could change once we also support network-attached
resources. Dropping the "Node" prefix enables such a future extension.

The NodeName in ResourceSlice and StructuredResourceHandle then becomes
optional. The kubelet still needs to provide one and it must match its own node
name, otherwise it doesn't have permission to access ResourceSlice objects.
2024-03-07 22:22:55 +01:00
Patrick Ohly
d59676a545 dra kubelet: publish NodeResourceSlices
The information is received from the DRA driver plugin through a new gRPC
streaming interface. This is backwards compatible with old DRA driver kubelet
plugins, their gRPC server will return "not implemented" and that can be
handled by kubelet. Therefore no API break is needed.

However, DRA drivers need to be updated because the Go API changed. They can
return
    status.New(codes.Unimplemented, "no node resource support").Err()
if they don't support the new ListAndWatchResources method and
structured parameters.

The controller in kubelet then synchronizes this information from the driver
with NodeResourceSlice objects, creating, updating and deleting them as needed.
2024-03-07 22:22:13 +01:00
Patrick Ohly
6f1ddfcd2e kubelet: support structured parameters for preparing resources
If the resource handle has data from a structured parameter model, then we need
to pass that to the DRA driver kubelet plugin. Because Kubernetes uses
gogo/protobuf, we cannot use "optional" for that new optional field and have to
resort to "repeated" with a single repetition if present.

This is a new, backwards-compatible field.

That extending the resource.k8s.io changes the checksum of a kubelet checkpoint
is unfortunate. Updating the test cases is a stop-gap measure, the actual
solution will have to be something else before beta.
2024-03-07 22:22:13 +01:00
Stephen Kitt
6bf667af06 Switch from golang/mock to uber-go/mock
See https://github.com/golang/mock#gomock: golang/mock is no longer
maintained, and should be replaced by go.uber.org/mock.

This allows golang/mock to be dropped from the status and vendored
fields in unwanted-dependencies.json.

Signed-off-by: Stephen Kitt <skitt@redhat.com>
2024-03-07 09:12:16 +01:00
Kubernetes Prow Robot
70383f3701 Merge pull request #119561 from payall4u/fix-kubelet-panic-when-allocate-device
Fix kubelet panic when allocate resource for pod.
2024-02-29 03:06:54 -08:00
Kubernetes Prow Robot
0f7cc6fcaa Merge pull request #121778 from Tal-or/mm_metrics
kubelet: memorymanager: metrics:  add metrics about static allocation
2024-02-20 09:41:50 -08:00
Kubernetes Prow Robot
79e11fe563 Merge pull request #122703 from TommyStarK/fix/dra-manager-should-timeout
dra: increase timeout in setupFakeDRADriverGRPCServer to prevent tests to flake
2024-02-13 09:33:17 -08:00
Kubernetes Prow Robot
244fbf94fd Merge pull request #122698 from daniel-hutao/feat-1
Code Cleanup: Redundant String Conversions and Spelling/Grammar Corrections
2024-02-05 16:57:07 -08:00
Daniel Hu
1baf7d4586 Corrected some spelling and grammatical errors
Signed-off-by: Daniel Hu <farmer.hutao@outlook.com>
2024-01-27 10:10:25 +08:00
Kubernetes Prow Robot
3da22db11c Merge pull request #121499 from matte21/add-comments-to-cpu-accumulator
Improve understandability of kubelet's cpu accumulator code
2024-01-26 00:56:21 +01:00
Daniel Hu
d652596e42 Remove redundant string conversions in print statements
Signed-off-by: Daniel Hu <farmer.hutao@outlook.com>
2024-01-15 09:57:35 +08:00
TommyStarK
6f021e99cf dra: increase timeout in setupFakeDRADriverGRPCServer to prevent tests to flake.
Signed-off-by: TommyStarK <thomasmilox@gmail.com>
2024-01-11 09:20:04 +01:00
Akihiro Suda
2e999fff02 Fix compiling e2e.test on macOS
Fix issue 122650 (regression in PR 122552)

```
$ make WHAT=test/e2e/e2e.test
+++ [0109 10:06:53] Building go targets for darwin/amd64
    k8s.io/kubernetes/test/e2e/e2e.test (test)
package k8s.io/kubernetes/test/e2e
        imports k8s.io/kubernetes/test/e2e/common
        imports k8s.io/kubernetes/test/e2e/common/node
        imports k8s.io/kubernetes/pkg/kubelet
        imports github.com/opencontainers/runc/libcontainer/userns: C source files not allowed when not using cgo or SWIG: userns_maps.c
!!! [0109 10:06:54] Call tree:
!!! [0109 10:06:54]  1: /Users/suda/gopath/src/k8s.io/kubernetes/hack/lib/golang.sh:948 kube::golang::build_binaries_for_platform(...)
!!! [0109 10:06:54]  2: hack/make-rules/build.sh:27 kube::golang::build_binaries(...)
!!! [0109 10:06:54] Call tree:
!!! [0109 10:06:54]  1: hack/make-rules/build.sh:27 kube::golang::build_binaries(...)
!!! [0109 10:06:54] Call tree:
!!! [0109 10:06:54]  1: hack/make-rules/build.sh:27 kube::golang::build_binaries(...)
make: *** [all] Error 1
```

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2024-01-09 10:42:20 +09:00
Kubernetes Prow Robot
c96d7a5b5a Merge pull request #121774 from charles-chenzz/increase_timeout_in_dra_shouldTimeOut
increase timeout in fakeDraDriverGrpcServer to fix flake
2024-01-04 17:59:12 +01:00
weilaaa
eb8f3f194f use build-in max and min func to instead of k8s.io/utils/integer funcs 2023-12-15 15:09:11 +08:00
Kubernetes Prow Robot
3c1356bc9b Merge pull request #119764 from linxiulei/reservedTypo
Fix error message for invalid resource reservation
2023-12-13 21:25:42 +01:00
Talor Itzhak
ddd60de3f3 memorymanager:metrics: add metrics
As part of the memory manager GA graduation effort, we should add
metrics in order to iprove observability.

The metrics also mentioned in the PR https://github.com/kubernetes/enhancements/pull/4251 (which was not merged yet)

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-11-12 09:34:55 +02:00
payall4u
d6b8a660b0 Fix kubelet panic when allocate resource for pod.
Signed-off-by: payall4u <payall4u@qq.com>
2023-11-12 10:54:05 +08:00
charles-chenzz
abaf7a800d increase timeout in fakeDraDriverGrpcServer to fix flake in dra/manger_test 2023-11-07 19:38:27 +08:00
Kubernetes Prow Robot
960431407c Merge pull request #120715 from gjkim42/do-not-reuse-memory-of-restartable-init-containers
Don't reuse memory of a restartable init container
2023-11-01 01:50:45 +01:00
Kubernetes Prow Robot
a5ff0324a9 Merge pull request #120461 from gjkim42/do-not-reuse-device-of-restartable-init-container
Don't reuse the device of a restartable init container
2023-10-31 19:15:53 +01:00
Kubernetes Prow Robot
bfeb3c2621 Merge pull request #119447 from gjkim42/do-not-reuse-cpu-set-of-restartable-init-container
Don't reuse CPU set of a restartable init container
2023-10-31 19:15:26 +01:00
Kubernetes Prow Robot
191abe34b8 Merge pull request #120550 from adrianchiris/fix-dra-node-reboot
DRA: call plugins for claims even if exist in cache
2023-10-26 10:26:59 +02:00
adrianc
3738111337 Add unit tests
adjust existing tests and add new test flows
to cover new DRA manager behaviour

Signed-off-by: adrianc <adrianc@nvidia.com>
2023-10-25 13:20:22 +03:00
adrianc
08b942028f DRA: call plugins for claims even if exist in cache
Today, DRA manager does not call plugin NodePrepareResource
for claims that it previously successfully handled, that is,
if claims are present in cache (checkpoint) even if node
rebooted.

After node reboots, it is required to call DRA plugin
for resource claims so that plugins may prepare them
again in case the resources dont persist reboot.

To achieve that, once kubelet is started, we call DRA
plugins for claims once if a pod sandbox is required
to be created during PodSync.

Signed-off-by: adrianc <adrianc@nvidia.com>
2023-10-25 13:20:16 +03:00
matte21
4bba73a2bd Add comments to cpu accumulator and minor renames
The cpu accumulator logic (that select CPUs for containers)
has some non-obvious code.
This commit adds some comments to make that code easier to
understand for new contributors. Some minor renames to
improve readability are also performed.
2023-10-24 22:49:54 -04:00
Antonio Ojea
8e0be64b8f remove data race on the devicemanager client plugin
Change-Id: I45b85440a792e5ed2f75a344ec1f0332854d8d6d
2023-10-24 21:35:13 +00:00
Shiming Zhang
35f4d29d73 Fix unit test 2023-10-24 11:06:35 +08:00
Kubernetes Prow Robot
76fc18c528 Merge pull request #120099 from TommyStarK/gh_119469
dra: refactoring overall flow of prepare/unprepare resources
2023-10-23 19:51:53 +02:00
TommyStarK
55e3662b72 dra: refactoring overall flow of prepare/unprepare resources
Signed-off-by: TommyStarK <thomasmilox@gmail.com>
2023-10-23 15:11:27 +02:00
Kubernetes Prow Robot
f41ede6241 Merge pull request #118534 from swatisehgal/sample-dp-register-by-default
node: sample-device-plugin: register to kubelet by default and ensure re-registration to kubelet on kubelet restarts
2023-10-23 13:41:19 +02:00
Kubernetes Prow Robot
a7b8357a55 Merge pull request #118165 from champly/master
kubelet: fix comment typo
2023-10-17 23:28:25 +02:00
Swati Sehgal
9a354fc9d0 node: sample-dp: Add retry to handle device plugin restart failure
Add retry mechanism to handle cases where after kubelet restarts, the device
plugin unix socket(s) were created but not ready to serve yet.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-10-17 12:19:10 +01:00
Swati Sehgal
d0d133298d node: sample-dp: Use fsnotify for kubelet restart detection
Add kubeletSocket file to fsnotify instead of polling and waiting for deletion
of device plugin unix socket as a way of detecting kubelet restart. We need to
ensure that the device plugin re-registers itself after kubelet restart depending
on the configured registration mode (auto-registration or controller registration).

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-10-17 12:19:10 +01:00