1378 Commits

Author SHA1 Message Date
Kubernetes Prow Robot
fad52aedfc Merge pull request #125086 from oxxenix/exponential-backoff
add exponential backoff in NodeResourceSlices controller
2024-05-28 02:46:43 -07:00
Oksana Baranova
c4ec24890e nodeResourceSlicesController: add exponential backoff 2024-05-27 23:12:53 +03:00
Itamar Holder
a6b971f14b Use kubelet owned directories for mounting rather than /tmp
Signed-off-by: Itamar Holder <iholder@redhat.com>
2024-05-21 13:18:16 +03:00
Itamar Holder
74f29880bd Replace log entry by a warning event
Signed-off-by: Itamar Holder <iholder@redhat.com>
2024-05-21 13:18:16 +03:00
Itamar Holder
29535c0463 Warn if swap is enabled on the OS and tmpfs noswap is not supported
When the --fail-swap-on=false kubelet CLI argument
is provided, but tmpfs noswap is not supported
by the kernel, warn about the risks of memory-backed
volumes being swapped to disk

Signed-off-by: Itamar Holder <iholder@redhat.com>
2024-05-21 13:18:16 +03:00
Kubernetes Prow Robot
8352c09592 Merge pull request #124323 from bart0sh/PR142-dra-fix-cache-integrity
kubelet: DRA: fix cache integrity
2024-05-13 09:54:02 -07:00
Alvaro Aleman
6d0ac8c561 Use the generic/typed workqueue throughout
This change makes us use the generic workqueue throughout the project in
order to improve type safety and readability of the code.
2024-05-04 14:33:12 -04:00
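To illustrate the type-safety argument in the commit above, here is a minimal sketch of a queue parameterized by its item type. `typedQueue` is a hypothetical stand-in written for this example; it is not the workqueue API the commit refers to.

```go
package main

import "fmt"

// typedQueue is a hypothetical, minimal FIFO queue parameterized by item type.
// It only illustrates why a typed queue removes the need for type assertions.
type typedQueue[T any] struct {
	items []T
}

// Add appends an item to the back of the queue.
func (q *typedQueue[T]) Add(item T) {
	q.items = append(q.items, item)
}

// Get removes and returns the oldest item; ok is false when the queue is empty.
func (q *typedQueue[T]) Get() (item T, ok bool) {
	if len(q.items) == 0 {
		return item, false
	}
	item = q.items[0]
	q.items = q.items[1:]
	return item, true
}

func main() {
	q := &typedQueue[string]{}
	q.Add("default/claim-a")
	// key is already a string: no key.(string) assertion is needed, and
	// adding a non-string item would be rejected at compile time.
	key, ok := q.Get()
	fmt.Println(key, ok)
}
```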
Ed Bartosh
f24134d7b2 kubelet: DRA: add unit test for ClaimInfo and claimInfoCache 2024-05-03 13:30:31 +00:00
Ed Bartosh
6ce294558a kubelet: DRA: add stress test
The test calls the PrepareResources and UnprepareResources APIs in
parallel to help discover race conditions.
2024-05-03 13:30:29 +00:00
Kevin Klues
86a18d5333 kubelet: DRA: update manager test to adhere to new claiminfo cache APIs
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2024-05-03 13:28:37 +00:00
Kevin Klues
805e7c3434 kubelet: DRA: remove check to set pluginName to DriverName if not in ResourceHandle
It has always been validated that a ResourceHandle MUST have DriverName set, so
this check is unnecessary.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2024-05-03 13:23:29 +00:00
Kevin Klues
f80be2728e kubelet: DRA: change key of claimInfo cache to "namespace/claimname"
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2024-05-03 13:23:29 +00:00
Kevin Klues
639e887631 kubelet: DRA: add a reconcile loop to unprepare claims for deleted pods
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2024-05-03 13:23:29 +00:00
Kevin Klues
a8931c6c25 kubelet: DRA: update locking/checkpoint semantics of the claimInfo cache
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2024-05-03 13:23:27 +00:00
Kubernetes Prow Robot
1fd835ce59 Merge pull request #123398 from ffromani/remove-legacy-checkpoint
node: devicemgr: remove obsolete pre-1.20 checkpoint file support
2024-04-29 14:46:53 -07:00
Marek Siarkowicz
3ee8178768 Cleanup defer from SetFeatureGateDuringTest function call 2024-04-24 20:25:29 +02:00
Patrick Ohly
77341f7595 DRA: remove support for v1alpha2 kubelet API
The v1alpha2 API is several releases old. No current drivers should still
depend on it.
2024-04-19 18:27:05 +02:00
Kubernetes Prow Robot
bbfd2145de Merge pull request #124091 from bitoku/dra-nil-check
kubelet: add nil check for Node(Un)PrepareResources.
2024-04-18 10:46:05 -07:00
Kubernetes Prow Robot
528cff12f6 Merge pull request #120969 from skitt/uber-go-mock
Switch from golang/mock to uber-go/mock
2024-04-17 23:59:24 -07:00
Francesco Romani
181fb0da51 node: devicemgr: remove obsolete pre-1.20 checkpoint file support
In commit 2f426fdba6 we added
compatibility (and tests) to deal with pre-1.20 checkpoint files.
We are now well past the end of support for pre-1.20 kubelets,
so we can get rid of this code.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2024-04-15 14:01:56 +02:00
Ayato Tokubi
d04f87abde add nil check for Node(Un)PrepareResources.
Signed-off-by: Ayato Tokubi <atokubi@redhat.com>
2024-04-04 23:24:25 +00:00
HirazawaUi
10b6319e64 fix slow dra unit test 2024-03-16 22:21:15 +08:00
Ed Bartosh
26881132bd kubelet: assign Node as an owner for the ResourceSlice
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
2024-03-15 09:46:13 +02:00
Patrick Ohly
a0add8d2c7 dra api: NodeResourceModel -> ResourceModel
When renaming NodeResourceSlice to ResourceSlice, the embedded
[Node]ResourceModel also should have been renamed.
2024-03-14 18:07:36 +01:00
Kevin Klues
fc2134c84c dra kubelet: fix error log
Previously we were returning the error string from 'err' (which is nil), when
we should have been returning it from result.Error. Without this it is hard to
debug issues with NodeUnprepareResources.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2024-03-11 13:51:29 +00:00
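The bug pattern described in the commit above can be sketched as follows. The type and function names are hypothetical and simplified; only the idea (formatting a nil `err` instead of the per-result error string) comes from the commit message.

```go
package main

import "fmt"

// unprepareResult is a hypothetical stand-in for a per-claim response entry
// that carries its own error string.
type unprepareResult struct {
	Error string
}

// describeBuggy mirrors the mistake: the call itself succeeded, so err is nil,
// and formatting it hides the real failure reason.
func describeBuggy(err error, result unprepareResult) error {
	if result.Error != "" {
		return fmt.Errorf("NodeUnprepareResources failed: %v", err)
	}
	return nil
}

// describeFixed reports the error string carried by the result instead.
func describeFixed(result unprepareResult) error {
	if result.Error != "" {
		return fmt.Errorf("NodeUnprepareResources failed: %s", result.Error)
	}
	return nil
}

func main() {
	res := unprepareResult{Error: "device busy"}
	fmt.Println(describeBuggy(nil, res)) // the reason is lost
	fmt.Println(describeFixed(res))      // the reason is preserved
}
```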
Kevin Klues
13a6dcc21c dra kubelet: add StructuredResourceModel to UnprepareResources call
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2024-03-09 18:08:14 +00:00
Patrick Ohly
0b6a0d686a dra api: rename NodeResourceSlice -> ResourceSlice
While currently those objects only get published by the kubelet for node-local
resources, this could change once we also support network-attached
resources. Dropping the "Node" prefix enables such a future extension.

The NodeName in ResourceSlice and StructuredResourceHandle then becomes
optional. The kubelet still needs to provide one and it must match its own node
name, otherwise it doesn't have permission to access ResourceSlice objects.
2024-03-07 22:22:55 +01:00
Patrick Ohly
d59676a545 dra kubelet: publish NodeResourceSlices
The information is received from the DRA driver plugin through a new gRPC
streaming interface. This is backwards compatible with old DRA driver kubelet
plugins: their gRPC server will return "not implemented", which the kubelet
can handle. Therefore no API break is needed.

However, DRA drivers need to be updated because the Go API changed. They can
return
    status.New(codes.Unimplemented, "no node resource support").Err()
if they don't support the new ListAndWatchResources method and
structured parameters.

The controller in kubelet then synchronizes this information from the driver
with NodeResourceSlice objects, creating, updating and deleting them as needed.
2024-03-07 22:22:13 +01:00
Patrick Ohly
6f1ddfcd2e kubelet: support structured parameters for preparing resources
If the resource handle has data from a structured parameter model, then we need
to pass that to the DRA driver kubelet plugin. Because Kubernetes uses
gogo/protobuf, we cannot use "optional" for that new optional field and have to
resort to "repeated" with a single repetition if present.

This is a new, backwards-compatible field.

Unfortunately, extending the resource.k8s.io API changes the checksum of a
kubelet checkpoint. Updating the test cases is a stop-gap measure; the actual
solution will have to be something else before beta.
2024-03-07 22:22:13 +01:00
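The "repeated with a single repetition" convention from the commit above can be pictured roughly like this on the Go side. All type and field names here are hypothetical simplifications, not the generated API types.

```go
package main

import "fmt"

// structuredHandle is a hypothetical placeholder for the optional payload.
type structuredHandle struct {
	Data string
}

// claim models an optional field as a slice:
// zero elements means "absent", exactly one element means "present".
type claim struct {
	Name             string
	StructuredHandle []*structuredHandle
}

// structured returns the optional handle and whether it was set.
func (c claim) structured() (*structuredHandle, bool) {
	if len(c.StructuredHandle) == 0 {
		return nil, false
	}
	return c.StructuredHandle[0], true
}

func main() {
	legacy := claim{Name: "claim-a"}
	modern := claim{Name: "claim-b", StructuredHandle: []*structuredHandle{{Data: "gpu-0"}}}

	_, ok := legacy.structured()
	fmt.Println(legacy.Name, "has structured data:", ok)

	h, ok := modern.structured()
	fmt.Println(modern.Name, "has structured data:", ok, h.Data)
}
```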
Stephen Kitt
6bf667af06 Switch from golang/mock to uber-go/mock
See https://github.com/golang/mock#gomock: golang/mock is no longer
maintained, and should be replaced by go.uber.org/mock.

This allows golang/mock to be dropped from the status and vendored
fields in unwanted-dependencies.json.

Signed-off-by: Stephen Kitt <skitt@redhat.com>
2024-03-07 09:12:16 +01:00
Kubernetes Prow Robot
70383f3701 Merge pull request #119561 from payall4u/fix-kubelet-panic-when-allocate-device
Fix kubelet panic when allocating resources for a pod.
2024-02-29 03:06:54 -08:00
Kubernetes Prow Robot
0f7cc6fcaa Merge pull request #121778 from Tal-or/mm_metrics
kubelet: memorymanager: metrics: add metrics about static allocation
2024-02-20 09:41:50 -08:00
Kubernetes Prow Robot
79e11fe563 Merge pull request #122703 from TommyStarK/fix/dra-manager-should-timeout
dra: increase timeout in setupFakeDRADriverGRPCServer to prevent tests from flaking
2024-02-13 09:33:17 -08:00
Kubernetes Prow Robot
244fbf94fd Merge pull request #122698 from daniel-hutao/feat-1
Code Cleanup: Redundant String Conversions and Spelling/Grammar Corrections
2024-02-05 16:57:07 -08:00
Daniel Hu
1baf7d4586 Corrected some spelling and grammatical errors
Signed-off-by: Daniel Hu <farmer.hutao@outlook.com>
2024-01-27 10:10:25 +08:00
Kubernetes Prow Robot
3da22db11c Merge pull request #121499 from matte21/add-comments-to-cpu-accumulator
Improve understandability of kubelet's cpu accumulator code
2024-01-26 00:56:21 +01:00
Daniel Hu
d652596e42 Remove redundant string conversions in print statements
Signed-off-by: Daniel Hu <farmer.hutao@outlook.com>
2024-01-15 09:57:35 +08:00
TommyStarK
6f021e99cf dra: increase timeout in setupFakeDRADriverGRPCServer to prevent tests from flaking.
Signed-off-by: TommyStarK <thomasmilox@gmail.com>
2024-01-11 09:20:04 +01:00
Akihiro Suda
2e999fff02 Fix compiling e2e.test on macOS
Fix issue 122650 (regression in PR 122552)

```
$ make WHAT=test/e2e/e2e.test
+++ [0109 10:06:53] Building go targets for darwin/amd64
    k8s.io/kubernetes/test/e2e/e2e.test (test)
package k8s.io/kubernetes/test/e2e
        imports k8s.io/kubernetes/test/e2e/common
        imports k8s.io/kubernetes/test/e2e/common/node
        imports k8s.io/kubernetes/pkg/kubelet
        imports github.com/opencontainers/runc/libcontainer/userns: C source files not allowed when not using cgo or SWIG: userns_maps.c
!!! [0109 10:06:54] Call tree:
!!! [0109 10:06:54]  1: /Users/suda/gopath/src/k8s.io/kubernetes/hack/lib/golang.sh:948 kube::golang::build_binaries_for_platform(...)
!!! [0109 10:06:54]  2: hack/make-rules/build.sh:27 kube::golang::build_binaries(...)
!!! [0109 10:06:54] Call tree:
!!! [0109 10:06:54]  1: hack/make-rules/build.sh:27 kube::golang::build_binaries(...)
!!! [0109 10:06:54] Call tree:
!!! [0109 10:06:54]  1: hack/make-rules/build.sh:27 kube::golang::build_binaries(...)
make: *** [all] Error 1
```

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2024-01-09 10:42:20 +09:00
Kubernetes Prow Robot
c96d7a5b5a Merge pull request #121774 from charles-chenzz/increase_timeout_in_dra_shouldTimeOut
increase timeout in fakeDraDriverGrpcServer to fix flake
2024-01-04 17:59:12 +01:00
weilaaa
eb8f3f194f use built-in max and min funcs instead of k8s.io/utils/integer funcs 2023-12-15 15:09:11 +08:00
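For context, Go 1.21 added `min` and `max` as built-in functions, which is what makes a helper package unnecessary for this purpose. A minimal sketch, assuming Go 1.21 or newer; the `clamp` helper is hypothetical and exists only for this example.

```go
package main

import "fmt"

// clamp limits v to the range [lo, hi] using the built-in min and max
// (available since Go 1.21), with no helper package required.
func clamp(v, lo, hi int) int {
	return min(max(v, lo), hi)
}

func main() {
	fmt.Println(min(3, 7), max(3, 7))
	fmt.Println(clamp(15, 0, 10))
}
```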
Kubernetes Prow Robot
3c1356bc9b Merge pull request #119764 from linxiulei/reservedTypo
Fix error message for invalid resource reservation
2023-12-13 21:25:42 +01:00
Talor Itzhak
ddd60de3f3 memorymanager:metrics: add metrics
As part of the memory manager GA graduation effort, we should add
metrics in order to improve observability.

The metrics are also mentioned in PR https://github.com/kubernetes/enhancements/pull/4251 (which has not been merged yet)

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-11-12 09:34:55 +02:00
payall4u
d6b8a660b0 Fix kubelet panic when allocating resources for a pod.
Signed-off-by: payall4u <payall4u@qq.com>
2023-11-12 10:54:05 +08:00
charles-chenzz
abaf7a800d increase timeout in fakeDraDriverGrpcServer to fix flake in dra/manager_test 2023-11-07 19:38:27 +08:00
Kubernetes Prow Robot
960431407c Merge pull request #120715 from gjkim42/do-not-reuse-memory-of-restartable-init-containers
Don't reuse memory of a restartable init container
2023-11-01 01:50:45 +01:00
Kubernetes Prow Robot
a5ff0324a9 Merge pull request #120461 from gjkim42/do-not-reuse-device-of-restartable-init-container
Don't reuse the device of a restartable init container
2023-10-31 19:15:53 +01:00
Kubernetes Prow Robot
bfeb3c2621 Merge pull request #119447 from gjkim42/do-not-reuse-cpu-set-of-restartable-init-container
Don't reuse CPU set of a restartable init container
2023-10-31 19:15:26 +01:00
Kubernetes Prow Robot
191abe34b8 Merge pull request #120550 from adrianchiris/fix-dra-node-reboot
DRA: call plugins for claims even if they exist in cache
2023-10-26 10:26:59 +02:00
adrianc
3738111337 Add unit tests
adjust existing tests and add new test flows
to cover new DRA manager behaviour

Signed-off-by: adrianc <adrianc@nvidia.com>
2023-10-25 13:20:22 +03:00