Commit Graph

1346 Commits

Author SHA1 Message Date
Kubernetes Prow Robot
0f7cc6fcaa
Merge pull request #121778 from Tal-or/mm_metrics
kubelet: memorymanager: metrics:  add metrics about static allocation
2024-02-20 09:41:50 -08:00
Kubernetes Prow Robot
79e11fe563
Merge pull request #122703 from TommyStarK/fix/dra-manager-should-timeout
dra: increase timeout in setupFakeDRADriverGRPCServer to prevent tests to flake
2024-02-13 09:33:17 -08:00
Kubernetes Prow Robot
244fbf94fd
Merge pull request #122698 from daniel-hutao/feat-1
Code Cleanup: Redundant String Conversions and Spelling/Grammar Corrections
2024-02-05 16:57:07 -08:00
Daniel Hu
1baf7d4586 Corrected some spelling and grammatical errors
Signed-off-by: Daniel Hu <farmer.hutao@outlook.com>
2024-01-27 10:10:25 +08:00
Kubernetes Prow Robot
3da22db11c
Merge pull request #121499 from matte21/add-comments-to-cpu-accumulator
Improve understandability of kubelet's cpu accumulator code
2024-01-26 00:56:21 +01:00
Daniel Hu
d652596e42 Remove redundant string conversions in print statements
Signed-off-by: Daniel Hu <farmer.hutao@outlook.com>
2024-01-15 09:57:35 +08:00
TommyStarK
6f021e99cf
dra: increase timeout in setupFakeDRADriverGRPCServer to prevent tests to flake.
Signed-off-by: TommyStarK <thomasmilox@gmail.com>
2024-01-11 09:20:04 +01:00
Akihiro Suda
2e999fff02
Fix compiling e2e.test on macOS
Fix issue 122650 (regression in PR 122552)

```
$ make WHAT=test/e2e/e2e.test
+++ [0109 10:06:53] Building go targets for darwin/amd64
    k8s.io/kubernetes/test/e2e/e2e.test (test)
package k8s.io/kubernetes/test/e2e
        imports k8s.io/kubernetes/test/e2e/common
        imports k8s.io/kubernetes/test/e2e/common/node
        imports k8s.io/kubernetes/pkg/kubelet
        imports github.com/opencontainers/runc/libcontainer/userns: C source files not allowed when not using cgo or SWIG: userns_maps.c
!!! [0109 10:06:54] Call tree:
!!! [0109 10:06:54]  1: /Users/suda/gopath/src/k8s.io/kubernetes/hack/lib/golang.sh:948 kube::golang::build_binaries_for_platform(...)
!!! [0109 10:06:54]  2: hack/make-rules/build.sh:27 kube::golang::build_binaries(...)
!!! [0109 10:06:54] Call tree:
!!! [0109 10:06:54]  1: hack/make-rules/build.sh:27 kube::golang::build_binaries(...)
!!! [0109 10:06:54] Call tree:
!!! [0109 10:06:54]  1: hack/make-rules/build.sh:27 kube::golang::build_binaries(...)
make: *** [all] Error 1
```

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2024-01-09 10:42:20 +09:00
Kubernetes Prow Robot
c96d7a5b5a
Merge pull request #121774 from charles-chenzz/increase_timeout_in_dra_shouldTimeOut
increase timeout in fakeDraDriverGrpcServer to fix flake
2024-01-04 17:59:12 +01:00
weilaaa
eb8f3f194f use build-in max and min func to instead of k8s.io/utils/integer funcs 2023-12-15 15:09:11 +08:00
Kubernetes Prow Robot
3c1356bc9b
Merge pull request #119764 from linxiulei/reservedTypo
Fix error message for invalid resource reservation
2023-12-13 21:25:42 +01:00
Talor Itzhak
ddd60de3f3 memorymanager:metrics: add metrics
As part of the memory manager GA graduation effort, we should add
metrics in order to iprove observability.

The metrics also mentioned in the PR https://github.com/kubernetes/enhancements/pull/4251 (which was not merged yet)

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-11-12 09:34:55 +02:00
charles-chenzz
abaf7a800d increase timeout in fakeDraDriverGrpcServer to fix flake in dra/manger_test 2023-11-07 19:38:27 +08:00
Kubernetes Prow Robot
960431407c
Merge pull request #120715 from gjkim42/do-not-reuse-memory-of-restartable-init-containers
Don't reuse memory of a restartable init container
2023-11-01 01:50:45 +01:00
Kubernetes Prow Robot
a5ff0324a9
Merge pull request #120461 from gjkim42/do-not-reuse-device-of-restartable-init-container
Don't reuse the device of a restartable init container
2023-10-31 19:15:53 +01:00
Kubernetes Prow Robot
bfeb3c2621
Merge pull request #119447 from gjkim42/do-not-reuse-cpu-set-of-restartable-init-container
Don't reuse CPU set of a restartable init container
2023-10-31 19:15:26 +01:00
Kubernetes Prow Robot
191abe34b8
Merge pull request #120550 from adrianchiris/fix-dra-node-reboot
DRA: call plugins for claims even if exist in cache
2023-10-26 10:26:59 +02:00
adrianc
3738111337
Add unit tests
adjust existing tests and add new test flows
to cover new DRA manager behaviour

Signed-off-by: adrianc <adrianc@nvidia.com>
2023-10-25 13:20:22 +03:00
adrianc
08b942028f
DRA: call plugins for claims even if exist in cache
Today, DRA manager does not call plugin NodePrepareResource
for claims that it previously successfully handled, that is,
if claims are present in cache (checkpoint) even if node
rebooted.

After node reboots, it is required to call DRA plugin
for resource claims so that plugins may prepare them
again in case the resources dont persist reboot.

To achieve that, once kubelet is started, we call DRA
plugins for claims once if a pod sandbox is required
to be created during PodSync.

Signed-off-by: adrianc <adrianc@nvidia.com>
2023-10-25 13:20:16 +03:00
matte21
4bba73a2bd Add comments to cpu accumulator and minor renames
The cpu accumulator logic (that select CPUs for containers)
has some non-obvious code.
This commit adds some comments to make that code easier to
understand for new contributors. Some minor renames to
improve readability are also performed.
2023-10-24 22:49:54 -04:00
Antonio Ojea
8e0be64b8f remove data race on the devicemanager client plugin
Change-Id: I45b85440a792e5ed2f75a344ec1f0332854d8d6d
2023-10-24 21:35:13 +00:00
Shiming Zhang
35f4d29d73 Fix unit test 2023-10-24 11:06:35 +08:00
Kubernetes Prow Robot
76fc18c528
Merge pull request #120099 from TommyStarK/gh_119469
dra: refactoring overall flow of prepare/unprepare resources
2023-10-23 19:51:53 +02:00
TommyStarK
55e3662b72
dra: refactoring overall flow of prepare/unprepare resources
Signed-off-by: TommyStarK <thomasmilox@gmail.com>
2023-10-23 15:11:27 +02:00
Kubernetes Prow Robot
f41ede6241
Merge pull request #118534 from swatisehgal/sample-dp-register-by-default
node: sample-device-plugin: register to kubelet by default and ensure re-registration to kubelet on kubelet restarts
2023-10-23 13:41:19 +02:00
Kubernetes Prow Robot
a7b8357a55
Merge pull request #118165 from champly/master
kubelet: fix comment typo
2023-10-17 23:28:25 +02:00
Swati Sehgal
9a354fc9d0 node: sample-dp: Add retry to handle device plugin restart failure
Add retry mechanism to handle cases where after kubelet restarts, the device
plugin unix socket(s) were created but not ready to serve yet.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-10-17 12:19:10 +01:00
Swati Sehgal
d0d133298d node: sample-dp: Use fsnotify for kubelet restart detection
Add kubeletSocket file to fsnotify instead of polling and waiting for deletion
of device plugin unix socket as a way of detecting kubelet restart. We need to
ensure that the device plugin re-registers itself after kubelet restart depending
on the configured registration mode (auto-registration or controller registration).

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-10-17 12:19:10 +01:00
Swati Sehgal
211d8cc80a node: sample-dp: stubRegisterControlFunc for controlling registration
If the user specifies the intent to control registration process, we rely on
registration triggers (deletion of control file) to prompt registration.

This behvaiour is expected to be consistent across kubelet restarts and therefore
across the watch calls where we watch for changes to the unix socket so we make
this part of Stub object instead of a parameter.

Co-authored-by: Francesco Romani <fromani@redhat.com>
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-10-17 12:19:10 +01:00
Swati Sehgal
c4c9d61d66 node: sample-dp: Handle re-registration for controlled registrations
In case `REGISTER_CONTROL_FILE` is specified, we want to ensure that the
registration is triggered by deletion of the control file. This is
applicable both when the registration happens for the first time and
subsequent ones because of kubelet restarts.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-10-17 12:19:07 +01:00
Swati Sehgal
6714e678d3 node: sample-dp: register by default and re-register on restarts
In issue: 115107 we added an environment variable to control the registration of sample
device plugin to kubelet. The intent of this patch is to ensure that the default
behaviour of the plugin is to register to kubelet (in case no environment
variable is specified).

In addition to that, we want to ensure that the plugin registers itself not just once.
It should re-register itself to kubelet in case of node reboot or kubelet restarts.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-10-17 12:14:09 +01:00
Gunju Kim
d2b803246a
Don't reuse the device allocated to the restartable init container 2023-10-17 18:28:29 +09:00
Kubernetes Prow Robot
c7d270302c
Merge pull request #121059 from matte21/improve_err_message_in_cpu_assignments
Improve error message in Kubelet CPU assignment logic
2023-10-16 16:48:54 +02:00
Kubernetes Prow Robot
0de29e1d43
Merge pull request #120911 from gjkim42/devicemanager-remove-deprecated-sets-string
pkg/kubelet/cm: Remove deprecated sets.String and sets.Int
2023-10-16 16:48:40 +02:00
matte21
d4a5a085a8 Improve error message in cpu assignment logic
Include number of requested and available CPUs in the error message
when the assignment of CPUs fails because there are less available
CPUs than requested.
2023-10-09 13:31:37 -04:00
Gunju Kim
8b5f30ef09
Don't reuse CPU set of a restartable init container 2023-10-06 22:16:15 +09:00
matte21
a213edae2a Add package-level godoc to pkg/kubelet/cm
Add file doc.go with some rudimentary information to package
kubelet/cm. This will make it easier for people approaching the
kubelet codebase for the first time to quickly understand what's
in the package, since its name is abbreviated and hostile to
newcomers.
2023-10-05 14:20:51 -04:00
Gunju Kim
a0610a97b3
pkg/kubelet/cm: Remove deprecated sets.String and sets.Int
This removes deprecated sets.String and sets.Int
- replace sets.String with sets.Set[string]
- replace sets.Int with sets.Set[int]
- replace sets.NewString with sets.New[string]
- replace sets.NewInt with sets.New[int]
- replace sets.(OLD).List with sets.List(NEW)
2023-09-27 22:02:15 +09:00
Kubernetes Prow Robot
f9f00da6bc
Merge pull request #118761 from TommyStarK/gh_113831
move common logic of highestSupportedVersion to util package
2023-09-18 13:59:25 -07:00
TommyStarK
42356bfbb3
move common logic of highestSupportedVersion to util package
Signed-off-by: TommyStarK <thomasmilox@gmail.com>
2023-09-18 21:25:29 +02:00
Kubernetes Prow Robot
82bca6304b
Merge pull request #119464 from TommyStarK/dra/cleanup-manager-unit-tests
dra: cleanup manager unit tests
2023-09-18 07:08:43 -07:00
Gunju Kim
b4e5b868a8
Don't reuse memory of a restartable init container 2023-09-17 14:49:15 +09:00
Eric Lin
286628b030 Fix error message for invalid resource reservation
Signed-off-by: Eric Lin <exlin@google.com>
2023-08-20 12:55:26 +00:00
Kubernetes Prow Robot
19deb04a90
Merge pull request #118619 from TommyStarK/gh_113832
dynamic resource allocation: reuse gRPC connection
2023-08-16 09:32:27 -07:00
charles-chenzz
ba9ce3ab08 fix flaky test on dra TestPrepareResources/should_timeout
Co-authored-by: TommyStarK <thomasmilox@gmail.com>
2023-08-03 22:37:54 +08:00
TommyStarK
391c1a3ecc
dra: cleanup manager unit tests
Signed-off-by: TommyStarK <thomasmilox@gmail.com>
2023-08-02 23:35:45 +02:00
TommyStarK
60a8bca507 dynamic resource allocation: add unit test to check the reuse of the gRPC connection
Signed-off-by: TommyStarK <thomasmilox@gmail.com>
2023-07-20 19:22:25 +02:00
TommyStarK
7ffd3063ce dynamic resource allocation: reuse gRPC connection
Signed-off-by: TommyStarK <thomasmilox@gmail.com>
2023-07-19 10:12:52 +02:00
Kevin Klues
0449cef8fd Increase timeout for DRA kubelet plugin client
The 10 second timeout was too low. Given that the retry loop for the
kubelet itself is 90s, increasing the timeout to half of this seems
reasonable. Ideally we would pull in the variable that sets the retry
timeout to 90s and then just set our local timeout to half of that.
Unfortunately, this is not exported, so we settle (for now with just
explicitly setting it to 45s.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2023-07-18 22:45:01 +01:00
Ed Bartosh
0ec99fb0b2 Kubelet DRA: fix failing test cases 2023-07-18 19:06:33 +03:00