
1336 Commits

Author SHA1 Message Date
Kubernetes Prow Robot
c96d7a5b5a
Merge pull request #121774 from charles-chenzz/increase_timeout_in_dra_shouldTimeOut
increase timeout in fakeDraDriverGrpcServer to fix flake
2024-01-04 17:59:12 +01:00
weilaaa
eb8f3f194f use built-in max and min funcs instead of k8s.io/utils/integer funcs 2023-12-15 15:09:11 +08:00
Kubernetes Prow Robot
3c1356bc9b
Merge pull request #119764 from linxiulei/reservedTypo
Fix error message for invalid resource reservation
2023-12-13 21:25:42 +01:00
charles-chenzz
abaf7a800d increase timeout in fakeDraDriverGrpcServer to fix flake in dra/manager_test 2023-11-07 19:38:27 +08:00
Kubernetes Prow Robot
960431407c
Merge pull request #120715 from gjkim42/do-not-reuse-memory-of-restartable-init-containers
Don't reuse memory of a restartable init container
2023-11-01 01:50:45 +01:00
Kubernetes Prow Robot
a5ff0324a9
Merge pull request #120461 from gjkim42/do-not-reuse-device-of-restartable-init-container
Don't reuse the device of a restartable init container
2023-10-31 19:15:53 +01:00
Kubernetes Prow Robot
bfeb3c2621
Merge pull request #119447 from gjkim42/do-not-reuse-cpu-set-of-restartable-init-container
Don't reuse CPU set of a restartable init container
2023-10-31 19:15:26 +01:00
Kubernetes Prow Robot
191abe34b8
Merge pull request #120550 from adrianchiris/fix-dra-node-reboot
DRA: call plugins for claims even if exist in cache
2023-10-26 10:26:59 +02:00
adrianc
3738111337
Add unit tests
adjust existing tests and add new test flows
to cover new DRA manager behaviour

Signed-off-by: adrianc <adrianc@nvidia.com>
2023-10-25 13:20:22 +03:00
adrianc
08b942028f
DRA: call plugins for claims even if exist in cache
Today, the DRA manager does not call the plugin's NodePrepareResource
for claims that it previously handled successfully, that is,
claims that are present in the cache (checkpoint), even if the node
has rebooted.

After a node reboots, the DRA plugin must be called
for resource claims so that plugins may prepare them
again in case the resources don't persist across a reboot.

To achieve that, once kubelet is started, we call DRA
plugins for claims once if a pod sandbox is required
to be created during PodSync.

Signed-off-by: adrianc <adrianc@nvidia.com>
2023-10-25 13:20:16 +03:00
Antonio Ojea
8e0be64b8f remove data race on the devicemanager client plugin
Change-Id: I45b85440a792e5ed2f75a344ec1f0332854d8d6d
2023-10-24 21:35:13 +00:00
Shiming Zhang
35f4d29d73 Fix unit test 2023-10-24 11:06:35 +08:00
Kubernetes Prow Robot
76fc18c528
Merge pull request #120099 from TommyStarK/gh_119469
dra: refactoring overall flow of prepare/unprepare resources
2023-10-23 19:51:53 +02:00
TommyStarK
55e3662b72
dra: refactoring overall flow of prepare/unprepare resources
Signed-off-by: TommyStarK <thomasmilox@gmail.com>
2023-10-23 15:11:27 +02:00
Kubernetes Prow Robot
f41ede6241
Merge pull request #118534 from swatisehgal/sample-dp-register-by-default
node: sample-device-plugin: register to kubelet by default and ensure re-registration to kubelet on kubelet restarts
2023-10-23 13:41:19 +02:00
Kubernetes Prow Robot
a7b8357a55
Merge pull request #118165 from champly/master
kubelet: fix comment typo
2023-10-17 23:28:25 +02:00
Swati Sehgal
9a354fc9d0 node: sample-dp: Add retry to handle device plugin restart failure
Add a retry mechanism to handle cases where, after the kubelet restarts, the device
plugin unix socket(s) have been created but are not yet ready to serve.
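
The retry idea described above can be sketched roughly as follows. This is an illustrative stand-in, not the sample device plugin's code: the helper name `retry`, the sentinel `errNotReady`, and the attempt count and delay are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady is a hypothetical sentinel standing in for
// "socket exists but is not serving yet".
var errNotReady = errors.New("socket not ready")

// retry calls fn up to attempts times, sleeping delay between failed tries.
// It returns nil on the first success, or the last error wrapped.
func retry(attempts int, delay time.Duration, fn func() error) error {
	if attempts < 1 {
		return errors.New("attempts must be at least 1")
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Simulate a socket that only starts serving on the third attempt.
	err := retry(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errNotReady
		}
		return nil
	})
	fmt.Println(calls, err) // prints "3 <nil>"
}
```

In a real plugin, `fn` would wrap the registration or dial call against the kubelet socket.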

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-10-17 12:19:10 +01:00
Swati Sehgal
d0d133298d node: sample-dp: Use fsnotify for kubelet restart detection
Add kubeletSocket file to fsnotify instead of polling and waiting for deletion
of device plugin unix socket as a way of detecting kubelet restart. We need to
ensure that the device plugin re-registers itself after kubelet restart depending
on the configured registration mode (auto-registration or controller registration).

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-10-17 12:19:10 +01:00
Swati Sehgal
211d8cc80a node: sample-dp: stubRegisterControlFunc for controlling registration
If the user specifies the intent to control registration process, we rely on
registration triggers (deletion of control file) to prompt registration.

This behaviour is expected to be consistent across kubelet restarts, and therefore
across the watch calls where we watch for changes to the unix socket, so we make
this part of the Stub object instead of a parameter.

Co-authored-by: Francesco Romani <fromani@redhat.com>
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-10-17 12:19:10 +01:00
Swati Sehgal
c4c9d61d66 node: sample-dp: Handle re-registration for controlled registrations
In case `REGISTER_CONTROL_FILE` is specified, we want to ensure that the
registration is triggered by deletion of the control file. This is
applicable both when the registration happens for the first time and
subsequent ones because of kubelet restarts.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-10-17 12:19:07 +01:00
Swati Sehgal
6714e678d3 node: sample-dp: register by default and re-register on restarts
In issue: 115107 we added an environment variable to control the registration of sample
device plugin to kubelet. The intent of this patch is to ensure that the default
behaviour of the plugin is to register to kubelet (in case no environment
variable is specified).

In addition to that, we want to ensure that the plugin registers itself not just once.
It should re-register itself to kubelet in case of node reboot or kubelet restarts.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-10-17 12:14:09 +01:00
Gunju Kim
d2b803246a
Don't reuse the device allocated to the restartable init container 2023-10-17 18:28:29 +09:00
Kubernetes Prow Robot
c7d270302c
Merge pull request #121059 from matte21/improve_err_message_in_cpu_assignments
Improve error message in Kubelet CPU assignment logic
2023-10-16 16:48:54 +02:00
Kubernetes Prow Robot
0de29e1d43
Merge pull request #120911 from gjkim42/devicemanager-remove-deprecated-sets-string
pkg/kubelet/cm: Remove deprecated sets.String and sets.Int
2023-10-16 16:48:40 +02:00
matte21
d4a5a085a8 Improve error message in cpu assignment logic
Include the number of requested and available CPUs in the error message
when the assignment of CPUs fails because there are fewer available
CPUs than requested.
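
As a rough illustration of the kind of message this describes, the sketch below puts both counts into the error. The function name `checkCPUs` and the exact wording of the message are assumptions for this example, not taken from the commit.

```go
package main

import "fmt"

// checkCPUs is a hypothetical helper: it fails when more CPUs are
// requested than are available and reports both numbers in the error.
func checkCPUs(available, requested int) error {
	if requested > available {
		return fmt.Errorf("not enough cpus available to satisfy request: requested=%d, available=%d",
			requested, available)
	}
	return nil
}

func main() {
	fmt.Println(checkCPUs(2, 4))
	fmt.Println(checkCPUs(4, 2))
}
```

With both numbers in the message, a reader of the log can see the size of the shortfall without extra debugging.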
2023-10-09 13:31:37 -04:00
Gunju Kim
8b5f30ef09
Don't reuse CPU set of a restartable init container 2023-10-06 22:16:15 +09:00
matte21
a213edae2a Add package-level godoc to pkg/kubelet/cm
Add file doc.go with some rudimentary information to package
kubelet/cm. This will make it easier for people approaching the
kubelet codebase for the first time to quickly understand what's
in the package, since its name is abbreviated and hostile to
newcomers.
2023-10-05 14:20:51 -04:00
Gunju Kim
a0610a97b3
pkg/kubelet/cm: Remove deprecated sets.String and sets.Int
This removes deprecated sets.String and sets.Int
- replace sets.String with sets.Set[string]
- replace sets.Int with sets.Set[int]
- replace sets.NewString with sets.New[string]
- replace sets.NewInt with sets.New[int]
- replace sets.(OLD).List with sets.List(NEW)
2023-09-27 22:02:15 +09:00
Kubernetes Prow Robot
f9f00da6bc
Merge pull request #118761 from TommyStarK/gh_113831
move common logic of highestSupportedVersion to util package
2023-09-18 13:59:25 -07:00
TommyStarK
42356bfbb3
move common logic of highestSupportedVersion to util package
Signed-off-by: TommyStarK <thomasmilox@gmail.com>
2023-09-18 21:25:29 +02:00
Kubernetes Prow Robot
82bca6304b
Merge pull request #119464 from TommyStarK/dra/cleanup-manager-unit-tests
dra: cleanup manager unit tests
2023-09-18 07:08:43 -07:00
Gunju Kim
b4e5b868a8
Don't reuse memory of a restartable init container 2023-09-17 14:49:15 +09:00
Eric Lin
286628b030 Fix error message for invalid resource reservation
Signed-off-by: Eric Lin <exlin@google.com>
2023-08-20 12:55:26 +00:00
Kubernetes Prow Robot
19deb04a90
Merge pull request #118619 from TommyStarK/gh_113832
dynamic resource allocation: reuse gRPC connection
2023-08-16 09:32:27 -07:00
charles-chenzz
ba9ce3ab08 fix flaky test on dra TestPrepareResources/should_timeout
Co-authored-by: TommyStarK <thomasmilox@gmail.com>
2023-08-03 22:37:54 +08:00
TommyStarK
391c1a3ecc
dra: cleanup manager unit tests
Signed-off-by: TommyStarK <thomasmilox@gmail.com>
2023-08-02 23:35:45 +02:00
TommyStarK
60a8bca507 dynamic resource allocation: add unit test to check the reuse of the gRPC connection
Signed-off-by: TommyStarK <thomasmilox@gmail.com>
2023-07-20 19:22:25 +02:00
TommyStarK
7ffd3063ce dynamic resource allocation: reuse gRPC connection
Signed-off-by: TommyStarK <thomasmilox@gmail.com>
2023-07-19 10:12:52 +02:00
Kevin Klues
0449cef8fd Increase timeout for DRA kubelet plugin client
The 10 second timeout was too low. Given that the retry loop for the
kubelet itself is 90s, increasing the timeout to half of this seems
reasonable. Ideally we would pull in the variable that sets the retry
timeout to 90s and then just set our local timeout to half of that.
Unfortunately, this is not exported, so we settle (for now) for
explicitly setting it to 45s.
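
The "half of the retry window" idea can be sketched as below. The constant names and the `callPlugin` wrapper are hypothetical; the commit itself hard-codes 45s because the kubelet's 90s value is not exported.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// kubeletRetryTimeout mirrors the 90s kubelet retry loop mentioned above.
// Hypothetical name: the real variable is not exported.
const kubeletRetryTimeout = 90 * time.Second

// pluginClientTimeout is half of the retry window, i.e. 45s.
const pluginClientTimeout = kubeletRetryTimeout / 2

// callPlugin runs call with a context that expires after pluginClientTimeout.
func callPlugin(ctx context.Context, call func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, pluginClientTimeout)
	defer cancel()
	return call(ctx)
}

func main() {
	err := callPlugin(context.Background(), func(ctx context.Context) error {
		_, hasDeadline := ctx.Deadline()
		fmt.Println(hasDeadline) // prints "true"
		return nil
	})
	fmt.Println(pluginClientTimeout, err) // prints "45s <nil>"
}
```

Deriving the value from a shared constant would keep the two timeouts in step if the retry window ever changed, which is the design the commit message says it would prefer.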

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2023-07-18 22:45:01 +01:00
Ed Bartosh
0ec99fb0b2 Kubelet DRA: fix failing test cases 2023-07-18 19:06:33 +03:00
Ed Bartosh
f6431c6138 DRA: don't query claims from API server
When a pod is force-deleted, UnprepareResources fails to get the claim
from the API server.
PrepareResources should cache the claim info required by
UnprepareResources so that UnprepareResources can get it from
the cache instead of querying the API server.
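
A minimal sketch of that caching pattern follows, assuming an in-memory map keyed by "namespace/name". The types `claimInfo` and `claimCache`, their fields, and the method names are hypothetical and do not reflect the kubelet's actual data structures.

```go
package main

import (
	"fmt"
	"sync"
)

// claimInfo holds what the unprepare step needs later.
// Field names are hypothetical.
type claimInfo struct {
	DriverName string
	CDIDevices []string
}

// claimCache is a concurrency-safe map from claim key to claim info.
type claimCache struct {
	mu     sync.Mutex
	claims map[string]claimInfo
}

func newClaimCache() *claimCache {
	return &claimCache{claims: make(map[string]claimInfo)}
}

// prepare records the claim info at prepare time.
func (c *claimCache) prepare(key string, info claimInfo) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.claims[key] = info
}

// unprepare reads the claim info from the cache only (no API lookup)
// and then forgets it. It reports false if the claim is not cached.
func (c *claimCache) unprepare(key string) (claimInfo, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	info, ok := c.claims[key]
	delete(c.claims, key)
	return info, ok
}

func main() {
	cache := newClaimCache()
	cache.prepare("default/gpu-claim", claimInfo{DriverName: "example.com/driver"})

	info, ok := cache.unprepare("default/gpu-claim")
	fmt.Println(info.DriverName, ok) // prints "example.com/driver true"

	_, ok = cache.unprepare("default/gpu-claim")
	fmt.Println(ok) // prints "false"
}
```

Because the unprepare path never touches the API server, it still works when the claim object can no longer be fetched for a force-deleted pod.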
2023-07-18 18:23:10 +03:00
Kubernetes Prow Robot
6d83e22ba4
Merge pull request #118711 from TommyStarK/tom/gh_118436
add unit test for dra/manager.go
2023-07-18 04:17:09 -07:00
charles-chenzz
0372e4b662 add unit test for dra/manager.go.
Co-Authored-By: charles-chenzz <Rekles666@gmail.com>
Signed-off-by: TommyStarK <thomasmilox@gmail.com>
2023-07-18 12:14:27 +02:00
Kubernetes Prow Robot
da2fdf8cc3
Merge pull request #118764 from iholder101/Swap/burstableQoS-impl
Add full cgroup v2 swap support with automatically calculated swap limit for LimitedSwap and Burstable QoS Pods
2023-07-17 19:49:07 -07:00
Kubernetes Prow Robot
bdcf812c95
Merge pull request #118254 from elezar/4009/add-cdi-devices-to-device-plugin
Add CDI devices to device plugin API
2023-07-17 05:21:08 -07:00
Evan Lezar
b57c7e2fe4 Add CDI devices to device plugin API
This change adds CDI device IDs to the ContainerAllocateResponse in the
device plugin API. This allows a device plugin to specify CDI devices
by their unique fully-qualified CDI device names using the related field
in the CRI specification.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
2023-07-17 11:53:09 +02:00
Kubernetes Prow Robot
900237fada
Merge pull request #118635 from ffromani/devmgr-check-pod-running
kubelet: devices: skip allocation for running pods
2023-07-15 05:43:16 -07:00
Kubernetes Prow Robot
cab65e2008
Merge pull request #118816 from PiotrProkop/topo-opts-to-beta
topologymanager: Promote support for improved multi-numa alignment in Topology Manager to beta
2023-07-14 16:55:08 -07:00
Itamar Holder
a30410d9ce LimitedSwap: Automatically configure swap limit for Burstable QoS Pods
After this commit, when LimitedSwap is enabled,
containers get swap access limited with respect to
the container memory request, the total physical memory
on the node, and the swap size on the node.

Pods of Best-Effort / Guaranteed QoS classes don't get
to swap. In addition, containers whose memory requests
are equal to their memory limits also don't get to
swap.

The swap limitation is calculated in the following way:
1. Calculate the container's memory proportionate to the node's memory:
- Divide the container's memory request by the total node's physical memory.
  Let's call this value ContainerMemoryProportion.

2. Multiply the container memory proportion by the available
swap memory for Pods:
Meaning: ContainerMemoryProportion * TotalPodsSwapAvailable.
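
The two steps above can be sketched as a small function. This is only an illustration of the arithmetic in the commit message: the function and parameter names are assumptions, and the kubelet's real implementation may round or guard differently.

```go
package main

import "fmt"

// swapLimitBytes follows the two-step calculation described above.
// All names are hypothetical; this is not the kubelet's code.
func swapLimitBytes(containerMemoryRequest, nodeMemoryCapacity, totalPodsSwapAvailable int64) int64 {
	if containerMemoryRequest <= 0 || nodeMemoryCapacity <= 0 {
		return 0
	}
	// Step 1: ContainerMemoryProportion = request / node physical memory.
	proportion := float64(containerMemoryRequest) / float64(nodeMemoryCapacity)
	// Step 2: ContainerMemoryProportion * TotalPodsSwapAvailable.
	return int64(proportion * float64(totalPodsSwapAvailable))
}

func main() {
	const gib = int64(1) << 30
	// A 2 GiB request on an 8 GiB node with 4 GiB of swap for pods:
	// proportion = 0.25, so the limit is 1 GiB.
	fmt.Println(swapLimitBytes(2*gib, 8*gib, 4*gib)) // prints "1073741824"
}
```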

For more information:
https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md

Signed-off-by: Itamar Holder <iholder@redhat.com>
2023-07-14 14:52:28 +03:00
Kubernetes Prow Robot
0086712926
Merge pull request #116922 from sourcelliu/checkpoint
Improve the performance of map usage
2023-07-12 17:59:30 -07:00