kubernetes

Author	SHA1	Message	Date
Swati Sehgal	211d8cc80a	node: sample-dp: stubRegisterControlFunc for controlling registration If the user specifies the intent to control registration process, we rely on registration triggers (deletion of control file) to prompt registration. This behvaiour is expected to be consistent across kubelet restarts and therefore across the watch calls where we watch for changes to the unix socket so we make this part of Stub object instead of a parameter. Co-authored-by: Francesco Romani <fromani@redhat.com> Signed-off-by: Swati Sehgal <swsehgal@redhat.com>	2023-10-17 12:19:10 +01:00
Swati Sehgal	c4c9d61d66	node: sample-dp: Handle re-registration for controlled registrations In case `REGISTER_CONTROL_FILE` is specified, we want to ensure that the registration is triggered by deletion of the control file. This is applicable both when the registration happens for the first time and subsequent ones because of kubelet restarts. Signed-off-by: Swati Sehgal <swsehgal@redhat.com>	2023-10-17 12:19:07 +01:00
Swati Sehgal	6714e678d3	node: sample-dp: register by default and re-register on restarts In issue: 115107 we added an environment variable to control the registration of sample device plugin to kubelet. The intent of this patch is to ensure that the default behaviour of the plugin is to register to kubelet (in case no environment variable is specified). In addition to that, we want to ensure that the plugin registers itself not just once. It should re-register itself to kubelet in case of node reboot or kubelet restarts. Signed-off-by: Swati Sehgal <swsehgal@redhat.com>	2023-10-17 12:14:09 +01:00
Gunju Kim	d2b803246a	Don't reuse the device allocated to the restartable init container	2023-10-17 18:28:29 +09:00
Kubernetes Prow Robot	c7d270302c	Merge pull request #121059 from matte21/improve_err_message_in_cpu_assignments Improve error message in Kubelet CPU assignment logic	2023-10-16 16:48:54 +02:00
Kubernetes Prow Robot	0de29e1d43	Merge pull request #120911 from gjkim42/devicemanager-remove-deprecated-sets-string pkg/kubelet/cm: Remove deprecated sets.String and sets.Int	2023-10-16 16:48:40 +02:00
matte21	d4a5a085a8	Improve error message in cpu assignment logic Include number of requested and available CPUs in the error message when the assignment of CPUs fails because there are less available CPUs than requested.	2023-10-09 13:31:37 -04:00
Gunju Kim	8b5f30ef09	Don't reuse CPU set of a restartable init container	2023-10-06 22:16:15 +09:00
matte21	a213edae2a	Add package-level godoc to pkg/kubelet/cm Add file doc.go with some rudimentary information to package kubelet/cm. This will make it easier for people approaching the kubelet codebase for the first time to quickly understand what's in the package, since its name is abbreviated and hostile to newcomers.	2023-10-05 14:20:51 -04:00
Gunju Kim	a0610a97b3	pkg/kubelet/cm: Remove deprecated sets.String and sets.Int This removes deprecated sets.String and sets.Int - replace sets.String with sets.Set[string] - replace sets.Int with sets.Set[int] - replace sets.NewString with sets.New[string] - replace sets.NewInt with sets.New[int] - replace sets.(OLD).List with sets.List(NEW)	2023-09-27 22:02:15 +09:00
Kubernetes Prow Robot	f9f00da6bc	Merge pull request #118761 from TommyStarK/gh_113831 move common logic of highestSupportedVersion to util package	2023-09-18 13:59:25 -07:00
TommyStarK	42356bfbb3	move common logic of highestSupportedVersion to util package Signed-off-by: TommyStarK <thomasmilox@gmail.com>	2023-09-18 21:25:29 +02:00
Kubernetes Prow Robot	82bca6304b	Merge pull request #119464 from TommyStarK/dra/cleanup-manager-unit-tests dra: cleanup manager unit tests	2023-09-18 07:08:43 -07:00
Gunju Kim	b4e5b868a8	Don't reuse memory of a restartable init container	2023-09-17 14:49:15 +09:00
Eric Lin	286628b030	Fix error message for invalid resource reservation Signed-off-by: Eric Lin <exlin@google.com>	2023-08-20 12:55:26 +00:00
Kubernetes Prow Robot	19deb04a90	Merge pull request #118619 from TommyStarK/gh_113832 dynamic resource allocation: reuse gRPC connection	2023-08-16 09:32:27 -07:00
charles-chenzz	ba9ce3ab08	fix flaky test on dra TestPrepareResources/should_timeout Co-authored-by: TommyStarK <thomasmilox@gmail.com>	2023-08-03 22:37:54 +08:00
TommyStarK	391c1a3ecc	dra: cleanup manager unit tests Signed-off-by: TommyStarK <thomasmilox@gmail.com>	2023-08-02 23:35:45 +02:00
TommyStarK	60a8bca507	dynamic resource allocation: add unit test to check the reuse of the gRPC connection Signed-off-by: TommyStarK <thomasmilox@gmail.com>	2023-07-20 19:22:25 +02:00
TommyStarK	7ffd3063ce	dynamic resource allocation: reuse gRPC connection Signed-off-by: TommyStarK <thomasmilox@gmail.com>	2023-07-19 10:12:52 +02:00
Kevin Klues	0449cef8fd	Increase timeout for DRA kubelet plugin client The 10 second timeout was too low. Given that the retry loop for the kubelet itself is 90s, increasing the timeout to half of this seems reasonable. Ideally we would pull in the variable that sets the retry timeout to 90s and then just set our local timeout to half of that. Unfortunately, this is not exported, so we settle (for now with just explicitly setting it to 45s. Signed-off-by: Kevin Klues <kklues@nvidia.com>	2023-07-18 22:45:01 +01:00
Ed Bartosh	0ec99fb0b2	Kubelet DRA: fix failing test cases	2023-07-18 19:06:33 +03:00
Ed Bartosh	f6431c6138	DRA: don't query claims from API server When a pod is force-deleted UnprepareResources fails to get a claim from an API server. PrepareResources should cache claim info required by the UnprepareResources so that UnprepareResources would get it from the cache instead of querying API server.	2023-07-18 18:23:10 +03:00
Kubernetes Prow Robot	6d83e22ba4	Merge pull request #118711 from TommyStarK/tom/gh_118436 add unit test for dra/manager.go	2023-07-18 04:17:09 -07:00
charles-chenzz	0372e4b662	add unit test for dra/manager.go. Co-Authored-By: charles-chenzz <Rekles666@gmail.com> Signed-off-by: TommyStarK <thomasmilox@gmail.com>	2023-07-18 12:14:27 +02:00
Kubernetes Prow Robot	da2fdf8cc3	Merge pull request #118764 from iholder101/Swap/burstableQoS-impl Add full cgroup v2 swap support with automatically calculated swap limit for LimitedSwap and Burstable QoS Pods	2023-07-17 19:49:07 -07:00
Kubernetes Prow Robot	bdcf812c95	Merge pull request #118254 from elezar/4009/add-cdi-devices-to-device-plugin Add CDI devices to device plugin API	2023-07-17 05:21:08 -07:00
Evan Lezar	b57c7e2fe4	Add CDI devices to device plugin API This change adds CDI device IDs to the ContainerAllocateResponse in the device plugin API. This allows a device plugin to specify CDI devices by their unique fully-qualified CDI device names using the related field in the CRI specification. Signed-off-by: Evan Lezar <elezar@nvidia.com>	2023-07-17 11:53:09 +02:00
Kubernetes Prow Robot	900237fada	Merge pull request #118635 from ffromani/devmgr-check-pod-running kubelet: devices: skip allocation for running pods	2023-07-15 05:43:16 -07:00
Kubernetes Prow Robot	cab65e2008	Merge pull request #118816 from PiotrProkop/topo-opts-to-beta topologymanager: Promote support for improved multi-numa alignment in Topology Manager to beta	2023-07-14 16:55:08 -07:00
Itamar Holder	a30410d9ce	LimitedSwap: Automatically configure swap limit for Burstable QoS Pods After this commit, when LimitedSwap is enabled, containers would get swap acess limited with respect the container memory request, total physical memory on the node, and the swap size on the node. Pods of Best-Effort / Guaranteed QoS classes don't get to swap. In addition, container with memory requests that are equal to their memory limits also don't get to swap. The swap limitation is calculated in the following way: 1. Calculate the container's memory proportionate to the node's memory: - Divide the container's memory request by the total node's physical memory. Let's call this value ContainerMemoryProportion. 2. Multiply the container memory proportion by the available swap memory for Pods: Meaning: ContainerMemoryProportion * TotalPodsSwapAvailable. Fore more information: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md Signed-off-by: Itamar Holder <iholder@redhat.com>	2023-07-14 14:52:28 +03:00
Kubernetes Prow Robot	0086712926	Merge pull request #116922 from sourcelliu/checkpoint Improve the performance of map usage	2023-07-12 17:59:30 -07:00
Kubernetes Prow Robot	047d040ce7	Merge pull request #119012 from pohly/dra-batch-node-prepare kubelet: support batched prepare/unprepare in v1alpha3 DRA plugin API	2023-07-12 10:57:37 -07:00
Kubernetes Prow Robot	be222f38f0	Merge pull request #119058 from TommyStarK/dra-state-checkpoint-unit-test dynamic resource allocation: Improve code coverage of state checkpoint	2023-07-12 07:49:14 -07:00
Patrick Ohly	d743c50bb9	kubelet: support batched prepare/unprepare in v1alpha3 DRA plugin API Combining all prepare/unprepare operations for a pod enables plugins to optimize the execution. Plugins can continue to use the v1beta2 API for now, but should switch. The new API is designed so that plugins which want to work on each claim one-by-one can do so and then report errors for each claim separately, i.e. partial success is supported.	2023-07-12 14:50:30 +02:00
TommyStarK	f924bf95df	dynamic resource allocation: Improve code coverage of state checkpoint Signed-off-by: TommyStarK <thomasmilox@gmail.com>	2023-07-12 13:27:18 +02:00
Francesco Romani	c635a7e7d8	node: devicemgr: topomgr: add logs One of the contributing factors of issues #118559 and #109595 hard to debug and fix is that the devicemanager has very few logs in important flow, so it's unnecessarily hard to reconstruct the state from logs. We add minimal logs to be able to improve troubleshooting. We add minimal logs to be backport-friendly, deferring a more comprehensive review of logging to later PRs. Signed-off-by: Francesco Romani <fromani@redhat.com>	2023-07-12 13:25:36 +02:00
Francesco Romani	3bcf4220ec	kubelet: devices: skip allocation for running pods When kubelet initializes, runs admission for pods and possibly allocated requested resources. We need to distinguish between node reboot (no containers running) versus kubelet restart (containers potentially running). Running pods should always survive kubelet restart. This means that device allocation on admission should not be attempted, because if a container requires devices and is still running when kubelet is restarting, that container already has devices allocated and working. Thus, we need to properly detect this scenario in the allocation step and handle it explicitely. We need to inform the devicemanager about which pods are already running. Note that if container runtime is down when kubelet restarts, the approach implemented here won't work. In this scenario, so on kubelet restart containers will again fail admission, hitting https://github.com/kubernetes/kubernetes/issues/118559 again. This scenario should however be pretty rare. Signed-off-by: Francesco Romani <fromani@redhat.com>	2023-07-12 13:25:36 +02:00
Kubernetes Prow Robot	e0dafe57a3	Merge pull request #117351 from pohly/dra-generated-resource-claim-names DRA: generated resource claim names	2023-07-11 10:33:11 -07:00
PiotrProkop	f855a23b45	topologymanager: promote TopologyManagerPolicyOptions feature to beta * Promote TopologyManagerPolicyOptions feature to beta * Promote PreferClosestNUMANodes TopologyManagerPolicyOption to beta Signed-off-by: PiotrProkop <pprokop@nvidia.com>	2023-07-11 15:06:57 +02:00
PiotrProkop	23833b9c81	topologymanager: Increase TopologyManager test coverage by adding negative test cases around NUMA topology discovery Signed-off-by: PiotrProkop <pprokop@nvidia.com>	2023-07-11 15:04:32 +02:00
Patrick Ohly	444d23bd2f	dra: generated name for ResourceClaim from template Generating the name avoids all potential name collisions. It's not clear how much of a problem that was because users can avoid them and the deterministic names for generic ephemeral volumes have not led to reports from users. But using generated names is not too hard either. What makes it relatively easy is that the new pod.status.resourceClaimStatus map stores the generated name for kubelet and node authorizer, i.e. the information in the pod is sufficient to determine the name of the ResourceClaim. The resource claim controller becomes a bit more complex and now needs permission to modify the pod status. The new failure scenario of "ResourceClaim created, updating pod status fails" is handled with the help of a new special "resource.kubernetes.io/pod-claim-name" annotation that together with the owner reference identifies exactly for what a ResourceClaim was generated, so updating the pod status can be retried for existing ResourceClaims. The transition from deterministic names is handled with a special case for that recovery code path: a ResourceClaim with no annotation and a name that follows the Kubernetes <= 1.27 naming pattern is assumed to be generated for that pod claim and gets added to the pod status. There's no immediate need for it, but just in case that it may become relevant, the name of the generated ResourceClaim may also be left unset to record that no claim was needed. Components processing such a pod can skip whatever they normally would do for the claim. To ensure that they do and also cover other cases properly ("no known field is set", "must check ownership"), resourceclaim.Name gets extended.	2023-07-11 14:23:48 +02:00
Kubernetes Prow Robot	bc01306c98	Merge pull request #116738 from AxeZhan/TopologyManagerPolicy When TopologyManagerPolicy is None, skip checks in NewManager.	2023-07-11 04:53:13 -07:00
Evan Lezar	cd14e97ea8	Add a builder for ContainerAllocateResponse objects This chagne introduces a helper to construct ContainerAllocateResponse instances. Test cases are updated to use a new constructor accepting functional options allowing the response contents to be set based on the test requirements. This can then be extended to also test additional fields in the device plugin API such as annotations which are not currently covered or new fields. Signed-off-by: Evan Lezar <elezar@nvidia.com>	2023-07-11 11:48:26 +02:00
Evan Lezar	db2a1edbdd	Generate empty cdi annotations Signed-off-by: Evan Lezar <elezar@nvidia.com>	2023-07-11 11:48:24 +02:00
Evan Lezar	f0e3c32fe5	Move CDI annotation code to utils package Signed-off-by: Evan Lezar <elezar@nvidia.com>	2023-07-11 11:47:53 +02:00
Kubernetes Prow Robot	7581ae8123	Merge pull request #116739 from moshe010/clone-cdi-devices kubelet dra: lock before getting claimInfo CDIDevices and annotations fields	2023-07-07 06:31:04 -07:00
Patrick Ohly	bde66bfb55	kubelet dra: restore skipping of unused resource claims `1aeec10efb` removed iterating over containers in favor of iterating over pod claims. This had the unintended consequence that NodePrepareResource gets called unnecessarily when no container needs the claim. The more natural behavior is to skip unused resources. This enables (theoretic, at this time) use cases where some DRA driver relies on the controller part to influence scheduling, but then doesn't use CDI with containers.	2023-06-27 16:02:31 +02:00
Patrick Ohly	874daa8b52	kubelet dra: fix checking of second pod which uses a claim When a second pod wanted to use a claim, the obligatory sanity check whether the pod is really allowed to use the claim ("reserved for") was skipped.	2023-06-27 16:01:11 +02:00
Kubernetes Prow Robot	299b72c587	Merge pull request #114760 from TommyStarK/unit-tests/pkg-kubelet-cm-containermap kubelet/cm/containermap: Improving test coverage	2023-06-06 11:18:24 -07:00

1 2 3 4 5 ...

1418 Commits