kubernetes

Author	SHA1	Message	Date
Francesco Romani	2f426fdba6	devicemanager: checkpoint: support pre-1.20 data The commit `a8b8995ef2` changed the content of the data kubelet writes in the checkpoint. Unfortunately, the checkpoint restore code was not updated, so if we upgrade kubelet from pre-1.20 to 1.20+, the device manager cannot anymore restore its state correctly. The only trace of this misbehaviour is this line in the kubelet logs: ``` W0615 07:31:49.744770 4852 manager.go:244] Continue after failing to read checkpoint file. Device allocation info may NOT be up-to-date. Err: json: cannot unmarshal array into Go struct field PodDevicesEntry.Data.PodDeviceEntries.DeviceIDs of type checkpoint.DevicesPerNUMA ``` If we hit this bug, the device allocation info is indeed NOT up-to-date up until the device plugins register themselves again. This can take up to few minutes, depending on the specific device plugin. While the device manager state is inconsistent: 1. the kubelet will NOT update the device availability to zero, so the scheduler will send pods towards the inconsistent kubelet. 2. at pod admission time, the device manager allocation will not trigger, so pods will be admitted without devices actually being allocated to them. To fix these issues, we add support to the device manager to read pre-1.20 checkpoint data. We retroactively call this format "v1". Signed-off-by: Francesco Romani <fromani@redhat.com>	2021-10-26 09:54:11 +02:00
Kubernetes Prow Robot	c4d802b0b5	Merge pull request #103289 from AlexeyPerevalov/DoNotExportEmptyTopology podresources: do not export empty NUMA topology	2021-10-07 07:11:46 -07:00
Francesco Romani	1b6efa5e21	devicemanager: skip unhealthy devs in GetAllocatable The GetAllocatableDevices, needed to support the podresources API, doesn't take into account the device health when computing its output. In this PR we address this gap and add unit tests along the way to prevent regressions. This gives us a good initial coverage, E2E tests to cover this case are much harder to write, because we would need to inject faults to trigger the unhealthy status. We will evaluate if adding these tests into later PRs. Signed-off-by: Francesco Romani <fromani@redhat.com>	2021-09-22 19:20:04 +02:00
Alexey Perevalov	bb81101570	podresource: do not export NUMA topology if it's empty If device plugin returns device without topology, keep it internaly as NUMA node -1, it helps at podresources level to not export NUMA topology, otherwise topology is exported with NUMA node id 0, which is not accurate. It's imposible to unveile this bug just by tracing json.Marshal(resp) in podresource client, because NUMANodes field ID has json property omitempty, in this case when ID=0 shown as emtpy NUMANode. To reproduce it, better to iterate on devices and just trace dev.Topology.Nodes[0].ID. Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>	2021-08-24 15:38:21 +00:00
kikimo	86d68effc2	clean code	2021-06-02 09:07:53 +08:00
kikimo	9d2135f703	reuse fake topology manager	2021-06-02 01:35:00 +08:00
kikimo	84a4b40526	fix incompatible interface in fakeTopologyManagerWithHint	2021-05-19 10:12:12 +08:00
kikimo	893ebf3a1c	add a reusable fakeTopologyManagerWithHint{}	2021-05-19 10:07:37 +08:00
kikimo	2ef1f81076	Avoid undesirable allocation when device is associated with multiple NUMA Nodes suppose there are two devices dev1 and dev2, each has NUMA Nodes associated as below: dev1: numa1 dev2: numa1, numa2 and we request a device from numa2, currently filterByAffinity() will return [], [dev1, dev2], [] if loop of available devices produce a sequence of [dev1, dev2], that is is not desirable as what we truely expect is an allocation of dev2 from numa2.	2021-05-19 10:07:37 +08:00
Francesco Romani	ad68f9588c	node: podresources: make GetDevices() consistent We want to make the return type of the GetDevices() method of the podresources DevicesProvider interface consistent with the newly added GetAllocatableDevices type. This makes the code easier to read and reduces the coupling between the podresourcesapi server and the devicemanager code. No intended changes in behaviour, but the different return types now requires some data massaging. Tests are updated accordingly. Signed-off-by: Francesco Romani <fromani@redhat.com>	2021-03-09 13:13:36 +01:00
Kubernetes Prow Robot	06a7e2bacf	Merge pull request #96781 from fighterhit/fix-kukelet-device-plugin-bug Fix: kubelet return error when device plugin sets PreStartRequired true while creating pods with 0 resource	2021-01-25 17:59:00 -08:00
fighterhit	16c6b99fcd	del unused value	2021-01-13 12:43:54 +08:00
fighterhit	24dd9b1f04	add a test to demonstrate PR#96781	2021-01-13 11:27:30 +08:00
jornshen	93606f8ba3	[flaky test] fix devicemanager TestDevicePluginReRegistrationProbeMode fail	2020-12-10 21:07:49 +08:00
Alexey Perevalov	5e6aed4137	Fixes sigfault in case of empty TopologyInfo Device plugin which implements v1beta interface can return nil in Topology field For example nvidia-gpu-deviceplugin `3520254b75/nvidia.go (L147)` Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>	2020-11-13 11:51:47 +03:00
Alexey Perevalov	a8b8995ef2	Implement TopologyInfo and cpu_ids in podresources It covers deviceplugin & cpumanager. It has drawback, since cpuset and all other structs including cadvisor's keep cpu as int, but for protobuf based interface is better to have fixed int. This patch also introduces additional interface CPUsProvider, while DeviceProvider might have been extended too. Checkpoint not covered by unit test. Signed-off-by: Swati Sehgal <swsehgal@redhat.com> Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>	2020-11-11 13:50:49 +03:00
Alexey Perevalov	62326a1846	Convert podDevices to struct PodDevices will have its own guard Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>	2020-11-11 13:50:48 +03:00
Ali	bfdeda58b7	Delete framework/v1alpha1 folder and change remaining import paths	2020-10-23 13:16:13 +11:00
Kubernetes Prow Robot	2e59a17dc1	Merge pull request #92288 from zhijianli88/cleanup-tempfiles Cleanup tempfiles	2020-08-27 17:56:54 -07:00
Kevin Klues	cbd405d85c	Update existing tests in support of GetPreferredallocation()	2020-07-03 13:01:32 +00:00
Li Zhijian	02eaa4f354	cleanup tempfiles in unit test Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>	2020-06-23 11:47:18 +08:00
Abdullah Gharaibeh	d6522e0e74	rename framework pkg with schedulerframework for all instances under pkg/kubelet	2020-04-14 14:24:07 -04:00
Abdullah Gharaibeh	bed9b2f23b	Cleanup obsolete NodeInfo methods	2020-04-12 18:13:46 -04:00
nolancon	a9c6129577	Device Manager - Update unit tests - Pass container to Allocate(). - Loop through containers to call Allocate() on container by container basis.	2020-02-27 07:24:34 +00:00
nolancon	cb9fdc49db	Device Manager - Refactor allocatePodResources - allocatePodResources logic altered to allow for container by container device allocation. - New type PodReusableDevices - New field in devicemanager devicesToReuse	2020-02-27 07:24:34 +00:00
Kevin Klues	a3f099ea4d	Split devicemanager Allocate into two functions Instead of having a single call for Allocate(), we now split this into two functions Allocate() and UpdatePluginResources(). The semantics split across them: // Allocate configures and assigns devices to a pod. From the requested // device resources, Allocate will communicate with the owning device // plugin to allow setup procedures to take place, and for the device // plugin to provide runtime settings to use the device (environment // variables, mount points and device files). Allocate(pod v1.Pod) error // UpdatePluginResources updates node resources based on devices already // allocated to pods. The node object is provided for the device manager to // update the node capacity to reflect the currently available devices. UpdatePluginResources( node schedulernodeinfo.NodeInfo, attrs *lifecycle.PodAdmitAttributes) error As we move to a model in which the TopologyManager is able to ensure aligned allocations from the CPUManager, devicemanger, and any other TopologManager HintProviders in the same synchronous loop, we will need to be able to call Allocate() independently from an UpdatePluginResources(). This commit makes that possible.	2020-02-10 03:27:47 +00:00
Kubernetes Prow Robot	ed10b5b17f	Merge pull request #85047 from yutedz/dev-mgr-err-handling Handle error return from allocatePodResources	2019-11-12 11:51:27 -08:00
David Zhu	802fe12803	Remove plugin watching of deprecated directory {kubelet_root_dir}/plugins and support for CSI V0 in accordance with deprecation announcement in https://v1-13.docs.kubernetes.io/docs/setup/release/notes/	2019-11-11 11:42:58 -08:00
Ted Yu	db0f616974	Handle error return from allocatePodResources	2019-11-09 16:25:15 -08:00
Davanum Srinivas	d30c489c54	Move pkg/kubelet/pluginregistration and deviceplugin Change-Id: I06adcb43bd278b430ffad2010869e1524c8cc4ff	2019-10-06 15:28:38 -04:00
Louise Daly	9a118ceac4	Added stub support for Topology Manager to Device Manager Co-authored-by: Conor Nolan <conor.nolan@intel.com> Co-authored-by: Sreemanti Ghosh <sreemanti.ghosh@intel.com> Co-authored-by: Kevin Klues <kklues@nvidia.com>	2019-08-29 07:45:43 -05:00
Tim Allclair	a2c51674cf	Cleanup more static check issues (S1,ST)	2019-08-21 10:40:21 -07:00
Tim Allclair	6510d26b6a	Fix misc static check issues	2019-08-21 10:40:21 -07:00
Tara Gu	5e18554442	Implement plugin manager - a controller that manages plugin registration/unregistration	2019-05-30 19:00:59 -04:00
Richard Chen	c9f1b57b5b	Reset extended resources only when node is recreated.	2019-05-21 14:16:54 -07:00
yuexiao-wang	f3353c358d	[scheduler cleanup phase 2]: Rename to Signed-off-by: yuexiao-wang <wang.yuexiao@zte.com.cn>	2018-12-11 11:21:12 +08:00
saad-ali	8f666d9e41	Modify kubelet watcher to support old versions Modify kubelet plugin watcher to support older CSI drivers that use an the old plugins directory for socket registration. Also modify CSI plugin registration to support multiple versions of CSI registering with the same name.	2018-11-21 18:37:31 -08:00
Vladimir Vivien	b195396154	Kubelet Plugin Registration v1 update fix	2018-11-15 17:40:35 -05:00
Christoph Blecker	97b2992dc1	Update gofmt for go1.11	2018-10-05 12:59:38 -07:00
Renaud Gaubert	8dd1d27c03	Updated the device manager pluginwatcher handler	2018-09-06 15:34:46 +02:00
Jiaying Zhang	7b1ae66432	Fail container start if its requested device plugin resource doesn't have cached option state to make sure the device plugin resource is in ready state when we start the container.	2018-08-08 13:11:36 -07:00
hui luo	7101c17498	While reviewing devicemanager code, found the caching layer on endpoint is redundant. Here are the 3 related objects in picture: devicemanager <-> endpoint <-> plugin Plugin is the source of truth for devices and device health status. devicemanager maintain healthyDevices, unhealthyDevices, allocatedDevices based on updates from plugin. So there is no point for endpoint caching devices, this patch is removing this caching layer on endpoint, Also removing the Manager.Devices() since i didn't find any caller of this other than test, i am adding a notification channel to facilitate testing, If we need to get all devices from manager in future, it just need to return healthyDevices + unhealthyDevices, we don't have to call endpoint after all. This patch makes code more readable, data model been simplified.	2018-07-29 21:07:14 -07:00
vikaschoudhary16	a5842503eb	Use probe based plugin discovery mechanism in device manager	2018-07-17 04:02:31 -04:00
Guoliang Wang	761cf41427	Move pkg/scheduler/schedulercache -> pkg/scheduler/cache	2018-05-31 22:55:34 +08:00
Kubernetes Submit Queue	15cc20630d	Merge pull request #60034 from pohly/device-manager-goroutine Automatic merge from submit-queue (batch tested with PRs 58474, 60034, 62101, 63198). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. avoid race condition in device manager and plugin startup/shutdown: wait for goroutines What this PR does / why we need it: Commit `1325c2f` worked around issue #59488, but it is still worthwhile to fix the underlying root cause properly. Which issue(s) this PR fixes: Fixes #59488 Special notes for your reviewer: This is an alternative to PR #59861, which used a different approach. Personally I tend to prefer this one now. Release note: ```release-note NONE ``` /sig node /area hw-accelerators /assign vikaschoudhary16	2018-04-30 13:24:08 -07:00
vikaschoudhary16	cedbd93255	Make 'pod' package to use unified checkpointManager Signed-off-by: vikaschoudhary16 <choudharyvikas16@gmail.com>	2018-04-16 01:30:20 -04:00
vikaschoudhary16	d62bd9ef65	Node-level Checkpointing manager	2018-04-16 00:19:42 -04:00
Patrick Ohly	fcbb64b93d	avoid race condition in device manager and plugin startup/shutdown A flaky test exposed a race condition where shutting down one server instance broke the startup of the next instance when using the same socket path. Commit `1325c2f8be` removed the reuse of the same socket path and thus avoided the issue. But the real fix is to ensure that the listening socket is really closed once Stop returns. Two solutions were proposed in https://github.com/grpc/grpc-go/issues/1861: - waiting for the goroutine to complete - closing the socket The former is done here because it's cleaner to not keep lingering goroutines. While at it, the Stop methods are made idempotent (similar to e.g. Close on a socket) and no longer crash when called without prior Start. Fixes https://github.com/kubernetes/kubernetes/issues/59488	2018-04-12 17:59:10 +02:00
Jiaying Zhang	5514a1f4dd	Fixes the races around devicemanager Allocate() and endpoint deletion. There is a race in predicateAdmitHandler Admit() that getNodeAnyWayFunc() could get Node with non-zero deviceplugin resource allocatable for a non-existing endpoint. That race can happen when a device plugin fails, but is more likely when kubelet restarts as with the current registration model, there is a time gap between kubelet restart and device plugin re-registration. During this time window, even though devicemanager could have removed the resource initially during GetCapacity() call, Kubelet may overwrite the device plugin resource capacity/allocatable with the old value when node update from the API server comes in later. This could cause a pod to be started without proper device runtime config set. To solve this problem, introduce endpointStopGracePeriod. When a device plugin fails, don't immediately remove the endpoint but set stopTime in its endpoint. During kubelet restart, create endpoints with stopTime set for any checkpointed registered resource. The endpoint is considered to be in stopGracePeriod if its stoptime is set. This allows us to track what resources should be handled by devicemanager during the time gap. When an endpoint's stopGracePeriod expires, we remove the endpoint and its resource. This allows the resource to be exported through other channels (e.g., by directly updating node status through API server) if there is such use case. Currently endpointStopGracePeriod is set as 5 minutes. Given that an endpoint is no longer immediately removed upon disconnection, mark all its devices unhealthy so that we can signal the resource allocatable change to the scheduler to avoid scheduling more pods to the node. When a device plugin endpoint is in stopGracePeriod, pods requesting the corresponding resource will fail admission handler.	2018-03-09 17:00:57 -08:00
Jiaying Zhang	07beac6004	Made a couple API changes to deviceplugin/v1beta1 to avoid future incompatible changes: - Add GetDevicePluginOptions rpc call. This is needed when we switch from Registration service to probe-based plugin watcher. - Change AllocateRequest and AllocateResponse to allow device requests from multiple containers in a pod. Currently only made mechanical change on the devicemanager and test code to cope with the API but still issues an Allocate call per container. We can modify the devicemanager in 1.11 to issue a single Allocate call per pod. The change will also facilitate incremental API change to communicate pod level information through Allocate rpc if there is such future need.	2018-02-23 16:15:09 -08:00

1 2

57 Commits