kubernetes

Author	SHA1	Message	Date
Swati Sehgal	dc1a592632	node: device-mgr: Handle recovery by checking if healthy devices exist In case of node reboot/kubelet restart, the flow of events involves obtaining the state from the checkpoint file followed by setting the `healthDevices`/`unhealthyDevices` to its zero value. This is done to allow the device plugin to re-register itself so that capacity can be updated appropriately. During the allocation phase, we need to check if the resources requested by the pod have been registered AND healthy devices are present on the node to be allocated. Also we need to move this check above `needed==0` where needed is required - devices allocated to the container (which is obtained from the checkpoint file) because even in cases where no additional devices have to be allocated (as they were pre-allocated), we still need to make sure he devices that were previously allocated are healthy. Signed-off-by: Swati Sehgal <swsehgal@redhat.com>	2023-04-28 14:41:30 +01:00
David Porter	9c20cee504	Revert "node: device-mgr: Handle recovery flow by checking if healthy devices exist"	2023-03-07 11:50:52 -08:00
Claudiu Belu	5ba74c81ca	unit tests: Skip flaky tests on Windows Some of the unit tests are currently flaky on Windows. This commit skips them until they are resolved.	2023-03-06 20:46:05 +00:00
Kubernetes Prow Robot	890d39f976	Merge pull request #114640 from swatisehgal/handle-device-mgr-recovery node: device-mgr: Handle recovery flow by checking if healthy devices exist	2023-03-06 07:10:28 -08:00
Swati Sehgal	5b2a3dbbdc	node: device-mgr: explicitly check if pre-allocated devices are healthy Signed-off-by: Swati Sehgal <swsehgal@redhat.com>	2023-03-06 11:52:23 +00:00
Swati Sehgal	a799ffb571	node: device-mgr: unit-tests: admission failure due to unhealthy devices Signed-off-by: Swati Sehgal <swsehgal@redhat.com>	2023-03-06 11:52:23 +00:00
huyinhou	32495ae3f1	add lock in generate topology hints function	2023-02-20 10:56:53 +08:00
huyinhou	4702503d15	update test case Signed-off-by: huyinhou <huyinhou@bytedance.com>	2023-01-03 15:00:12 +08:00
huyinhou	997cefc9da	add unit test	2022-12-29 14:50:18 +08:00
Claudiu Belu	9f95b7b18c	unittests: Fixes unit tests for Windows (part 3) Currently, there are some unit tests that are failing on Windows due to various reasons: - paths not properly joined (filepath.Join should be used). - Proxy Mode IPVS not supported on Windows. - DeadlineExceeded can occur when trying to read data from an UDP socket. This can be used to detect whether the port was closed or not. - In Windows, with long file name support enabled, file names can have up to 32,767 characters. In this case, the error windows.ERROR_FILENAME_EXCED_RANGE will be encountered instead. - files not closed, which means that they cannot be removed / renamed. - time.Now() is not as precise on Windows, which means that 2 consecutive calls may return the same timestamp. - path.Base() will return the same path. filepath.Base() should be used instead. - path.Join() will always join the paths with a / instead of the OS specific separator. filepath.Join() should be used instead.	2022-10-21 19:25:48 +03:00
inosato	3b95d3b076	Remove ioutil in kubelet and its tests Signed-off-by: inosato <si17_21@yahoo.co.jp>	2022-07-30 12:35:26 +09:00
Kevin Klues	57f8b31b42	Update tests to accommodate devicemanager refactoring Signed-off-by: Kevin Klues <kklues@nvidia.com>	2022-04-29 10:52:37 +00:00
waynepeking348	6157d3cc4a	skip deleted activePods and return nil	2022-03-27 20:35:09 +08:00
waynepeking348	35a456b0c6	skip reallocate logic if pod is already removed	2022-03-20 21:09:47 +08:00
Francesco Romani	2f426fdba6	devicemanager: checkpoint: support pre-1.20 data The commit `a8b8995ef2` changed the content of the data kubelet writes in the checkpoint. Unfortunately, the checkpoint restore code was not updated, so if we upgrade kubelet from pre-1.20 to 1.20+, the device manager cannot anymore restore its state correctly. The only trace of this misbehaviour is this line in the kubelet logs: ``` W0615 07:31:49.744770 4852 manager.go:244] Continue after failing to read checkpoint file. Device allocation info may NOT be up-to-date. Err: json: cannot unmarshal array into Go struct field PodDevicesEntry.Data.PodDeviceEntries.DeviceIDs of type checkpoint.DevicesPerNUMA ``` If we hit this bug, the device allocation info is indeed NOT up-to-date up until the device plugins register themselves again. This can take up to few minutes, depending on the specific device plugin. While the device manager state is inconsistent: 1. the kubelet will NOT update the device availability to zero, so the scheduler will send pods towards the inconsistent kubelet. 2. at pod admission time, the device manager allocation will not trigger, so pods will be admitted without devices actually being allocated to them. To fix these issues, we add support to the device manager to read pre-1.20 checkpoint data. We retroactively call this format "v1". Signed-off-by: Francesco Romani <fromani@redhat.com>	2021-10-26 09:54:11 +02:00
Kubernetes Prow Robot	c4d802b0b5	Merge pull request #103289 from AlexeyPerevalov/DoNotExportEmptyTopology podresources: do not export empty NUMA topology	2021-10-07 07:11:46 -07:00
Francesco Romani	1b6efa5e21	devicemanager: skip unhealthy devs in GetAllocatable The GetAllocatableDevices, needed to support the podresources API, doesn't take into account the device health when computing its output. In this PR we address this gap and add unit tests along the way to prevent regressions. This gives us a good initial coverage, E2E tests to cover this case are much harder to write, because we would need to inject faults to trigger the unhealthy status. We will evaluate if adding these tests into later PRs. Signed-off-by: Francesco Romani <fromani@redhat.com>	2021-09-22 19:20:04 +02:00
Alexey Perevalov	bb81101570	podresource: do not export NUMA topology if it's empty If device plugin returns device without topology, keep it internaly as NUMA node -1, it helps at podresources level to not export NUMA topology, otherwise topology is exported with NUMA node id 0, which is not accurate. It's imposible to unveile this bug just by tracing json.Marshal(resp) in podresource client, because NUMANodes field ID has json property omitempty, in this case when ID=0 shown as emtpy NUMANode. To reproduce it, better to iterate on devices and just trace dev.Topology.Nodes[0].ID. Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>	2021-08-24 15:38:21 +00:00
kikimo	86d68effc2	clean code	2021-06-02 09:07:53 +08:00
kikimo	9d2135f703	reuse fake topology manager	2021-06-02 01:35:00 +08:00
kikimo	84a4b40526	fix incompatible interface in fakeTopologyManagerWithHint	2021-05-19 10:12:12 +08:00
kikimo	893ebf3a1c	add a reusable fakeTopologyManagerWithHint{}	2021-05-19 10:07:37 +08:00
kikimo	2ef1f81076	Avoid undesirable allocation when device is associated with multiple NUMA Nodes suppose there are two devices dev1 and dev2, each has NUMA Nodes associated as below: dev1: numa1 dev2: numa1, numa2 and we request a device from numa2, currently filterByAffinity() will return [], [dev1, dev2], [] if loop of available devices produce a sequence of [dev1, dev2], that is is not desirable as what we truely expect is an allocation of dev2 from numa2.	2021-05-19 10:07:37 +08:00
Francesco Romani	ad68f9588c	node: podresources: make GetDevices() consistent We want to make the return type of the GetDevices() method of the podresources DevicesProvider interface consistent with the newly added GetAllocatableDevices type. This makes the code easier to read and reduces the coupling between the podresourcesapi server and the devicemanager code. No intended changes in behaviour, but the different return types now requires some data massaging. Tests are updated accordingly. Signed-off-by: Francesco Romani <fromani@redhat.com>	2021-03-09 13:13:36 +01:00
Kubernetes Prow Robot	06a7e2bacf	Merge pull request #96781 from fighterhit/fix-kukelet-device-plugin-bug Fix: kubelet return error when device plugin sets PreStartRequired true while creating pods with 0 resource	2021-01-25 17:59:00 -08:00
fighterhit	16c6b99fcd	del unused value	2021-01-13 12:43:54 +08:00
fighterhit	24dd9b1f04	add a test to demonstrate PR#96781	2021-01-13 11:27:30 +08:00
jornshen	93606f8ba3	[flaky test] fix devicemanager TestDevicePluginReRegistrationProbeMode fail	2020-12-10 21:07:49 +08:00
Alexey Perevalov	5e6aed4137	Fixes sigfault in case of empty TopologyInfo Device plugin which implements v1beta interface can return nil in Topology field For example nvidia-gpu-deviceplugin `3520254b75/nvidia.go (L147)` Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>	2020-11-13 11:51:47 +03:00
Alexey Perevalov	a8b8995ef2	Implement TopologyInfo and cpu_ids in podresources It covers deviceplugin & cpumanager. It has drawback, since cpuset and all other structs including cadvisor's keep cpu as int, but for protobuf based interface is better to have fixed int. This patch also introduces additional interface CPUsProvider, while DeviceProvider might have been extended too. Checkpoint not covered by unit test. Signed-off-by: Swati Sehgal <swsehgal@redhat.com> Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>	2020-11-11 13:50:49 +03:00
Alexey Perevalov	62326a1846	Convert podDevices to struct PodDevices will have its own guard Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>	2020-11-11 13:50:48 +03:00
Ali	bfdeda58b7	Delete framework/v1alpha1 folder and change remaining import paths	2020-10-23 13:16:13 +11:00
Kubernetes Prow Robot	2e59a17dc1	Merge pull request #92288 from zhijianli88/cleanup-tempfiles Cleanup tempfiles	2020-08-27 17:56:54 -07:00
Kevin Klues	cbd405d85c	Update existing tests in support of GetPreferredallocation()	2020-07-03 13:01:32 +00:00
Li Zhijian	02eaa4f354	cleanup tempfiles in unit test Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>	2020-06-23 11:47:18 +08:00
Abdullah Gharaibeh	d6522e0e74	rename framework pkg with schedulerframework for all instances under pkg/kubelet	2020-04-14 14:24:07 -04:00
Abdullah Gharaibeh	bed9b2f23b	Cleanup obsolete NodeInfo methods	2020-04-12 18:13:46 -04:00
nolancon	a9c6129577	Device Manager - Update unit tests - Pass container to Allocate(). - Loop through containers to call Allocate() on container by container basis.	2020-02-27 07:24:34 +00:00
nolancon	cb9fdc49db	Device Manager - Refactor allocatePodResources - allocatePodResources logic altered to allow for container by container device allocation. - New type PodReusableDevices - New field in devicemanager devicesToReuse	2020-02-27 07:24:34 +00:00
Kevin Klues	a3f099ea4d	Split devicemanager Allocate into two functions Instead of having a single call for Allocate(), we now split this into two functions Allocate() and UpdatePluginResources(). The semantics split across them: // Allocate configures and assigns devices to a pod. From the requested // device resources, Allocate will communicate with the owning device // plugin to allow setup procedures to take place, and for the device // plugin to provide runtime settings to use the device (environment // variables, mount points and device files). Allocate(pod v1.Pod) error // UpdatePluginResources updates node resources based on devices already // allocated to pods. The node object is provided for the device manager to // update the node capacity to reflect the currently available devices. UpdatePluginResources( node schedulernodeinfo.NodeInfo, attrs *lifecycle.PodAdmitAttributes) error As we move to a model in which the TopologyManager is able to ensure aligned allocations from the CPUManager, devicemanger, and any other TopologManager HintProviders in the same synchronous loop, we will need to be able to call Allocate() independently from an UpdatePluginResources(). This commit makes that possible.	2020-02-10 03:27:47 +00:00
Kubernetes Prow Robot	ed10b5b17f	Merge pull request #85047 from yutedz/dev-mgr-err-handling Handle error return from allocatePodResources	2019-11-12 11:51:27 -08:00
David Zhu	802fe12803	Remove plugin watching of deprecated directory {kubelet_root_dir}/plugins and support for CSI V0 in accordance with deprecation announcement in https://v1-13.docs.kubernetes.io/docs/setup/release/notes/	2019-11-11 11:42:58 -08:00
Ted Yu	db0f616974	Handle error return from allocatePodResources	2019-11-09 16:25:15 -08:00
Davanum Srinivas	d30c489c54	Move pkg/kubelet/pluginregistration and deviceplugin Change-Id: I06adcb43bd278b430ffad2010869e1524c8cc4ff	2019-10-06 15:28:38 -04:00
Louise Daly	9a118ceac4	Added stub support for Topology Manager to Device Manager Co-authored-by: Conor Nolan <conor.nolan@intel.com> Co-authored-by: Sreemanti Ghosh <sreemanti.ghosh@intel.com> Co-authored-by: Kevin Klues <kklues@nvidia.com>	2019-08-29 07:45:43 -05:00
Tim Allclair	a2c51674cf	Cleanup more static check issues (S1,ST)	2019-08-21 10:40:21 -07:00
Tim Allclair	6510d26b6a	Fix misc static check issues	2019-08-21 10:40:21 -07:00
Tara Gu	5e18554442	Implement plugin manager - a controller that manages plugin registration/unregistration	2019-05-30 19:00:59 -04:00
Richard Chen	c9f1b57b5b	Reset extended resources only when node is recreated.	2019-05-21 14:16:54 -07:00
yuexiao-wang	f3353c358d	[scheduler cleanup phase 2]: Rename to Signed-off-by: yuexiao-wang <wang.yuexiao@zte.com.cn>	2018-12-11 11:21:12 +08:00

1 2

71 Commits