Commit a8b8995ef2
changed the content of the data the kubelet writes in the checkpoint.
Unfortunately, the checkpoint restore code was not updated,
so if the kubelet is upgraded from pre-1.20 to 1.20+, the
device manager can no longer restore its state correctly.
The only trace of this misbehaviour is this line in the
kubelet logs:
```
W0615 07:31:49.744770 4852 manager.go:244] Continue after failing to read checkpoint file. Device allocation info may NOT be up-to-date. Err: json: cannot unmarshal array into Go struct field PodDevicesEntry.Data.PodDeviceEntries.DeviceIDs of type checkpoint.DevicesPerNUMA
```
If we hit this bug, the device allocation info is
indeed NOT up-to-date until the device plugins register
themselves again. This can take up to a few minutes, depending
on the specific device plugin.
While the device manager state is inconsistent:
1. the kubelet will NOT update the device availability to zero, so
the scheduler will send pods towards the inconsistent kubelet.
2. at pod admission time, the device manager allocation will not
trigger, so pods will be admitted without devices actually
being allocated to them.
To fix these issues, we add support in the device manager for
reading pre-1.20 checkpoint data. We retroactively call this
format "v1".
Signed-off-by: Francesco Romani <fromani@redhat.com>
GetAllocatableDevices, which is needed to support the podresources
API, does not take device health into account when computing
its output.
In this PR we address this gap and add unit tests along the way
to prevent regressions. This gives us good initial coverage;
E2E tests covering this case are much harder to write, because
we would need to inject faults to trigger the unhealthy status.
We will evaluate adding those tests in later PRs.
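A minimal sketch of the intended behaviour, assuming the device plugin
API's Device type and Healthy constant; the package, import path and
helper name are illustrative, not the exact kubelet code:
```go
package devicemanager

import (
	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// healthyDeviceIDs filters a plugin's device list down to the devices
// that report a Healthy status, so unhealthy devices never show up as
// allocatable through the podresources API.
func healthyDeviceIDs(devs []pluginapi.Device) []string {
	ids := make([]string, 0, len(devs))
	for _, d := range devs {
		if d.Health == pluginapi.Healthy {
			ids = append(ids, d.ID)
		}
	}
	return ids
}
```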
Signed-off-by: Francesco Romani <fromani@redhat.com>
If a device plugin returns a device without topology, keep it internally
as NUMA node -1. This lets the podresources level avoid exporting a NUMA
topology for it; otherwise the topology is exported with NUMA node id 0,
which is not accurate.
It is impossible to unveil this bug just by tracing json.Marshal(resp)
in the podresources client, because the NUMANode ID field has the json
property omitempty, so when ID=0 the NUMANode is shown as empty.
To reproduce it, it is better to iterate over the devices and just
trace dev.Topology.Nodes[0].ID.
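A minimal sketch of the idea; the sentinel constant and helper are
illustrative names, only the TopologyInfo and NUMANode shapes come from
the device plugin API:
```go
package devicemanager

import (
	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// nonNUMANode marks a device for which the plugin reported no topology.
const nonNUMANode = -1

// numaNodesOf returns the NUMA node ids of a device, or the -1 sentinel
// when the plugin reported no topology, so that podresources can skip
// exporting a (wrong) NUMA node 0 for such devices.
func numaNodesOf(topo *pluginapi.TopologyInfo) []int64 {
	if topo == nil || len(topo.Nodes) == 0 {
		return []int64{nonNUMANode}
	}
	nodes := make([]int64, 0, len(topo.Nodes))
	for _, n := range topo.Nodes {
		nodes = append(nodes, n.ID)
	}
	return nodes
}
```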
Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>
Suppose there are two devices, dev1 and dev2, each with NUMA nodes associated as below:
dev1: numa1
dev2: numa1, numa2
If we request a device from numa2, filterByAffinity() currently returns
[], [dev1, dev2], [] when the loop over available devices produces the sequence [dev1, dev2].
This is not desirable, as what we truly expect is an allocation of dev2 from numa2.
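The desired classification can be sketched as below (illustrative names;
this is not the actual filterByAffinity implementation): a device counts
as matching the affinity if any of its NUMA nodes is in the requested
set, so dev2 lands in the preferred group for a numa2 request.
```go
package devicemanager

// splitByAffinity is an illustrative sketch: devices whose NUMA nodes
// intersect the requested set go into fromAffinity, the rest into
// notFromAffinity, and devices with no topology into withoutTopology.
func splitByAffinity(devNodes map[string][]int64, wanted map[int64]bool) (fromAffinity, notFromAffinity, withoutTopology []string) {
	for id, nodes := range devNodes {
		if len(nodes) == 0 {
			withoutTopology = append(withoutTopology, id)
			continue
		}
		matched := false
		for _, n := range nodes {
			if wanted[n] {
				matched = true
				break
			}
		}
		if matched {
			fromAffinity = append(fromAffinity, id)
		} else {
			notFromAffinity = append(notFromAffinity, id)
		}
	}
	return fromAffinity, notFromAffinity, withoutTopology
}
```
With dev1 -> {1}, dev2 -> {1, 2} and wanted = {2}, this yields
fromAffinity = [dev2] and notFromAffinity = [dev1], regardless of the
iteration order over the devices.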
We want to make the return type of the GetDevices() method of the
podresources DevicesProvider interface consistent with
the newly added GetAllocatableDevices type.
This makes the code easier to read and reduces the coupling between
the podresourcesapi server and the devicemanager code.
No intended changes in behaviour, but the different return types
now require some data massaging. Tests are updated accordingly.
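A rough sketch of the aligned interface; the type definitions and method
signatures below are assumptions for illustration, not the exact kubelet
definitions:
```go
package podresources

import (
	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// ResourceDeviceInstances maps a resource name to its devices,
// keyed by device ID.
type ResourceDeviceInstances map[string]map[string]pluginapi.Device

// DevicesProvider is the view the podresources server needs from the
// device manager; both methods now return the same type.
type DevicesProvider interface {
	GetDevices(podUID, containerName string) ResourceDeviceInstances
	GetAllocatableDevices() ResourceDeviceInstances
}
```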
Signed-off-by: Francesco Romani <fromani@redhat.com>
A device plugin which implements the v1beta1 interface can return nil in
the Topology field.
For example nvidia-gpu-deviceplugin:
3520254b75/nvidia.go (L147)
Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>
A Pod object is more flexible to use and construct.
* Update TestGetTopologyHints() to work according to new test cases
* Update topologyHintTestCase{} to include proper field
Signed-off-by: Krzysztof Wiatrzyk <k.wiatrzyk@samsung.com>
It covers deviceplugin & cpumanager.
It has a drawback: cpuset and all the other structs, including cadvisor's,
keep the cpu id as int, but for a protobuf-based interface it is better
to have a fixed-width integer.
This patch also introduces an additional interface, CPUsProvider, although
DevicesProvider could have been extended instead.
The checkpoint is not covered by a unit test.
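A sketch of the added provider; the method signature is an assumption for
illustration:
```go
package podresources

// CPUsProvider exposes the CPUs assigned to a container.
// CPU ids are returned as int64 to match the fixed-width integers used
// by the protobuf-based podresources interface, even though cpuset and
// cadvisor keep them as plain int internally.
type CPUsProvider interface {
	GetCPUs(podUID, containerName string) []int64
}
```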
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>