Previously, it was possible for reusable CPUs and reusable devices (i.e.
those previously consumed by init containers) to not be reused by
subsequent init containers or app containers if the TopologyManager was
enabled. This would happen because hint generation for the
TopologyManager did not consider these reusable CPUs and devices when
making its hint calculation.
As such, it would sometimes:
1) Generate a hint for a different NUMA node, causing the CPUs and
devices to be allocated from that node instead of the one where the
reusable devices live; or
2) End up concluding that there were not enough CPUs or devices to
allocate and throw a TopologyAffinity admission error.
This patch fixes this by ensuring that reusable CPUs and devices are
considered as part of TopologyHint generation. This functionality is
difficult to unit test since it spans multiple components, but an e2e
test will be added in a subsequent patch to cover it.
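As a rough sketch of the idea (not the actual policy code; the cpuset type and names below are simplified stand-ins), the hint provider now unions the CPUs already handed to the pod's init containers back into the set it treats as available before computing hints:
```
package main

import "fmt"

// cpuSet is a simplified stand-in for the kubelet's cpuset type.
type cpuSet map[int]struct{}

// union returns a new set containing every CPU in either input set.
func (s cpuSet) union(other cpuSet) cpuSet {
	out := cpuSet{}
	for c := range s {
		out[c] = struct{}{}
	}
	for c := range other {
		out[c] = struct{}{}
	}
	return out
}

// availableForHints is the set a hint provider should consider when
// generating TopologyHints for a container: the free pool plus any CPUs
// already assigned to the pod's init containers, since those can be reused.
func availableForHints(free, reusableForPod cpuSet) cpuSet {
	return free.union(reusableForPod)
}

func main() {
	free := cpuSet{4: {}, 5: {}}     // free CPUs on NUMA node 1
	reusable := cpuSet{0: {}, 1: {}} // held by an init container on NUMA node 0
	fmt.Println(len(availableForHints(free, reusable))) // 4 CPUs considered, so node 0 stays viable
}
```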
The expectation is that exclusive CPU allocations happen at pod
creation time. When a container restarts, it should not have its
exclusive CPU allocations removed, and it should not need to
re-allocate CPUs.
There are a few places in the current code that look for containers
that have exited and call CpuManager.RemoveContainer() to clean up
the container. This will end up deleting any exclusive CPU
allocations for that container, and if the container restarts within
the same pod it will end up using the default cpuset rather than
what should be exclusive CPUs.
Removing those calls and adding resource cleanup at allocation
time should get rid of the problem.
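A minimal sketch of the intended behavior, with illustrative names and keying rather than the real kubelet API: allocations made at pod creation stay in the manager's state, and a container (re)start only looks them up instead of re-allocating or falling back to the default cpuset.
```
package main

import "fmt"

// allocKey identifies an allocation by pod and container name rather than
// by container ID, so it survives a container restart within the same pod.
type allocKey struct {
	podUID        string
	containerName string
}

type manager struct {
	assignments map[allocKey][]int // exclusive CPUs allocated at pod creation
	defaultCPUs []int              // shared pool used by non-exclusive containers
}

// cpusFor returns the CPUs a (re)starting container should run on. An
// existing exclusive assignment is reused; nothing is removed on exit.
func (m *manager) cpusFor(podUID, containerName string) []int {
	if cpus, ok := m.assignments[allocKey{podUID, containerName}]; ok {
		return cpus
	}
	return m.defaultCPUs
}

func main() {
	m := &manager{
		assignments: map[allocKey][]int{{"pod-1", "app"}: {2, 3}},
		defaultCPUs: []int{0, 1},
	}
	fmt.Println(m.cpusFor("pod-1", "app")) // [2 3] even after a restart
}
```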
Signed-off-by: Chris Friesen <chris.friesen@windriver.com>
With the old strategy, it was possible for an init container to end up
running without some of its CPUs being exclusive if it requested more
guaranteed CPUs than the sum of all guaranteed CPUs requested by app
containers. Unfortunately, this case was not caught by our unit tests
because they didn't validate the state of the defaultCPUSet to ensure
there was no overlap with CPUs assigned to containers. This patch
updates the strategy to reuse the CPUs assigned to init containers in
subsequent app containers, while avoiding this edge case. It also
updates the unit tests to catch this type of error in the future.
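The invariant the updated unit tests now check can be sketched as follows (hedged: the real tests operate on the policy's state object; the types here are simplified):
```
package main

import "fmt"

// defaultAndAssignedDisjoint reports whether no exclusively assigned CPU
// also appears in the shared default pool. A violation means a container
// believed to have exclusive CPUs is actually sharing them.
func defaultAndAssignedDisjoint(defaultSet map[int]bool, assignments map[string][]int) bool {
	for _, cpus := range assignments {
		for _, c := range cpus {
			if defaultSet[c] {
				return false
			}
		}
	}
	return true
}

func main() {
	defaultSet := map[int]bool{0: true, 1: true}
	assignments := map[string][]int{"init": {1, 2, 3}} // CPU 1 overlaps the default pool
	fmt.Println(defaultAndAssignedDisjoint(defaultSet, assignments)) // false: the bug the old tests missed
}
```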
The cpumanager file-based state backend has been obsolete for a few
releases, since the cpumanager moved to the common checkpointmanager
infrastructure.
The old test checking compatibility to/from the old format is
also no longer needed, because the checkpoint format is stable
(see
https://github.com/kubernetes/kubernetes/tree/master/pkg/kubelet/checkpointmanager).
Signed-off-by: Francesco Romani <fromani@redhat.com>
containerMap is used by the CPU Manager to store information about all containers on the node.
containerMap provides a mapping from (pod, container) -> containerID for all containers in a pod.
It can be reused by other components in pkg/kubelet/cm that need to track changes to all containers on the node.
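A hedged sketch of what such a mapping looks like; the real type lives under pkg/kubelet/cm and its exact methods may differ:
```
package main

import (
	"errors"
	"fmt"
)

// containerMap maps (podUID, containerName) -> containerID for every
// container known on the node.
type containerMap map[string]map[string]string

// Add records the container ID for a (pod, container) pair.
func (cm containerMap) Add(podUID, containerName, containerID string) {
	if cm[podUID] == nil {
		cm[podUID] = map[string]string{}
	}
	cm[podUID][containerName] = containerID
}

// GetContainerID looks up the container ID for a (pod, container) pair.
func (cm containerMap) GetContainerID(podUID, containerName string) (string, error) {
	id, ok := cm[podUID][containerName]
	if !ok {
		return "", errors.New("container not found")
	}
	return id, nil
}

func main() {
	cm := containerMap{}
	cm.Add("pod-uid-1", "app", "abc123")
	id, _ := cm.GetContainerID("pod-uid-1", "app")
	fmt.Println(id) // abc123
}
```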
Signed-off-by: Byonggon Chun <bg.chun@samsung.com>
- Where previously we called manager.AddContainer(), we now call both
manager.Allocate() and manager.AddContainer().
- Some test cases now have two expected errors, one each from
Allocate() and AddContainer(). Existing outcomes are unchanged.
This change will not work on its own. Higher level code needs to make
sure to call Allocate() before AddContainer() is called. This is already
being done in cases when the TopologyManager feature gate is enabled (in
the PodAdmitHandler of the TopologyManager). However, we need to make
sure we add proper logic to call it in cases when the TopologyManager
feature gate is disabled.
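As a sketch of the required ordering (the interface below is a simplified stand-in, not the manager's real signature): allocation happens at admission time, and AddContainer() later binds the running container ID to that allocation.
```
package main

import "fmt"

// cpuManager is a simplified stand-in for the manager described above:
// Allocate() reserves CPUs at pod admission, AddContainer() attaches the
// eventual container ID to that reservation.
type cpuManager interface {
	Allocate(podUID, containerName string) error
	AddContainer(podUID, containerName, containerID string) error
}

// admitThenStart shows the ordering higher-level code must guarantee:
// Allocate() must run before AddContainer(), whether or not the
// TopologyManager feature gate is enabled.
func admitThenStart(m cpuManager, podUID, containerName, containerID string) error {
	if err := m.Allocate(podUID, containerName); err != nil {
		return err // admission-time failure, e.g. not enough exclusive CPUs
	}
	return m.AddContainer(podUID, containerName, containerID)
}

// fakeManager is a trivial implementation used only to make the sketch run.
type fakeManager struct{}

func (fakeManager) Allocate(podUID, containerName string) error { return nil }
func (fakeManager) AddContainer(podUID, containerName, containerID string) error {
	fmt.Printf("bound %s/%s to %s\n", podUID, containerName, containerID)
	return nil
}

func main() {
	_ = admitThenStart(fakeManager{}, "pod-1", "app", "abc123")
}
```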
A recent change made it so that the CPUManager receives a list of
initial containers that exist on the system at startup. This list can be
non-empty, for example, after a kubelet restart.
This commit ensures that the CPUManager's containerMap structure is
initialized with the containers from this list.
This check is redundant since we protect this call with a call to
`m.sourcesReady.AllReady()` earlier on. Moreover, having this check in
place means that we will leave some stale state around in cases where
there are actually no active pods in the system and this loop hasn't
cleaned them up yet. This can happen, for example, if a pod exits while
the kubelet is down for some reason. We see this exact case being
triggered in our e2e tests, where a test has been failing since October
when this change was first introduced.
This change is to prevent problems when we remove the V1->V2 migration
code in the future. Without this, the checksums of all checkpoints would
be hashed with the name CPUManagerCheckpointV2 embedded inside of them,
which is undesirable. We want the checkpoints to be hashed with the name
CPUManagerCheckpoint instead.
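The reason the embedded type name matters can be sketched as follows: the checksum is computed over a dump of the struct that includes its Go type name, so renaming the type changes every checksum. This is a simplified illustration using %#v and fnv; the kubelet's actual hashing helpers differ in detail.
```
package main

import (
	"fmt"
	"hash/fnv"
)

// Two structurally identical checkpoint types; only the Go type name differs.
type CPUManagerCheckpoint struct{ PolicyName string }
type CPUManagerCheckpointV2 struct{ PolicyName string }

// checksum hashes a %#v-style dump of the object. The dump embeds the type
// name, which is why renaming the type invalidates existing checksums.
func checksum(obj interface{}) uint32 {
	h := fnv.New32a()
	fmt.Fprintf(h, "%#v", obj)
	return h.Sum32()
}

func main() {
	a := checksum(CPUManagerCheckpoint{PolicyName: "static"})
	b := checksum(CPUManagerCheckpointV2{PolicyName: "static"})
	fmt.Println(a == b) // false: same contents, different type names
}
```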
The updated CPUManager from PR #84462 implements logic to migrate the
CPUManager checkpoint file from an old format to a new one. To do so, it
defines the following types:
```
type CPUManagerCheckpoint = CPUManagerCheckpointV2
type CPUManagerCheckpointV1 struct { ... }
type CPUManagerCheckpointV2 struct { ... }
```
This replaces the old definition of just:
```
type CPUManagerCheckpoint struct { ... }
```
Code was put in place to ensure proper migration from checkpoints in V1
format to checkpoints in V2 format. However (and this is a big however),
all of the unit tests were performed on V1 checkpoints that were
generated using the type name `CPUManagerCheckpointV1` and not the
original type name of `CPUManagerCheckpoint`. As such, the checksum in
the checkpoint file uses the `CPUManagerCheckpointV1` type to calculate
its checksum and not the original type name of `CPUManagerCheckpoint`.
This causes problems in the real world since all pre-1.18 checkpoint
files will have been generated with the original type name of
`CPUManagerCheckpoint`. When verifying the checksum of the checkpoint
file across an upgrade to 1.18, the checksum is calculated assuming
a type name of `CPUManagerCheckpointV1` (which is incorrect) and the
file is seen to be corrupt.
This patch ensures that all V1 checksums are verified against a type
name of `CPUManagerCheckpoint` instead of `CPUManagerCheckpointV1`.
It also locks the algorithm used to calculate the checksum in place,
since it will never change in the future (for pre-1.18 checkpoint
files at least).
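A hedged sketch of that verification step (simplified types and hashing; the real code uses the kubelet's checkpoint checksum helpers): dump the V1 struct, substitute the original type name before hashing, and compare against the stored checksum.
```
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// Simplified stand-ins for the checkpoint types discussed above.
type CPUManagerCheckpoint struct{ PolicyName string }
type CPUManagerCheckpointV1 struct{ PolicyName string }

// verifyV1Checksum hashes the V1 struct as if it were still named
// CPUManagerCheckpoint, since that is the name pre-1.18 kubelets used when
// writing the checkpoint file.
func verifyV1Checksum(cp CPUManagerCheckpointV1, stored uint32) bool {
	dump := fmt.Sprintf("%#v", cp)
	dump = strings.Replace(dump, "CPUManagerCheckpointV1", "CPUManagerCheckpoint", 1)
	h := fnv.New32a()
	fmt.Fprint(h, dump)
	return h.Sum32() == stored
}

func main() {
	// Simulate the checksum a pre-1.18 kubelet would have written: computed
	// under the original type name.
	old := CPUManagerCheckpoint{PolicyName: "static"}
	h := fnv.New32a()
	fmt.Fprintf(h, "%#v", old)
	stored := h.Sum32()

	// Verifying the same data loaded into the V1 type now succeeds.
	fmt.Println(verifyV1Checksum(CPUManagerCheckpointV1{PolicyName: "static"}, stored)) // true
}
```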
The information associated with these containers is used to migrate
the CPUManager state from its old format to its new one (i.e. keyed off
of podUID and containerName instead of containerID).
For now, we just pass 'nil' as the set of 'initialContainers' for
migrating from old state semantics to new ones. In a subsequent commit
we will pull this information from higher layers so that we can pass it
down at this stage properly.
Previously, the state was keyed off of containerID instead of podUID
and containerName. Unfortunately, this is no longer possible as we move
to a model where we allocate CPUs to containers at pod admit time rather
than at container start time.
This patch is the first step towards full migration to the new
semantics. Only the unit tests in cpumanager/state are passing. In
subsequent commits we will update the CPUManager itself to use these new
semantics.
This patch also includes code to do migration from the old checkpoint format
to the new one, assuming the existence of a ContainerMap with the proper
mapping of (containerID)->(podUID, containerName). A subsequent commit
will update code in higher layers to make sure that this ContainerMap is
made available to this state logic.
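A minimal sketch of that migration, assuming (as described above) that a (containerID) -> (podUID, containerName) mapping is available; all types here are simplified stand-ins:
```
package main

import "fmt"

// podContainer is the new state key.
type podContainer struct {
	podUID        string
	containerName string
}

// containerRef is what the ContainerMap can tell us about a container ID.
type containerRef struct {
	podUID        string
	containerName string
}

// migrateAssignments re-keys old containerID-keyed CPU assignments to the
// new (podUID, containerName) keys. Entries with no known mapping are
// reported so the caller can decide how to handle them.
func migrateAssignments(old map[string]string, byID map[string]containerRef) (map[podContainer]string, []string) {
	migrated := map[podContainer]string{}
	var unknown []string
	for containerID, cpus := range old {
		ref, ok := byID[containerID]
		if !ok {
			unknown = append(unknown, containerID)
			continue
		}
		migrated[podContainer{ref.podUID, ref.containerName}] = cpus
	}
	return migrated, unknown
}

func main() {
	old := map[string]string{"abc123": "2-3"}
	byID := map[string]containerRef{"abc123": {podUID: "pod-1", containerName: "app"}}
	migrated, unknown := migrateAssignments(old, byID)
	fmt.Println(migrated, unknown) // map[{pod-1 app}:2-3] []
}
```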
This ensures that we have the most up-to-date state when generating
topology hints for a container. Without this, it's possible that some
resources will be seen as allocated, when they are actually free.
This will become especially important as we move to a model where
exclusive CPUs are assigned at pod admission time rather than at pod
creation time.
Having this function will allow us to do garbage collection on these
CPUs anytime we are about to allocate CPUs to a new set of containers,
in addition to reclaiming state periodically in the reconcileState()
loop.
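A sketch of what such a function could look like (illustrative names; the real manager tracks more state): it drops assignments for pods that are no longer active, and can be called both right before a new allocation and from the periodic reconcileState() loop.
```
package main

import "fmt"

type stateKey struct {
	podUID        string
	containerName string
}

type manager struct {
	assignments map[stateKey]string // (pod, container) -> exclusive cpuset
}

// removeStaleState releases CPUs held by pods that are no longer active.
// Calling it before every allocation (and periodically from the reconcile
// loop) keeps the free pool accurate without racing container restarts.
func (m *manager) removeStaleState(activePods map[string]bool) {
	for key := range m.assignments {
		if !activePods[key.podUID] {
			delete(m.assignments, key)
		}
	}
}

func main() {
	m := &manager{assignments: map[stateKey]string{
		{"pod-1", "app"}: "2-3",
		{"gone", "app"}:  "4-5",
	}}
	m.removeStaleState(map[string]bool{"pod-1": true})
	fmt.Println(m.assignments) // only pod-1's assignment remains
}
```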