kubernetes

Author	SHA1	Message	Date
boenn	cec2aae1e5	rebase master	2021-11-25 11:21:12 +08:00
Kevin Klues	f8511877e2	Add regression test for CPUManager distribute NUMA algorithm We witnessed this exact allocation attempt in a live cluster and witnessed the algorithm fail with an accounting error. This test was added to verify that this case is now handled by the updates to the algorithm and that we don't regress from it in the future. "test" description="ensure previous failure encountered on live machine has been fixed (1/1)" "combo remainderSet balance" combo=[2 4 6] remainderSet=[2 4 6] distribution=9 remainder=1 available=[14 2 4 4 0 3 4 1] balance=4.031 "combo remainderSet balance" combo=[2 4 6] remainderSet=[2 4] distribution=9 remainder=1 available=[0 3 4 1 14 2 4 4] balance=4.031 "combo remainderSet balance" combo=[2 4 6] remainderSet=[2 6] distribution=9 remainder=1 available=[1 14 2 4 4 0 3 4] balance=4.031 "combo remainderSet balance" combo=[2 4 6] remainderSet=[4 6] distribution=9 remainder=1 available=[1 3 4 0 14 2 4 4] balance=4.031 "combo remainderSet balance" combo=[2 4 6] remainderSet=[2] distribution=9 remainder=1 available=[4 0 3 4 1 14 2 4] balance=4.031 "combo remainderSet balance" combo=[2 4 6] remainderSet=[4] distribution=9 remainder=1 available=[3 4 0 14 2 4 4 1] balance=4.031 "combo remainderSet balance" combo=[2 4 6] remainderSet=[6] distribution=9 remainder=1 available=[1 13 2 4 4 1 3 4] balance=3.606 "bestCombo found" distribution=9 bestCombo=[2 4 6] bestRemainder=[6] Signed-off-by: Kevin Klues <kklues@nvidia.com>	2021-11-24 20:49:58 +00:00
Kevin Klues	e284c74d93	Add unit test for CPUManager distribute NUMA algorithm verifying fixes Before Change: "test" description="ensure bestRemainder chosen with NUMA nodes that have enough CPUs to satisfy the request" "combo remainderSet balance" combo=[0 1 2 3] remainderSet=[0 1] distribution=8 remainder=2 available=[-1 -1 0 6] balance=2.915 "combo remainderSet balance" combo=[0 1 2 3] remainderSet=[0 2] distribution=8 remainder=2 available=[-1 0 -1 6] balance=2.915 "combo remainderSet balance" combo=[0 1 2 3] remainderSet=[0 3] distribution=8 remainder=2 available=[5 -1 0 0] balance=2.345 "combo remainderSet balance" combo=[0 1 2 3] remainderSet=[1 2] distribution=8 remainder=2 available=[0 -1 -1 6] balance=2.915 "combo remainderSet balance" combo=[0 1 2 3] remainderSet=[1 3] distribution=8 remainder=2 available=[0 -1 0 5] balance=2.345 "combo remainderSet balance" combo=[0 1 2 3] remainderSet=[2 3] distribution=8 remainder=2 available=[0 0 -1 5] balance=2.345 "bestCombo found" distribution=8 bestCombo=[0 1 2 3] bestRemainder=[0 3] --- FAIL: TestTakeByTopologyNUMADistributed (0.01s) --- FAIL: TestTakeByTopologyNUMADistributed/ensure_bestRemainder_chosen_with_NUMA_nodes_that_have_enough_CPUs_to_satisfy_the_request (0.00s) cpu_assignment_test.go:867: unexpected error [accounting error, not enough CPUs allocated, remaining: 1] After Change: "test" description="ensure bestRemainder chosen with NUMA nodes that have enough CPUs to satisfy the request" "combo remainderSet balance" combo=[0 1 2 3] remainderSet=[3] distribution=8 remainder=2 available=[0 0 0 4] balance=1.732 "bestCombo found" distribution=8 bestCombo=[0 1 2 3] bestRemainder=[3] SUCCESS Signed-off-by: Kevin Klues <kklues@nvidia.com>	2021-11-24 20:45:37 +00:00
Kevin Klues	031f11513d	Fix accounting bug in CPUManager distribute NUMA policy Without this fix, the algorithm may decide to allocate "remainder" CPUs from a NUMA node that has no more CPUs to allocate. Moreover, it was only considering allocation of remainder CPUs from NUMA nodes such that each NUMA node in the remainderSet could only allocate 1 (i.e. 'cpuGroupSize') more CPUs. With these two issues in play, one could end up with an accounting error where not enough CPUs were allocated by the time the algorithm runs to completion. The updated algorithm will now omit any NUMA nodes that have 0 CPUs left from the set of NUMA nodes considered for allocating remainder CPUs. Additionally, we now consider all combinations of nodes from the remainder set of size 1..len(remainderSet). This allows us to find a better solution if allocating CPUs from a smaller set leads to a more balanced allocation. Finally, we loop through all NUMA nodes 1-by-1 in the remainderSet until all remainder CPUs have been accounted for and allocated. This ensures that we will not hit an accounting error later on because we explicitly remove CPUs from the remainder set until there are none left. A follow-on commit adds a set of unit tests that will fail before these changes, but succeed after them. Signed-off-by: Kevin Klues <kklues@nvidia.com>	2021-11-24 19:18:11 +00:00
Kevin Klues	5317a2e2ac	Fix error handling in CPUManager distribute NUMA tests Signed-off-by: Kevin Klues <kklues@nvidia.com>	2021-11-24 16:51:31 +00:00
Kevin Klues	dc4430b663	Add a sum() helper to the CPUManager cpuassignment logic Signed-off-by: Kevin Klues <kklues@nvidia.com>	2021-11-24 16:51:29 +00:00
Kevin Klues	cfacc22459	Allow the map.Values() function in the CPUManager to take a set of keys Signed-off-by: Kevin Klues <kklues@nvidia.com>	2021-11-24 16:51:28 +00:00
Kevin Klues	a160d9a8cd	Fix CPUManager algo to calculate min NUMA nodes needed for distribution Previously the algorithm was too restrictive because it tried to calculate the minimum based on the number of available NUMA nodes and the number of available CPUs on those NUMA nodes. Since there was no (easy) way to tell how many CPUs an individual NUMA node happened to have, the average across them was used. Using this value, however, could result in thinking you need more NUMA nodes to possibly satisfy a request than you actually do. By using the total number of NUMA nodes and CPUs per NUMA node, we can get the true minimum number of nodes required to satisfy a request. For a given "current" allocation this may not be the true minimum, but it's better to start with fewer and move up than to start with too many and miss out on a better option. Signed-off-by: Kevin Klues <kklues@nvidia.com>	2021-11-24 16:51:26 +00:00
Kevin Klues	209cd20548	Fix unit tests following bug fix in CPUManager for map functions (2/2) Now that the algorithm for balancing CPU distributions across NUMA nodes is correct, this test actually behaves differently for the "packed" vs. "distributed" allocation algorithms (as it should). In the "packed" case we need to ensure that CPUs are allocated such that they are packed onto cores. Since one CPU is already allocated from a core on NUMA node 0, we want the next CPU to be its hyperthreaded pair (even though the first available CPU id is on Socket 1). In the "distributed" case, however, we want to ensure CPUs are allocated such that we have a balanced distribution of CPUs across all NUMA nodes. This points to allocating from Socket 1 if the only other CPU allocated has been done on Socket 0. To allow CPU allocations to be packed onto full cores, one can allocate them from the "distributed" algorithm with a 'cpuGroupSize' equal to the number of hyperthreads per core (in this case 2). We added an explicit test case for this, demonstrating that we get the same result as the "packed" algorithm does, even though the "distributed" algorithm is in use. Signed-off-by: Kevin Klues <kklues@nvidia.com>	2021-11-24 16:51:24 +00:00
Kevin Klues	67f719cb1d	Fix unit tests following bug fix in CPUManager for map functions (1/2) This fixes two related tests to better test our "balanced" distribution algorithm. The first test originally provided an input with the following number of CPUs available on each NUMA node: Node 0: 16 Node 1: 20 Node 2: 20 Node 3: 20 It then attempted to distribute 48 CPUs across them with an expectation that each of the first 3 NUMA nodes would have 16 CPUs taken from them (leaving Node 0 with no more CPUs in the end). This would have resulted in the following amount of CPUs on each node: Node 0: 0 Node 1: 4 Node 2: 4 Node 3: 20 Which results in a standard deviation of 7.6811 However, a more balanced solution would actually be to pull 16 CPUs from NUMA nodes 1, 2, and 3, and leave 0 untouched, i.e.: Node 0: 16 Node 1: 4 Node 2: 4 Node 3: 4 Which results in a standard deviation of 5.1961524227066 To fix this test we changed the original number of available CPUs to start with 4 less CPUs on NUMA node 3, and 2 more CPUs on NUMA node 0, i.e.: Node 0: 18 Node 1: 20 Node 2: 20 Node 3: 16 So that we end up with a result of: Node 0: 2 Node 1: 4 Node 2: 4 Node 3: 16 Which pulls the CPUs from where we want and results in a standard deviation of 5.5452 For the second test, we simply reverse the number of CPUs available for Nodes 0 and 3 as: Node 0: 16 Node 1: 20 Node 2: 20 Node 3: 18 Which forces the allocation to happen just as it did for the first test, except now on NUMA nodes 1, 2, and 3 instead of NUMA nodes 0,1, and 2. Signed-off-by: Kevin Klues <kklues@nvidia.com>	2021-11-24 16:51:23 +00:00
Kevin Klues	4008ea0b4c	Fix bug in CPUManager map.Keys() and map.Values() implementations Previously these would return lists that were too long because we appended to pre-initialized lists with a specific size. Since the primary place these functions are used is in the mean and standard deviation calculations for the NUMA distribution algorithm, it meant that the results of these calculations were often incorrect. As a result, some of the unit tests we have are actually incorrect (because the results we expect do not actually produce the best balanced distribution of CPUs across all NUMA nodes for the input provided). These tests will be patched up in subsequent commits. Signed-off-by: Kevin Klues <kklues@nvidia.com>	2021-11-24 16:51:21 +00:00
Kevin Klues	446c58e0e7	Ensure we balance across all NUMA nodes in NUMA distribution algo Signed-off-by: Kevin Klues <kklues@nvidia.com>	2021-11-24 16:51:19 +00:00
Kevin Klues	c8559bc43e	Short-circuit CPUManager distribute NUMA algo for unusable cpuGroupSize Signed-off-by: Kevin Klues <kklues@nvidia.com>	2021-11-24 16:51:16 +00:00
Kevin Klues	b28c1392d7	Round the CPUManager mean and stddev calculations to the nearest 1000th Signed-off-by: Kevin Klues <kklues@nvidia.com>	2021-11-24 16:51:13 +00:00
ahrtr	b7f22801fe	add more info when failing to call PdhAddEnglishCounter	2021-11-24 13:49:34 +08:00
Kubernetes Prow Robot	ddfc53922c	Merge pull request #106414 from jonyhy96/kubelet-fix-flake kubelet: fix npe in test	2021-11-19 07:06:51 -08:00
haoyun	65ac99eef5	fix: npe in kubelet test Signed-off-by: haoyun <yun.hao@daocloud.io> Co-authored-by: Antonio Ojea <antonio.ojea.garcia@gmail.com>	2021-11-19 17:44:05 +08:00
shuheiktgw	2acdaeb361	Refactor Kubelet config validation tests	2021-11-18 22:38:01 +09:00
shuheiktgw	35ad91ab37	Refactor Kubelet config validations	2021-11-18 22:31:31 +09:00
Shivam Sandbhor	6652c54d83	Remove invalid comment in legacyregistry Signed-off-by: Shivam Sandbhor <shivam.sandbhor@gmail.com>	2021-11-18 15:05:00 +05:30
Kubernetes Prow Robot	d766ab88f7	Merge pull request #106501 from ehashman/cri-graduation-v1 Make CRI v1 the default and allow a fallback to v1alpha2	2021-11-17 19:57:01 -08:00
Kubernetes Prow Robot	91b7fb4dc9	Merge pull request #102915 from wzshiming/feat/graceful-shutdown-based-on-pod-priority Graceful Node Shutdown Based On Pod Priority	2021-11-17 18:45:03 -08:00
Kubernetes Prow Robot	321e22d365	Merge pull request #106505 from ehashman/revert-103980-dkc-metrics Revert "Bump DynamicKubeConfig metric deprecation to 1.23"	2021-11-17 16:55:03 -08:00
Kubernetes Prow Robot	e4952f32b7	Merge pull request #106463 from SergeyKanzhelev/grpcProbe Implement grpc probe action	2021-11-17 12:43:54 -08:00
Elana Hashman	b35c500541	Revert "Bump DynamicKubeConfig metric deprecation to 1.23"	2021-11-17 11:48:49 -08:00
Elana Hashman	31c4273f66	Add test for memory equivalence See https://github.com/kubernetes/kubernetes/pull/106006#issuecomment-971004230 Co-Authored-By: Jordan Liggitt <liggitt@google.com>	2021-11-17 11:07:09 -08:00
Sascha Grunert	de37b9d293	Make CRI `v1` the default and allow a fallback to `v1alpha2` This patch makes the CRI `v1` API the new project-wide default version. To allow backwards compatibility, a fallback to `v1alpha2` has been added as well. This fallback can be determined automatically by the kubelet. Signed-off-by: Sascha Grunert <sgrunert@redhat.com>	2021-11-17 11:05:05 -08:00
Sergey Kanzhelev	b7affcced1	implement :grpc probe action	2021-11-17 17:31:23 +00:00
Antonio Ojea	d126b14838	migrate nolint comments to golangci-lint	2021-11-17 13:58:53 +01:00
Hanna Lee	e78b3e8dfe	Use nolint directive instead of stopping ticker, per liggitt's suggestion	2021-11-17 08:56:57 +01:00
Hanna Lee	69d029bddb	Add syncTicker.Stop()	2021-11-17 08:56:57 +01:00
Hanna Lee	07a883d8e6	Remove //lint:ignore pragmas that aren't being used anymore	2021-11-17 08:56:54 +01:00
Hanna Lee	1fbf06f5ad	Use time.NewTicker instead of time.Tick to avoid leaking	2021-11-17 08:56:00 +01:00
Hanna Lee	0f3836dcc5	Ignore deprecation warnings with //nolint:staticcheck	2021-11-17 08:55:57 +01:00
Kubernetes Prow Robot	6c357f9996	Merge pull request #106041 from jonyhy96/volumemanager-reconciler-codefmt kubelet: extract multiple ignore errors validate logic to isExpectedError	2021-11-16 22:55:53 -08:00
Shiming Zhang	7a6f792ff3	Add validation for GracefulNodeShutdownBasedOnPodPriority Co-authored-by: Elana Hashman <ehashman@users.noreply.github.com>	2021-11-17 11:47:12 +08:00
Shiming Zhang	545313bdc7	Implement graceful shutdown based on Pod priority	2021-11-17 11:47:12 +08:00
Shiming Zhang	d82f606970	Add field for KubeletConfiguration and Regenerate	2021-11-17 11:47:12 +08:00
Kubernetes Prow Robot	1f6d5caa9a	Merge pull request #105437 from cmssczy/update-kubelet-configuration migrate --register-with-taints to KubeletConfiguration	2021-11-16 17:44:00 -08:00
menglong.qi	b886b9b108	fix: typo	2021-11-17 09:22:57 +08:00
Kubernetes Prow Robot	42d8b2f3b9	Merge pull request #106289 from CatherineF-dev/fix-metrics-AlreadyRegisteredError-in-unit-test Fix metrics AlreadyRegisteredError on TestRecordOperation and TestGetHistogramVecFromGatherer unit test	2021-11-16 16:36:15 -08:00
Kubernetes Prow Robot	6805e6ee41	Merge pull request #104722 from leiyiz/migration turning on the CSIMigrationGCE feature flag	2021-11-16 15:28:32 -08:00
Léiyì Zhang	275fdf0884	fixing unit test failures induced by turning on CSIMigrationGCE disable CSIMigrationGCE in some unit tests	2021-11-16 19:26:30 +00:00
CatherineF-dev	5646120fbb	Use Reset at first	2021-11-16 18:57:24 +00:00
haoyun	b5409adaeb	refactor: extract multiple ignore errors validate to ignoreError Signed-off-by: haoyun <yun.hao@daocloud.io>	2021-11-16 20:43:50 +08:00
caozhiyuan	bad4faf1b9	migrate --register-with-taints to KubeletConfiguration	2021-11-16 19:10:36 +08:00
Kubernetes Prow Robot	1d1d462d2f	Merge pull request #104287 from jsturtevant/windows-stats Reduce the number of expensive calls in the Windows stats queries for dockershim	2021-11-15 18:51:37 -08:00
Kubernetes Prow Robot	0473cab823	Merge pull request #103299 from wgahnagl/addPinned prevents garbage collection from removing pinned images	2021-11-15 18:51:25 -08:00
Kubernetes Prow Robot	39af75af30	Merge pull request #106201 from yxxhero/fea_106111 Add more msg when exec probe timeout	2021-11-15 17:51:37 -08:00
Kubernetes Prow Robot	463802765d	Merge pull request #104650 from yxxhero/initcontainer_oomkiil_as_a_failure fix init container oomkilled as a failure	2021-11-15 17:51:25 -08:00

9854 Commits