Commit Graph

44631 Commits

Author SHA1 Message Date
Kevin Klues
dc4430b663 Add a sum() helper to the CPUManager cpuassignment logic
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:29 +00:00
Kevin Klues
cfacc22459 Allow the map.Values() function in the CPUManager to take a set of keys
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:28 +00:00
Kevin Klues
a160d9a8cd Fix CPUManager algo to calculate min NUMA nodes needed for distribution
Previously the algorithm was too restrictive because it tried to calculate the
minimum based on the number of *available* NUMA nodes and the number of
*available* CPUs on those NUMA nodes. Since there was no (easy) way to tell how
many CPUs an individual NUMA node happened to have, the average across them was
used. Using this value however, could result in thinking you need more NUMA
nodes to possibly satisfy a request than you actually do.

By using the *total* number of NUMA nodes and CPUs per NUMA node, we can get
the true minimum number of nodes required to satisfy a request. For a given
"current" allocation this may not be the true minimum, but its better to start
with fewer and move up than to start with too many and miss out on a better
option.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:26 +00:00
Kevin Klues
209cd20548 Fix unit tests following bug fix in CPUManager for map functions (2/2)
Now that the algorithm for balancing CPU distributions across NUMA nodes is
correct, this test actually behaves differently for the "packed" vs.
"distributed" allocation algorithms (as it should).

In the "packed" case we need to ensure that CPUs are allocated such that they
are packed onto cores. Since one CPU is already allocated from a core on NUMA
node 0, we want the next CPU to be its hyperthreaded pair (even though the
first available CPU id is on Socket 1).

In the "distributed" case, however, we want to ensure CPUs are allocated such
that we have an balanced distribution of CPUs across all NUMA nodes. This
points to allocating from Socket 1 if the only other CPU allocated has been
done on Socket 0.

To allow CPUs allocations to be packed onto full cores, one can allocate them
from the "distributed" algorithm with a 'cpuGroupSize' equal to the number of
hypthreads per core (in this case 2). We added an explicit test case for this,
demonstrating that we get the same result as the "packed" algorithm does, even
though the "distributed" algorithm is in use.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:24 +00:00
Kevin Klues
67f719cb1d Fix unit tests following bug fix in CPUManager for map functions (1/2)
This fixes two related tests to better test our "balanced" distribution algorithm.

The first test originally provided an input with the following number of CPUs
available on each NUMA node:

Node 0: 16
Node 1: 20
Node 2: 20
Node 3: 20

It then attempted to distribute 48 CPUs across them with an expectation that
each of the first 3 NUMA nodes would have 16 CPUs taken from them (leaving Node
0 with no more CPUs in the end).

This would have resulted in the following amount of CPUs on each node:

Node 0: 0
Node 1: 4
Node 2: 4
Node 3: 20

Which results in a standard deviation of 7.6811

However, a more balanced solution would actually be to pull 16 CPUs from NUMA
nodes 1, 2, and 3, and leave 0 untouched, i.e.:

Node 0: 16
Node 1: 4
Node 2: 4
Node 3: 4

Which results in a standard deviation of 5.1961524227066

To fix this test we changed the original number of available CPUs to start with
4 less CPUs on NUMA node 3, and 2 more CPUs on NUMA node 0, i.e.:

Node 0: 18
Node 1: 20
Node 2: 20
Node 3: 16

So that we end up with a result of:

Node 0: 2
Node 1: 4
Node 2: 4
Node 3: 16

Which pulls the CPUs from where we want and results in a standard deviation of 5.5452

For the second test, we simply reverse the number of CPUs available for Nodes 0
and 3 as:

Node 0: 16
Node 1: 20
Node 2: 20
Node 3: 18

Which forces the allocation to happen just as it did for the first test, except
now on NUMA nodes 1, 2, and 3 instead of NUMA nodes 0,1, and 2.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:23 +00:00
Kevin Klues
4008ea0b4c Fix bug in CPUManager map.Keys() and map.Values() implementations
Previously these would return lists that were too long because we appended to
pre-initialized lists with a specific size.

Since the primary place these functions are used is in the mean and standard
deviation calculations for the NUMA distribution algorithm, it meant that the
results of these calculations were often incorrect.

As a result, some of the unit tests we have are actually incorrect (because the
results we expect do not actually produce the best balanced
distribution of CPUs across all NUMA nodes for the input provided).

These tests will be patched up in subsequent commits.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:21 +00:00
Kevin Klues
446c58e0e7 Ensure we balance across *all* NUMA nodes in NUMA distribution algo
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:19 +00:00
Kevin Klues
c8559bc43e Short-circuit CPUManager distribute NUMA algo for unusable cpuGroupSize
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:16 +00:00
Kevin Klues
b28c1392d7 Round the CPUManager mean and stddev calculations to the nearest 1000th
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:13 +00:00
Dave Chen
8609358975 Graduate PreferNominatedNode to GA
Signed-off-by: Dave Chen <dave.chen@arm.com>
2021-11-24 14:50:53 +08:00
ahrtr
b7f22801fe add more info when failing to call PdhAddEnglishCounter 2021-11-24 13:49:34 +08:00
Kubernetes Prow Robot
a5622f3f6e Merge pull request #106616 from mattcary/pvc-race
Clean up deep copy needed for UpdateStatefulSet
2021-11-23 09:38:17 -08:00
Matthew Cary
0e2b901762 Clean up deep copy needed for UpdateStatefulSet
Change-Id: Id732358183d682d1a945cfee56f83bcaac0d7c31
2021-11-23 06:48:54 -08:00
Kubernetes Prow Robot
f572e4d5b4 Merge pull request #106518 from SergeyKanzhelev/tryProbeFix
Fix the bug with GRPC probe
2021-11-22 15:38:54 -08:00
Aohan Yang
ad4fe13528 fix the error when cleaning up jobs for cronjob 2021-11-22 17:06:22 +08:00
Amim Knabben
8b37bfec8e Enabling kube-proxy metrics on windows kernel mode 2021-11-21 21:23:55 -03:00
Kubernetes Prow Robot
084b28f6d5 Merge pull request #106510 from robscott/topology-ready-fix-controller
Updating TopologyCache to disregard unready endpoints in calculations
2021-11-19 17:07:11 -08:00
Kubernetes Prow Robot
37ae94f9ed Merge pull request #106507 from robscott/topology-ready-fix
Updating kube-proxy to ignore unready endpoints for Topology Hints
2021-11-19 17:06:59 -08:00
Sergey Kanzhelev
f390d49e24 fix the grpc probes 2021-11-20 00:23:53 +00:00
Kubernetes Prow Robot
8f9dd0a14c Merge pull request #105916 from kevindelgado/validation-unify-all
Server Side Strict Field Validation
2021-11-19 14:27:22 -08:00
Kevin Delgado
e50e2bbc88 Server Side Field Validation
Implements server side field validation behind the
`ServerSideFieldValidation` feature gate. With the
feature enabled, any create/update/patch request
with the `fieldValidation` query param set to
"Strict" will error if the object in the request
body have unknown fields. A value of "Warn"
(also the default when the feautre is enabled)
will succeed the request with a warning.

When the feature is disabled (or the query param
has a value of "Ignore"), the request will succeed
as it previously had with no indications of any
unknown or duplicate fields.
2021-11-19 21:24:36 +00:00
Kubernetes Prow Robot
ddfc53922c Merge pull request #106414 from jonyhy96/kubelet-fix-flake
kubelet: fix npe in test
2021-11-19 07:06:51 -08:00
haoyun
65ac99eef5 fix: npe in kubelet test
Signed-off-by: haoyun <yun.hao@daocloud.io>
Co-authored-by: Antonio Ojea <antonio.ojea.garcia@gmail.com>
2021-11-19 17:44:05 +08:00
Rob Scott
1983f41065 Updating kube-proxy to ignore unready endpoints for Topology Hints 2021-11-18 14:04:44 -08:00
Rob Scott
9813ec7e8a Updating TopologyCache to disregard unready endpoints in calculations 2021-11-18 13:54:09 -08:00
shuheiktgw
2acdaeb361 Refactor Kubelet config validation tests 2021-11-18 22:38:01 +09:00
shuheiktgw
35ad91ab37 Refactor Kubelet config validations 2021-11-18 22:31:31 +09:00
Kubernetes Prow Robot
b8af116327 Merge pull request #99728 from mattcary/control
StatefulSet PVC auto-delete implementation
2021-11-18 03:37:02 -08:00
Shivam Sandbhor
6652c54d83 Remove invalid comment in legacyregistry
Signed-off-by: Shivam Sandbhor <shivam.sandbhor@gmail.com>
2021-11-18 15:05:00 +05:30
Kubernetes Prow Robot
d766ab88f7 Merge pull request #106501 from ehashman/cri-graduation-v1
Make CRI v1 the default and allow a fallback to v1alpha2
2021-11-17 19:57:01 -08:00
Kubernetes Prow Robot
91b7fb4dc9 Merge pull request #102915 from wzshiming/feat/graceful-shutdown-based-on-pod-priority
Graceful Node Shutdown Based On Pod Priority
2021-11-17 18:45:03 -08:00
hyschumi
7ad629c864 rm makeNodeWithExtendedResource method in noderesources unit test && doc: update least_allocated strategy detail doc 2021-11-18 09:15:26 +08:00
Kubernetes Prow Robot
321e22d365 Merge pull request #106505 from ehashman/revert-103980-dkc-metrics
Revert "Bump DynamicKubeConfig metric deprecation to 1.23"
2021-11-17 16:55:03 -08:00
Matthew Cary
53b3a6c1d9 controller change for statefulset auto-delete (tests)
Change-Id: I16b50e6853bba65fc89c793d2b9b335581c02407
2021-11-17 16:48:50 -08:00
Matthew Cary
bce87a3e4f controller change for statefulset auto-delete (implementation) 2021-11-17 16:48:50 -08:00
Matthew Cary
f1d5d4df5a tests for statefulset PersistentVolumeClaimDeletePolicy api change 2021-11-17 16:46:48 -08:00
Matthew Cary
98c37f9e2a statefulset PersistentVolumeClaimDeletePolicy api change 2021-11-17 16:46:47 -08:00
Matthew Cary
0f5ffe6cf8 Add StatefulSetAutoDeletePVC feature gate 2021-11-17 15:35:49 -08:00
Kubernetes Prow Robot
e4952f32b7 Merge pull request #106463 from SergeyKanzhelev/grpcProbe
Implement grpc probe action
2021-11-17 12:43:54 -08:00
Elana Hashman
b35c500541 Revert "Bump DynamicKubeConfig metric deprecation to 1.23" 2021-11-17 11:48:49 -08:00
Elana Hashman
31c4273f66 Add test for memory equivalence
See https://github.com/kubernetes/kubernetes/pull/106006#issuecomment-971004230

Co-Authored-By: Jordan Liggitt <liggitt@google.com>
2021-11-17 11:07:09 -08:00
Sascha Grunert
de37b9d293 Make CRI v1 the default and allow a fallback to v1alpha2
This patch makes the CRI `v1` API the new project-wide default version.
To allow backwards compatibility, a fallback to `v1alpha2` has been added
as well. This fallback can either used by automatically determined by
the kubelet.

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2021-11-17 11:05:05 -08:00
Sergey Kanzhelev
b7affcced1 implement :grpc probe action 2021-11-17 17:31:23 +00:00
Antonio Ojea
d126b14838 migrate nolint coments to golangci-lint 2021-11-17 13:58:53 +01:00
Hanna Lee
e78b3e8dfe Use nolint directive instead of stopping ticker, per liggit's suggestion 2021-11-17 08:56:57 +01:00
Hanna Lee
69d029bddb Add syncTicker.Stop() 2021-11-17 08:56:57 +01:00
Hanna Lee
07a883d8e6 Remove //lint:ignore pragmas that aren't being used anymore 2021-11-17 08:56:54 +01:00
Hanna Lee
c8fde197f5 Add more //nolint:staticcheck for failures caught in PR tests 2021-11-17 08:56:02 +01:00
Hanna Lee
1fbf06f5ad Use time.NewTicker instead of time.Tick to avoid leaking 2021-11-17 08:56:00 +01:00
Hanna Lee
30ea05ae7b Update IPVar and IPPortVar functions to have pointer receivers to fix 'ineffective assignment' 2021-11-17 08:56:00 +01:00