Commit Graph

1061 Commits

Author SHA1 Message Date
Davanum Srinivas
497e9c1971
Cleanup OWNERS files (No Activity in the last year)
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2021-12-15 10:34:02 -05:00
Kubernetes Prow Robot
1d66302c42
Merge pull request #106458 from dims/lint-yaml-in-owners-files
Lint/Beautify yaml in OWNERS files
2021-12-10 06:39:12 -08:00
Davanum Srinivas
9405e9b55e
Check in OWNERS modified by update-yamlfmt.sh
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2021-12-09 21:31:26 -05:00
Kevin Klues
f8511877e2 Add regression test for CPUManager distribute NUMA algorithm
We witnessed this exact allocation attempt in a live cluster and witnessed the
algorithm fail with an accounting error. This test was added to verify that
this case is now handled by the updates to the algorithm and that we don't
regress from it in the future.

"test" description="ensure previous failure encountered on live machine has been fixed (1/1)"
"combo remainderSet balance" combo=[2 4 6] remainderSet=[2 4 6] distribution=9 remainder=1 available=[14 2 4 4 0 3 4 1] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[2 4] distribution=9 remainder=1 available=[0 3 4 1 14 2 4 4] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[2 6] distribution=9 remainder=1 available=[1 14 2 4 4 0 3 4] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[4 6] distribution=9 remainder=1 available=[1 3 4 0 14 2 4 4] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[2] distribution=9 remainder=1 available=[4 0 3 4 1 14 2 4] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[4] distribution=9 remainder=1 available=[3 4 0 14 2 4 4 1] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[6] distribution=9 remainder=1 available=[1 13 2 4 4 1 3 4] balance=3.606
"bestCombo found" distribution=9 bestCombo=[2 4 6] bestRemainder=[6]

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 20:49:58 +00:00
Kevin Klues
e284c74d93 Add unit test for CPUManager distribute NUMA algorithm verifying fixes
Before Change:
"test" description="ensure bestRemainder chosen with NUMA nodes that have enough CPUs to satisfy the request"
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[0 1] distribution=8 remainder=2 available=[-1 -1 0 6] balance=2.915
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[0 2] distribution=8 remainder=2 available=[-1 0 -1 6] balance=2.915
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[0 3] distribution=8 remainder=2 available=[5 -1 0 0] balance=2.345
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[1 2] distribution=8 remainder=2 available=[0 -1 -1 6] balance=2.915
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[1 3] distribution=8 remainder=2 available=[0 -1 0 5] balance=2.345
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[2 3] distribution=8 remainder=2 available=[0 0 -1 5] balance=2.345
"bestCombo found" distribution=8 bestCombo=[0 1 2 3] bestRemainder=[0 3]

--- FAIL: TestTakeByTopologyNUMADistributed (0.01s)
    --- FAIL: TestTakeByTopologyNUMADistributed/ensure_bestRemainder_chosen_with_NUMA_nodes_that_have_enough_CPUs_to_satisfy_the_request (0.00s)
        cpu_assignment_test.go:867: unexpected error [accounting error, not enough CPUs allocated, remaining: 1]

After Change:
"test" description="ensure bestRemainder chosen with NUMA nodes that have enough CPUs to satisfy the request"
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[3] distribution=8 remainder=2 available=[0 0 0 4] balance=1.732
"bestCombo found" distribution=8 bestCombo=[0 1 2 3] bestRemainder=[3]

SUCCESS

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 20:45:37 +00:00
Kevin Klues
031f11513d Fix accounting bug in CPUManager distribute NUMA policy
Without this fix, the algorithm may decide to allocate "remainder" CPUs from a
NUMA node that has no more CPUs to allocate. Moreover, it was only considering
allocation of remainder CPUs from NUMA nodes such that each NUMA node in the
remainderSet could only allocate 1 (i.e. 'cpuGroupSize') more CPUs. With these
two issues in play, one could end up with an accounting error where not enough
CPUs were allocated by the time the algorithm runs to completion.

The updated algorithm will now omit any NUMA nodes that have 0 CPUs left from
the set of NUMA nodes considered for allocating remainder CPUs. Additionally,
we now consider *all* combinations of nodes from the remainder set of size
1..len(remainderSet). This allows us to find a better solution if allocating
CPUs from a smaller set leads to a more balanced allocation. Finally, we loop
through all NUMA nodes 1-by-1 in the remainderSet until all rmeainer CPUs have
been accounted for and allocated. This ensure that we will not hit an
accounting error later on because we explicitly remove CPUs from the remainder
set until there are none left.

A follow-on commit adds a set of unit tests that will fail before these
changes, but succeeds after them.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 19:18:11 +00:00
Kevin Klues
5317a2e2ac Fix error handling in CPUManager distribute NUMA tests
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:31 +00:00
Kevin Klues
dc4430b663 Add a sum() helper to the CPUManager cpuassignment logic
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:29 +00:00
Kevin Klues
cfacc22459 Allow the map.Values() function in the CPUManager to take a set of keys
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:28 +00:00
Kevin Klues
a160d9a8cd Fix CPUManager algo to calculate min NUMA nodes needed for distribution
Previously the algorithm was too restrictive because it tried to calculate the
minimum based on the number of *available* NUMA nodes and the number of
*available* CPUs on those NUMA nodes. Since there was no (easy) way to tell how
many CPUs an individual NUMA node happened to have, the average across them was
used. Using this value however, could result in thinking you need more NUMA
nodes to possibly satisfy a request than you actually do.

By using the *total* number of NUMA nodes and CPUs per NUMA node, we can get
the true minimum number of nodes required to satisfy a request. For a given
"current" allocation this may not be the true minimum, but its better to start
with fewer and move up than to start with too many and miss out on a better
option.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:26 +00:00
Kevin Klues
209cd20548 Fix unit tests following bug fix in CPUManager for map functions (2/2)
Now that the algorithm for balancing CPU distributions across NUMA nodes is
correct, this test actually behaves differently for the "packed" vs.
"distributed" allocation algorithms (as it should).

In the "packed" case we need to ensure that CPUs are allocated such that they
are packed onto cores. Since one CPU is already allocated from a core on NUMA
node 0, we want the next CPU to be its hyperthreaded pair (even though the
first available CPU id is on Socket 1).

In the "distributed" case, however, we want to ensure CPUs are allocated such
that we have an balanced distribution of CPUs across all NUMA nodes. This
points to allocating from Socket 1 if the only other CPU allocated has been
done on Socket 0.

To allow CPUs allocations to be packed onto full cores, one can allocate them
from the "distributed" algorithm with a 'cpuGroupSize' equal to the number of
hypthreads per core (in this case 2). We added an explicit test case for this,
demonstrating that we get the same result as the "packed" algorithm does, even
though the "distributed" algorithm is in use.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:24 +00:00
Kevin Klues
67f719cb1d Fix unit tests following bug fix in CPUManager for map functions (1/2)
This fixes two related tests to better test our "balanced" distribution algorithm.

The first test originally provided an input with the following number of CPUs
available on each NUMA node:

Node 0: 16
Node 1: 20
Node 2: 20
Node 3: 20

It then attempted to distribute 48 CPUs across them with an expectation that
each of the first 3 NUMA nodes would have 16 CPUs taken from them (leaving Node
0 with no more CPUs in the end).

This would have resulted in the following amount of CPUs on each node:

Node 0: 0
Node 1: 4
Node 2: 4
Node 3: 20

Which results in a standard deviation of 7.6811

However, a more balanced solution would actually be to pull 16 CPUs from NUMA
nodes 1, 2, and 3, and leave 0 untouched, i.e.:

Node 0: 16
Node 1: 4
Node 2: 4
Node 3: 4

Which results in a standard deviation of 5.1961524227066

To fix this test we changed the original number of available CPUs to start with
4 less CPUs on NUMA node 3, and 2 more CPUs on NUMA node 0, i.e.:

Node 0: 18
Node 1: 20
Node 2: 20
Node 3: 16

So that we end up with a result of:

Node 0: 2
Node 1: 4
Node 2: 4
Node 3: 16

Which pulls the CPUs from where we want and results in a standard deviation of 5.5452

For the second test, we simply reverse the number of CPUs available for Nodes 0
and 3 as:

Node 0: 16
Node 1: 20
Node 2: 20
Node 3: 18

Which forces the allocation to happen just as it did for the first test, except
now on NUMA nodes 1, 2, and 3 instead of NUMA nodes 0,1, and 2.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:23 +00:00
Kevin Klues
4008ea0b4c Fix bug in CPUManager map.Keys() and map.Values() implementations
Previously these would return lists that were too long because we appended to
pre-initialized lists with a specific size.

Since the primary place these functions are used is in the mean and standard
deviation calculations for the NUMA distribution algorithm, it meant that the
results of these calculations were often incorrect.

As a result, some of the unit tests we have are actually incorrect (because the
results we expect do not actually produce the best balanced
distribution of CPUs across all NUMA nodes for the input provided).

These tests will be patched up in subsequent commits.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:21 +00:00
Kevin Klues
446c58e0e7 Ensure we balance across *all* NUMA nodes in NUMA distribution algo
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:19 +00:00
Kevin Klues
c8559bc43e Short-circuit CPUManager distribute NUMA algo for unusable cpuGroupSize
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:16 +00:00
Kevin Klues
b28c1392d7 Round the CPUManager mean and stddev calculations to the nearest 1000th
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:13 +00:00
Sascha Grunert
de37b9d293
Make CRI v1 the default and allow a fallback to v1alpha2
This patch makes the CRI `v1` API the new project-wide default version.
To allow backwards compatibility, a fallback to `v1alpha2` has been added
as well. This fallback can either used by automatically determined by
the kubelet.

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2021-11-17 11:05:05 -08:00
Antonio Ojea
d126b14838 migrate nolint coments to golangci-lint 2021-11-17 13:58:53 +01:00
Neha Lohia
fa1b6765d5
move pkg/util/node to component-helpers/node/util (#105347)
Signed-off-by: Neha Lohia <nehapithadiya444@gmail.com>
2021-11-12 07:52:27 -08:00
Kubernetes Prow Robot
3ca3daac76
Merge pull request #103415 from tiloso/staticcheck-kubelet
Fix staticcheck failure in pkg/kubelet/cm/cpuset
2021-11-11 15:15:13 -08:00
Kubernetes Prow Robot
359b722c19
Merge pull request #102882 from fromanirh/device-manager-checkpoints
devicemanager: checkpoint: support pre-1.20 data
2021-11-02 16:56:57 -07:00
Kubernetes Prow Robot
08bf54678e
Merge pull request #101909 from nolancon/cpu-mgr-testing
Additional cases for reconcileState testing
2021-10-30 00:01:17 -07:00
Francesco Romani
2f426fdba6 devicemanager: checkpoint: support pre-1.20 data
The commit a8b8995ef2
changed the content of the data kubelet writes in the checkpoint.
Unfortunately, the checkpoint restore code was not updated,
so if we upgrade kubelet from pre-1.20 to 1.20+, the
device manager cannot anymore restore its state correctly.

The only trace of this misbehaviour is this line in the
kubelet logs:
```
W0615 07:31:49.744770    4852 manager.go:244] Continue after failing to read checkpoint file. Device allocation info may NOT be up-to-date. Err: json: cannot unmarshal array into Go struct field PodDevicesEntry.Data.PodDeviceEntries.DeviceIDs of type checkpoint.DevicesPerNUMA
```

If we hit this bug, the device allocation info is
indeed NOT up-to-date up until the device plugins register
themselves again. This can take up to few minutes, depending
on the specific device plugin.

While the device manager state is inconsistent:
1. the kubelet will NOT update the device availability to zero, so
   the scheduler will send pods towards the inconsistent kubelet.
2. at pod admission time, the device manager allocation will not
   trigger, so pods will be admitted without devices actually
   being allocated to them.

To fix these issues, we add support to the device manager to
read pre-1.20 checkpoint data. We retroactively call this
format "v1".

Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-10-26 09:54:11 +02:00
Kevin Klues
86f9c266bc Add optimizations to reduce iterations in distributed NUMA algorithm
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-10-18 08:53:25 +00:00
Kevin Klues
70e0f47191 Support full-pcpus-only with the new NUMA distribution policy option
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-10-16 19:31:02 +00:00
Kevin Klues
d54445a84d Generalize the NUMA distribution algorithm to take cpuGroupSize
This parameter ensures that CPUs are always allocated in groups of size
'cpuGroupSize'. This is important, for example, to ensure that all CPUs (i.e.
hyperthreads) from the same core are handed out together.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-10-16 19:31:02 +00:00
Kevin Klues
1436e33642 Add more extensive testing for NUMA distribution algorithm in CPUManager
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-10-16 19:31:02 +00:00
Kevin Klues
cf3afb8602 Add 2 distinguishing test cases between the 2 takeByTopology algorithms
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-10-16 19:31:02 +00:00
Kevin Klues
eb78e2406b Add a new TestTakeByTopologyNUMADistributed() test to the CPUManager
As part of this, pull out all of the existing "TakeByTopology" tests and have
them be called by the original TestTakeByTopologyNUMAPacked() as well as the
new TestTakeByTopologyNUMADistributed() test. In a subsequent commit, we will
add some tests that should differ between these two algorithms.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-10-16 19:31:02 +00:00
Kevin Klues
876dd9b078 Added algorithm to CPUManager to distribute CPUs across NUMA nodes
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-10-16 19:31:02 +00:00
Kevin Klues
462544d079 Split CPUManager takeByTopology() into two different algorithms
The first implements the original algorithm which packs CPUs onto NUMA nodes if
more than one NUMA node is required to satisfy the allocation. The second
disitributes CPUs across NUMA nodes if they can't all fit into one.

The "distributing" algorithm is currently a noop and just returns an error of
"unimplemented". A subsequent commit will add the logic to implement this
algorithm according to KEP 2902:

https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2902-cpumanager-distribute-cpus-policy-option

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-10-16 14:46:19 +00:00
Kevin Klues
0e7928edce Add new CPUManager policy option for "distribute-cpus-across-numa"
This commit only adds the option to the policy options framework. A
subsequent commit will add the logic to utilize it.

The KEP describing this new option can be found here:
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2902-cpumanager-distribute-cpus-policy-option

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-10-16 14:46:19 +00:00
Francesco Romani
4bae656835 cpumanager: test NUMA node support for CPU assign (2)
This batch of tests adds a fake topology on which each numa node
has multiple sockets. We didn't find yet a real HW topology in the wild
like this, but we need one to fully exercise the code.

So, until we find a HW topology, we add a fake one flipping
the NUMA/socket config of the existing xeon dual gold 6320.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-10-15 10:29:21 +00:00
Francesco Romani
547996f3f6 cpumanager: test NUMA node support for CPU assign (1)
This batch of tests adds a real topology on which each physical socket
has multiple NUMA zones. Taken by a real dual xeon 6320 gold.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-10-15 10:29:21 +00:00
Francesco Romani
f6ccc4426a cpumanager: test: use proper subtests
The exisiting unit tests where performing subtests without
actually using the full features of the testing package
(https://pkg.go.dev/testing#hdr-Subtests_and_Sub_benchmarks)

Update them with fairly minimal changes. The patch is deceptively
large because we need to move the code inside a new block.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-10-15 10:29:21 +00:00
Francesco Romani
15caa134b2 cpumanager: topology: use rich cmp package
User the `cmp.Diff` package in the unit tests, moving away from
`reflect.DeepEqual`. This gives us a clearer picture of the differences
when the tests fail.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-10-15 10:29:21 +00:00
Kevin Klues
aff54a0914 Abstract out whether NUMA or Sockets come first in the memory hierarchy
This allows us to get rid of the check for determining which one is higher all
throughout the code. Now we just check once and instantiate an interface of the
appropriate type that makes sure the ordering in the hierarchy is preserved
through the appropriate calls.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-10-15 10:29:15 +00:00
Kevin Klues
17c7e86c6d Add NUMA support to the CPU assignment algorithm in the CPUManager
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-10-15 08:35:59 +00:00
nolancon
6bbb36df10 Additional cases for reconcileState testing 2021-10-11 16:17:21 +00:00
Kubernetes Prow Robot
63f66e6c99
Merge pull request #105012 from fromanirh/cpumanager-policy-options-beta
node: graduate CPUManagerPolicyOptions to beta
2021-10-08 07:32:59 -07:00
Alexey Perevalov
5d9032007a Return only isolated cpus in podresources interface
Co-Authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>
2021-10-07 15:34:08 +01:00
Kubernetes Prow Robot
c4d802b0b5
Merge pull request #103289 from AlexeyPerevalov/DoNotExportEmptyTopology
podresources: do not export empty NUMA topology
2021-10-07 07:11:46 -07:00
Kubernetes Prow Robot
c91f9bdc60
Merge pull request #104689 from cynepco3hahue/memory_manager_restricted_policy_fix
kubelet: memory manager: fix preferred topology hints calculation
2021-10-05 06:47:08 -07:00
Kubernetes Prow Robot
883250145c
Merge pull request #104788 from 249043822/memorymanager-br
Fix initContainersReusableMemory delete bug in MemoryManager
2021-10-01 05:27:22 -07:00
Francesco Romani
077c0aa1be node: graduate CPUManagerPolicyOptions to beta
We graduate the `CPUManagerPolicyOptions` feature to beta
in the 1.23 cycle, and we add new experimental feature gates
to guard new options which are planned in the 1.23 and in the
following cycles.

We introduce additional feature gate called `CPUManagerPolicyAlphaOptions` and
`CPUManagerPolicyBetaOptions`. The basic idea is to avoid the
cumbersome process of adding a feature gate for each option, and to have
feature gates which track the maturity level of _groups_ of options.
Besides this change, the graduation process, and the process in general,
for adding new policy options is still unchanged.

The `full-pcpus-only` option added in the 1.22 cycle is intentionally
moved into the beta policy options

For more details:
- KEP: https://github.com/kubernetes/enhancements/pull/2933
- sig-arch discussion:
  https://groups.google.com/u/1/g/kubernetes-sig-architecture/c/Nxsc7pfe5rw

Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-09-29 11:40:03 +02:00
Kubernetes Prow Robot
2541fcf256
Merge pull request #104123 from fromanirh/podresources-not-report-unhealthy-devices
devicemanager: skip unhealthy devices in GetAllocatable
2021-09-23 05:39:21 -07:00
Francesco Romani
1b6efa5e21 devicemanager: skip unhealthy devs in GetAllocatable
The GetAllocatableDevices, needed to support the podresources
API, doesn't take into account the device health when computing
its output.

In this PR we address this gap and add unit tests along the way
to prevent regressions. This gives us a good initial coverage,
E2E tests to cover this case are much harder to write, because
we would need to inject faults to trigger the unhealthy status.
We will evaluate if adding these tests into later PRs.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-09-22 19:20:04 +02:00
Ricardo Pchevuzinske Katz
37d11bcdaf Move node and networking related helpers from pkg/util to component helpers
Signed-off-by: Ricardo Katz <rkatz@vmware.com>
2021-09-16 17:00:19 -03:00
KeZhang
a629ceeb58 Fix initContainersReusableMemory delete bug 2021-09-15 10:04:49 +08:00
Shiming Zhang
7706d3d281 pkg/kubelet/cm/memorymanager: Fix ErrorS key/value pair 2021-09-06 17:37:04 +08:00