Previously, callers of `Exists()` could not know why the cgroup did or
did not exist. At one call site in particular, the `kubelet` would
fail to start entirely if the cgroup validation did not succeed. In
these cases we MUST explain what went wrong and pass that information
clearly to the caller. Previously, some but not all of the reasons for
invalidation were logged at a low log level instead, which led to poor
UX.
The original method was retained on the interface so as to make this
diff small.
Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
For the 'single-numa-node' and 'restricted' TopologyManager policies, pods are only
admitted if all of their containers have perfect alignment across the set of
resources they are requesting. The best-effort policy, on the other hand, will
prefer allocations that have perfect alignment, but fall back to a non-preferred
alignment if perfect alignment can't be achieved.
The existing algorithm of how to choose the best hint from the set of
"non-preferred" hints is fairly naive and often results in choosing a
sub-optimal hint. It works fine in cases where all resources would end up
coming from a single NUMA node (even if it's not the same NUMA node for each
resource), but
breaks down as soon as multiple NUMA nodes are required for the "best"
alignment. We will never be able to achieve perfect alignment with these
non-preferred hints, but we should try and do something more intelligent than
simply choosing the hint with the narrowest mask.
In an ideal world, we would have the TopologyManager return a set of
"resources-relative" hints (as opposed to a common hint for all resources as is
done today). Each resource-relative hint would indicate how many other
resources could be aligned to it on a given NUMA node, and a hint provider
would use this information to allocate its resources in the most aligned way
possible. There are likely some edge cases to consider here, but such an
algorithm would allow us to do partial-perfect-alignment of "some" resources,
even if all resources could not be perfectly aligned.
Unfortunately, supporting something like this would require a major redesign to
how the TopologyManager interacts with its hint providers (as well as how those
hint providers make decisions based on the hints they get back).
That said, we can still do better than the naive algorithm we have today, and
this patch provides a mechanism to do so.
We start by looking at the set of hints passed into the TopologyManager for
each resource and generate a list of the minimum number of NUMA nodes required
to satisfy an allocation for a given resource. Each entry in this list then
contains the 'minNUMAAffinity.Count()' for a given resource. Once we have this
list, we find the *maximum* 'minNUMAAffinity.Count()' from the list and mark
that as the 'bestNonPreferredAffinityCount' that we would like to have
associated with whatever "bestHint" we ultimately generate. The intuition is
that we would like to (at the very least) get alignment for those resources
that *require* multiple NUMA nodes to satisfy their allocation. If we can't
quite get there, then we should try to come as close to it as possible.
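As a rough sketch of that derivation (the function name and the simplified
per-resource representation below are assumptions for illustration, not the
actual TopologyManager code):
```
// For each resource we take the narrowest hint width (its minimum Count()),
// then the target width is the maximum of those minimums across resources.
func bestNonPreferredAffinityCount(hintCountsPerResource [][]int) int {
	target := 0
	for _, counts := range hintCountsPerResource {
		minCount := 0
		for i, c := range counts {
			if i == 0 || c < minCount {
				minCount = c
			}
		}
		// The target is driven by the resource that requires the most NUMA nodes.
		if minCount > target {
			target = minCount
		}
	}
	return target
}
```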
Once we have this 'bestNonPreferredAffinityCount', the algorithm proceeds as
follows:
If the mergedHint and bestHint are both non-preferred, then try and find a hint
whose affinity count is as close to (but not higher than) the
bestNonPreferredAffinityCount as possible. To do this we need to consider the
following cases and react accordingly:
1. bestHint.NUMANodeAffinity.Count() > bestNonPreferredAffinityCount
2. bestHint.NUMANodeAffinity.Count() == bestNonPreferredAffinityCount
3. bestHint.NUMANodeAffinity.Count() < bestNonPreferredAffinityCount
For case (1), the current bestHint is larger than the
bestNonPreferredAffinityCount, so updating to any narrower mergedHint is
preferred over staying where we are.
For case (2), the current bestHint is equal to the
bestNonPreferredAffinityCount, so we would like to stick with what we have
*unless* the current mergedHint is also equal to bestNonPreferredAffinityCount
and it is narrower.
For case (3), the current bestHint is less than bestNonPreferredAffinityCount,
so we would like to creep back up to bestNonPreferredAffinityCount as close as
we can. There are three cases to consider here:
3a. mergedHint.NUMANodeAffinity.Count() > bestNonPreferredAffinityCount
3b. mergedHint.NUMANodeAffinity.Count() == bestNonPreferredAffinityCount
3c. mergedHint.NUMANodeAffinity.Count() < bestNonPreferredAffinityCount
For case (3a), we just want to stick with the current bestHint because choosing
a new hint that is greater than bestNonPreferredAffinityCount would be
counter-productive.
For case (3b), we want to immediately update bestHint to the current
mergedHint, making it now equal to bestNonPreferredAffinityCount.
For case (3c), we know that *both* the current bestHint and the current
mergedHint are less than bestNonPreferredAffinityCount, so we want to choose
one that brings us back up as close to bestNonPreferredAffinityCount as
possible. There are three cases to consider here:
3ca. mergedHint.NUMANodeAffinity.Count() > bestHint.NUMANodeAffinity.Count()
3cb. mergedHint.NUMANodeAffinity.Count() < bestHint.NUMANodeAffinity.Count()
3cc. mergedHint.NUMANodeAffinity.Count() == bestHint.NUMANodeAffinity.Count()
For case (3ca), we want to immediately update bestHint to mergedHint because
that will bring us closer to the (higher) value of
bestNonPreferredAffinityCount.
For case (3cb), we want to stick with the current bestHint because choosing the
current mergedHint would strictly move us further away from the
bestNonPreferredAffinityCount.
Finally, for case (3cc), we know that the current bestHint and the current
mergedHint are equal, so we simply choose the narrower of the 2.
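The case analysis above maps fairly directly onto code. The following is only
an illustrative sketch (the 'hint' type, 'narrower' helper, and function name
are stand-ins I am assuming here, not the real TopologyHint/bitmask API):
```
// hint is a simplified stand-in for a TopologyHint: count models
// NUMANodeAffinity.Count() and mask stands in for the affinity bitmask.
type hint struct {
	count int
	mask  uint64
}

// narrower reports whether a is narrower than b: fewer NUMA nodes wins, with
// ties broken by the lower mask value (an assumption about tie-breaking).
func narrower(a, b hint) bool {
	if a.count != b.count {
		return a.count < b.count
	}
	return a.mask < b.mask
}

// chooseNonPreferred picks between the current bestHint and a candidate
// mergedHint when both are non-preferred, given the target width
// bestNonPreferredAffinityCount.
func chooseNonPreferred(best, merged hint, target int) hint {
	switch {
	case best.count > target: // case (1): any narrower merged hint is an improvement
		if merged.count < best.count {
			return merged
		}
	case best.count == target: // case (2): keep best unless merged also hits the target and is narrower
		if merged.count == target && narrower(merged, best) {
			return merged
		}
	default: // case (3): best.count < target, creep back up toward the target
		switch {
		case merged.count > target: // (3a): keep best
		case merged.count == target: // (3b): merged hits the target exactly
			return merged
		default: // (3c): both below the target
			switch {
			case merged.count > best.count: // (3ca): merged is closer to the target
				return merged
			case merged.count < best.count: // (3cb): keep best
			default: // (3cc): equal widths, take the narrower mask
				if narrower(merged, best) {
					return merged
				}
			}
		}
	}
	return best
}
```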
This patch implements this algorithm for the case where we must choose from a
set of non-preferred hints and provides a set of unit-tests to verify its
correctness.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
The package says:
> the libcontainer SELinux package is only built for Linux, so it is
> necessary to have a NOP wrapper which is built for non-Linux platforms
This is no longer true: Kubernetes now imports
github.com/opencontainers/selinux/go-selinux, which has proper
multiplatform support (i.e. it is a NOOP on non-Linux platforms).
Remove the whole package and call go-selinux directly.
Before this fix, hint permutations such as:
permutation: [{11 true} {0101 true}]
Could result in merged hints of:
mergedHint: {01 true}
This was possible because both hints in the permutation contain a "preferred"
allocation (i.e. the full set of NUMA nodes set in the affinity bitmask is
*required* to satisfy the allocation). Given this, the simplified logic we had
simply kept the merged hint as preferred as well.
However, what we really want is to ensure that the merged hint is only
preferred if *true* alignment of all resources is possible (i.e. if all hints
in the permutation are preferred AND their affinities are exactly equal).
The only exception to this is if *no* topology information is provided by a
given hint provider. In this case, we assume alignment doesn't matter and only
consider the resources that actually have hints provided for them.
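A minimal sketch of this rule (the types and the "zero affinity means no
topology information" encoding are simplifying assumptions, not how the
kubelet actually represents hints):
```
// hint is a simplified stand-in for a TopologyHint.
type hint struct {
	affinity  uint64 // NUMA affinity bitmask; 0 means "no topology information"
	preferred bool
}

// mergedPreferred reports whether a merged hint may be marked preferred: every
// hint that actually carries topology information must itself be preferred and
// all of their affinities must be exactly equal.
func mergedPreferred(permutation []hint) bool {
	seen := false
	var first uint64
	for _, h := range permutation {
		if h.affinity == 0 {
			continue // no topology information provided; ignore for alignment
		}
		if !h.preferred {
			return false
		}
		if !seen {
			first, seen = h.affinity, true
			continue
		}
		if h.affinity != first {
			return false
		}
	}
	return true
}
```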
This changes the semantics of permutations of the form:
permutation: [{111 true} {011 true}]
To now result in the merged hint of:
mergedHint: {011 false}
Instead of:
mergedHint: {011 true}
This is arguably how it should always have been though (because a hint should
not be preferred if true alignment isn't possible), and two tests have had to
change to accommodate these new semantics.
This commit changes the merge function to implement the updated logic, adds a
test to verify it is functioning correctly, and updates the two tests mentioned
above to adjust to the new semantics.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Without this fix, the algorithm may decide to allocate "remainder" CPUs from a
NUMA node that has no more CPUs to allocate. Moreover, it only considered
allocations of remainder CPUs in which each NUMA node in the remainderSet
contributed at most one group of 'cpuGroupSize' CPUs. With these
two issues in play, one could end up with an accounting error where not enough
CPUs were allocated by the time the algorithm runs to completion.
The updated algorithm will now omit any NUMA nodes that have 0 CPUs left from
the set of NUMA nodes considered for allocating remainder CPUs. Additionally,
we now consider *all* combinations of nodes from the remainder set of size
1..len(remainderSet). This allows us to find a better solution if allocating
CPUs from a smaller set leads to a more balanced allocation. Finally, we loop
through all NUMA nodes 1-by-1 in the remainderSet until all remainder CPUs have
been accounted for and allocated. This ensures that we will not hit an
accounting error later on because we explicitly remove CPUs from the remainder
set until there are none left.
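To illustrate the "all combinations of sizes 1..len(remainderSet)" part, a
generic helper along these lines is enough (an illustrative sketch, not the
actual allocator code):
```
// combinations returns every size-k combination of the given NUMA node IDs.
func combinations(nodes []int, k int) [][]int {
	if k == 0 {
		return [][]int{{}}
	}
	if len(nodes) < k {
		return nil
	}
	var out [][]int
	// Combinations that include nodes[0].
	for _, rest := range combinations(nodes[1:], k-1) {
		out = append(out, append([]int{nodes[0]}, rest...))
	}
	// Combinations that exclude nodes[0].
	return append(out, combinations(nodes[1:], k)...)
}

// candidateRemainderSets enumerates the candidate remainder sets the updated
// algorithm evaluates for balance: every combination of the remainder NUMA
// nodes (those with CPUs left) of every size from 1 up to the full set.
func candidateRemainderSets(remainder []int) [][]int {
	var out [][]int
	for k := 1; k <= len(remainder); k++ {
		out = append(out, combinations(remainder, k)...)
	}
	return out
}
```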
A follow-on commit adds a set of unit tests that will fail before these
changes, but succeed after them.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Previously the algorithm was too restrictive because it tried to calculate the
minimum based on the number of *available* NUMA nodes and the number of
*available* CPUs on those NUMA nodes. Since there was no (easy) way to tell how
many CPUs an individual NUMA node happened to have, the average across them was
used. Using this value, however, could result in thinking you need more NUMA
nodes to satisfy a request than you actually do.
By using the *total* number of NUMA nodes and CPUs per NUMA node, we can get
the true minimum number of nodes required to satisfy a request. For a given
"current" allocation this may not be the true minimum, but its better to start
with fewer and move up than to start with too many and miss out on a better
option.
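In other words, the lower bound becomes a simple ceiling division over the
machine's topology (a sketch with assumed names):
```
// minNUMANodesNeeded returns the minimum number of NUMA nodes that could ever
// satisfy a request of numCPUs, given the machine's CPUs per NUMA node.
func minNUMANodesNeeded(numCPUs, cpusPerNUMANode int) int {
	return (numCPUs + cpusPerNUMANode - 1) / cpusPerNUMANode // ceiling division
}
```
For example, 48 CPUs on a machine with 20 CPUs per NUMA node needs at least 3
nodes, regardless of how many CPUs happen to be free on each node right now.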
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Now that the algorithm for balancing CPU distributions across NUMA nodes is
correct, this test actually behaves differently for the "packed" vs.
"distributed" allocation algorithms (as it should).
In the "packed" case we need to ensure that CPUs are allocated such that they
are packed onto cores. Since one CPU is already allocated from a core on NUMA
node 0, we want the next CPU to be its hyperthreaded pair (even though the
first available CPU id is on Socket 1).
In the "distributed" case, however, we want to ensure CPUs are allocated such
that we have an balanced distribution of CPUs across all NUMA nodes. This
points to allocating from Socket 1 if the only other CPU allocated has been
done on Socket 0.
To allow CPU allocations to be packed onto full cores, one can allocate them
from the "distributed" algorithm with a 'cpuGroupSize' equal to the number of
hyperthreads per core (in this case 2). We added an explicit test case for this,
demonstrating that we get the same result as the "packed" algorithm does, even
though the "distributed" algorithm is in use.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
This fixes two related tests to better test our "balanced" distribution algorithm.
The first test originally provided an input with the following number of CPUs
available on each NUMA node:
Node 0: 16
Node 1: 20
Node 2: 20
Node 3: 20
It then attempted to distribute 48 CPUs across them with an expectation that
each of the first 3 NUMA nodes would have 16 CPUs taken from them (leaving Node
0 with no more CPUs in the end).
This would have resulted in the following number of CPUs remaining on each node:
Node 0: 0
Node 1: 4
Node 2: 4
Node 3: 20
Which results in a standard deviation of 7.6811.
However, a more balanced solution would actually be to pull 16 CPUs from NUMA
nodes 1, 2, and 3, and leave 0 untouched, i.e.:
Node 0: 16
Node 1: 4
Node 2: 4
Node 3: 4
Which results in a standard deviation of 5.1961524227066.
To fix this test we changed the original number of available CPUs to start with
4 fewer CPUs on NUMA node 3, and 2 more CPUs on NUMA node 0, i.e.:
Node 0: 18
Node 1: 20
Node 2: 20
Node 3: 16
So that we end up with a result of:
Node 0: 2
Node 1: 4
Node 2: 4
Node 3: 16
Which pulls the CPUs from where we want and results in a standard deviation of 5.5452.
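For reference, the standard deviation figures above are the population standard
deviation of the CPUs remaining on each NUMA node, e.g.:
```
package main

import (
	"fmt"
	"math"
)

// stddev computes the population standard deviation of its inputs.
func stddev(xs []float64) float64 {
	mean := 0.0
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	variance := 0.0
	for _, x := range xs {
		variance += (x - mean) * (x - mean)
	}
	return math.Sqrt(variance / float64(len(xs)))
}

func main() {
	fmt.Println(stddev([]float64{0, 4, 4, 20})) // ≈ 7.6811
	fmt.Println(stddev([]float64{16, 4, 4, 4})) // ≈ 5.1962
	fmt.Println(stddev([]float64{2, 4, 4, 16})) // ≈ 5.5452
}
```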
For the second test, we simply reverse the number of CPUs available for Nodes 0
and 3 as:
Node 0: 16
Node 1: 20
Node 2: 20
Node 3: 18
Which forces the allocation to happen just as it did for the first test, except
now on NUMA nodes 1, 2, and 3 instead of NUMA nodes 0, 1, and 2.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Previously these would return lists that were too long because we appended to
slices that were pre-initialized with a specific length (instead of just a capacity).
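This is the classic Go slice pitfall; a toy example (not the actual functions):
```
vals := make([]float64, 4) // length 4 already, filled with zeros
for _, v := range []float64{16, 4, 4, 20} {
	vals = append(vals, v) // grows to length 8: [0 0 0 0 16 4 4 20]
}

// The fix is to pre-allocate only the capacity:
vals = make([]float64, 0, 4) // length 0, capacity 4
for _, v := range []float64{16, 4, 4, 20} {
	vals = append(vals, v) // length 4: [16 4 4 20]
}
```
The extra zeros silently drag down any mean or standard deviation computed from
the result.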
Since the primary place these functions are used is in the mean and standard
deviation calculations for the NUMA distribution algorithm, it meant that the
results of these calculations were often incorrect.
As a result, some of the unit tests we have are actually incorrect (because the
results we expect do not actually produce the best balanced
distribution of CPUs across all NUMA nodes for the input provided).
These tests will be patched up in subsequent commits.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
This patch makes the CRI `v1` API the new project-wide default version.
To allow backwards compatibility, a fallback to `v1alpha2` has been added
as well. This fallback is determined automatically by the kubelet.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
The commit a8b8995ef2
changed the content of the data the kubelet writes in the checkpoint.
Unfortunately, the checkpoint restore code was not updated,
so if we upgrade the kubelet from pre-1.20 to 1.20+, the
device manager can no longer restore its state correctly.
The only trace of this misbehaviour is this line in the
kubelet logs:
```
W0615 07:31:49.744770 4852 manager.go:244] Continue after failing to read checkpoint file. Device allocation info may NOT be up-to-date. Err: json: cannot unmarshal array into Go struct field PodDevicesEntry.Data.PodDeviceEntries.DeviceIDs of type checkpoint.DevicesPerNUMA
```
If we hit this bug, the device allocation info is
indeed NOT up-to-date until the device plugins register
themselves again. This can take up to a few minutes, depending
on the specific device plugin.
While the device manager state is inconsistent:
1. the kubelet will NOT update the device availability to zero, so
the scheduler will send pods towards the inconsistent kubelet.
2. at pod admission time, the device manager allocation will not
trigger, so pods will be admitted without devices actually
being allocated to them.
To fix these issues, we add support to the device manager to
read pre-1.20 checkpoint data. We retroactively call this
format "v1".
Signed-off-by: Francesco Romani <fromani@redhat.com>
This parameter ensures that CPUs are always allocated in groups of size
'cpuGroupSize'. This is important, for example, to ensure that all CPUs (i.e.
hyperthreads) from the same core are handed out together.
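As a toy illustration of what that guarantee means (not the kubelet code), with
'cpuGroupSize' equal to 2 the allocator only ever takes both siblings of a core
together:
```
// takeInGroups hands out CPU IDs whole-group-at-a-time. With groups being the
// sibling threads of each core and cpuGroupSize=2, a core is never split.
func takeInGroups(groups [][]int, numCPUs, cpuGroupSize int) []int {
	var taken []int
	for _, group := range groups { // e.g. [[0 4] [1 5] [2 6] [3 7]]
		if len(taken)+cpuGroupSize > numCPUs {
			break
		}
		if len(group) == cpuGroupSize {
			taken = append(taken, group...)
		}
	}
	return taken
}
```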
Signed-off-by: Kevin Klues <kklues@nvidia.com>
As part of this, pull out all of the existing "TakeByTopology" tests and have
them be called by the original TestTakeByTopologyNUMAPacked() as well as the
new TestTakeByTopologyNUMADistributed() test. In a subsequent commit, we will
add some tests that should differ between these two algorithms.
Signed-off-by: Kevin Klues <kklues@nvidia.com>