kubernetes/pkg/kubelet
Kevin Klues 99c57828ce Update TopologyManager algorithm for selecting "best" non-preferred hint
For the 'single-numa-node' and 'restricted' TopologyManager policies, pods are only
admitted if all of their containers have perfect alignment across the set of
resources they are requesting. The best-effort policy, on the other hand, will
prefer allocations that have perfect alignment, but fall back to a non-preferred
alignment if perfect alignment can't be achieved.

The existing algorithm for choosing the best hint from the set of
"non-preferred" hints is fairly naive and often results in choosing a
sub-optimal hint. It works fine in cases where all resources would end up
coming from a single NUMA node (even if it's not the same NUMA node for every
resource), but breaks down as soon as multiple NUMA nodes are required for the
"best" alignment. We will never be able to achieve perfect alignment with these
non-preferred hints, but we should try to do something more intelligent than
simply choosing the hint with the narrowest mask.

In an ideal world, we would have the TopologyManager return a set of
"resource-relative" hints (as opposed to a common hint for all resources as is
done today). Each resource-relative hint would indicate how many other
resources could be aligned to it on a given NUMA node, and a hint provider
would use this information to allocate its resources in the most aligned way
possible. There are likely some edge cases to consider here, but such an
algorithm would allow us to do partial-perfect-alignment of "some" resources,
even when not all resources can be perfectly aligned.

Unfortunately, supporting something like this would require a major redesign to
how the TopologyManager interacts with its hint providers (as well as how those
hint providers make decisions based on the hints they get back).

That said, we can still do better than the naive algorithm we have today, and
this patch provides a mechanism to do so.

We start by looking at the set of hints passed into the TopologyManager for
each resource and generate a list of the minimum number of NUMA nodes required
to satisfy an allocation of each resource. Each entry in this list contains
the 'minNUMAAffinity.Count()' for a given resource. Once we have this list, we
find the *maximum* 'minNUMAAffinity.Count()' across the list and mark that as
the 'bestNonPreferredAffinityCount' that we would like to have associated with
whatever "bestHint" we ultimately generate. The intuition is that we would
like to (at the very least) get alignment for those resources that *require*
multiple NUMA nodes to satisfy their allocation. If we can't quite get there,
then we should try to come as close to it as possible.

Once we have this 'bestNonPreferredAffinityCount', the algorithm proceeds as
follows:

If the mergedHint and bestHint are both non-preferred, then try to find a hint
whose affinity count is as close to (but not higher than) the
bestNonPreferredAffinityCount as possible. To do this we need to consider the
following cases and react accordingly:

  1. bestHint.NUMANodeAffinity.Count() >  bestNonPreferredAffinityCount
  2. bestHint.NUMANodeAffinity.Count() == bestNonPreferredAffinityCount
  3. bestHint.NUMANodeAffinity.Count() <  bestNonPreferredAffinityCount

For case (1), the current bestHint's affinity count is larger than
bestNonPreferredAffinityCount, so updating to any narrower mergedHint is
preferred over staying where we are.

For case (2), the current bestHint's affinity count is equal to
bestNonPreferredAffinityCount, so we would like to stick with what we have
*unless* the current mergedHint's count is also equal to
bestNonPreferredAffinityCount and its mask is narrower.

For case (3), the current bestHint's affinity count is less than
bestNonPreferredAffinityCount, so we would like to creep back up as close to
bestNonPreferredAffinityCount as we can. There are three cases to consider here:

  3a. mergedHint.NUMANodeAffinity.Count() >  bestNonPreferredAffinityCount
  3b. mergedHint.NUMANodeAffinity.Count() == bestNonPreferredAffinityCount
  3c. mergedHint.NUMANodeAffinity.Count() <  bestNonPreferredAffinityCount

For case (3a), we just want to stick with the current bestHint because choosing
a new hint that is greater than bestNonPreferredAffinityCount would be
counter-productive.

For case (3b), we want to immediately update bestHint to the current
mergedHint, making it now equal to bestNonPreferredAffinityCount.

For case (3c), we know that *both* the current bestHint and the current
mergedHint have affinity counts less than bestNonPreferredAffinityCount, so we
want to choose the one that brings us back up as close to
bestNonPreferredAffinityCount as possible. There are three cases to consider
here:

  3ca. mergedHint.NUMANodeAffinity.Count() >  bestHint.NUMANodeAffinity.Count()
  3cb. mergedHint.NUMANodeAffinity.Count() <  bestHint.NUMANodeAffinity.Count()
  3cc. mergedHint.NUMANodeAffinity.Count() == bestHint.NUMANodeAffinity.Count()

For case (3ca), we want to immediately update bestHint to mergedHint because
that will bring us closer to the (higher) value of
bestNonPreferredAffinityCount.

For case (3cb), we want to stick with the current bestHint because choosing the
current mergedHint would strictly move us further away from the
bestNonPreferredAffinityCount.

Finally, for case (3cc), we know that the current bestHint and the current
mergedHint have equal affinity counts, so we simply choose the narrower of the
two.
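
Putting the cases above together, the non-preferred vs. non-preferred
comparison can be sketched roughly as follows. This is an illustrative sketch
only: it assumes an IsNarrowerThan() comparison on the NUMANodeAffinity
bitmask (fewer NUMA nodes wins, with ties broken towards lower-numbered
nodes), and it omits the handling of preferred or unset hints that the full
implementation also needs:

  // compareNonPreferredHints returns whichever of 'current' (the bestHint so
  // far) or 'candidate' (the current mergedHint) should be kept, given that
  // both are non-preferred.
  func compareNonPreferredHints(bestNonPreferredAffinityCount int, current, candidate TopologyHint) TopologyHint {
      currentCount := current.NUMANodeAffinity.Count()
      candidateCount := candidate.NUMANodeAffinity.Count()

      // Case (1): bestHint is wider than bestNonPreferredAffinityCount, so
      // any narrower candidate is an improvement.
      if currentCount > bestNonPreferredAffinityCount {
          if candidate.NUMANodeAffinity.IsNarrowerThan(current.NUMANodeAffinity) {
              return candidate
          }
          return current
      }

      // Case (2): bestHint already matches bestNonPreferredAffinityCount;
      // only switch if the candidate matches it too and is narrower.
      if currentCount == bestNonPreferredAffinityCount {
          if candidateCount == bestNonPreferredAffinityCount &&
              candidate.NUMANodeAffinity.IsNarrowerThan(current.NUMANodeAffinity) {
              return candidate
          }
          return current
      }

      // Case (3): bestHint is below bestNonPreferredAffinityCount, so try to
      // creep back up towards it.
      switch {
      case candidateCount > bestNonPreferredAffinityCount: // (3a)
          return current
      case candidateCount == bestNonPreferredAffinityCount: // (3b)
          return candidate
      case candidateCount > currentCount: // (3ca)
          return candidate
      case candidateCount < currentCount: // (3cb)
          return current
      default: // (3cc) equal counts: keep the narrower of the two
          if candidate.NUMANodeAffinity.IsNarrowerThan(current.NUMANodeAffinity) {
              return candidate
          }
          return current
      }
  }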

This patch implements the algorithm above for the case where we must choose
from a set of non-preferred hints and adds a set of unit tests to verify its
correctness.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2022-03-01 14:38:26 +00:00
apis remove DynamicKubeletConfig logic from kubelet 2022-01-19 22:38:04 +00:00
cadvisor Clean up logic for deprecated flag --container-runtime in kubelet 2022-02-10 13:26:59 +02:00
certificate Check in OWNERS modified by update-yamlfmt.sh 2021-12-09 21:31:26 -05:00
checkpointmanager remove fakefs to drop spf13/afero dependency 2021-06-24 09:51:34 -04:00
client hack/update-bazel.sh 2021-02-28 15:17:29 -08:00
cloudresource Apply suggestions from code review 2021-03-05 23:59:23 +05:30
cm Update TopologyManager algorithm for selecting "best" non-preferred hint 2022-03-01 14:38:26 +00:00
config Clean up logic for deprecated flag --container-runtime in kubelet 2022-02-10 13:26:59 +02:00
configmap Migrate to k8s.io/utils/clock in pkg/kubelet 2021-09-10 12:20:09 +02:00
container Clean up logic for deprecated flag --container-runtime in kubelet 2022-02-10 13:26:59 +02:00
cri Add support for CRI verbose fields 2022-02-10 17:12:26 +01:00
custommetrics hack/update-bazel.sh 2021-02-28 15:17:29 -08:00
envvars hack/update-bazel.sh 2021-02-28 15:17:29 -08:00
events kubelet: add shutdown events 2021-06-23 16:44:19 -05:00
eviction avoid klog Info calls without verbosity 2022-01-12 07:48:36 +01:00
images Remove no-longer used selflink code from kubelet 2022-01-14 10:38:23 +01:00
kubeletconfig remove DynamicKubeletConfig logic from kubelet 2022-01-19 22:38:04 +00:00
kuberuntime Merge pull request #107945 from saschagrunert/cri-verbose 2022-02-14 17:58:12 -08:00
leaky hack/update-bazel.sh 2021-02-28 15:17:29 -08:00
lifecycle Merge pull request #103934 from boenn/tainttoleration 2022-02-09 16:53:46 -08:00
logs Add support for CRI verbose fields 2022-02-10 17:12:26 +01:00
metrics remove DynamicKubeletConfig logic from kubelet 2022-01-19 22:38:04 +00:00
network Change the name of the constant 2021-12-14 22:42:57 +09:00
nodeshutdown fix: data race when hijack klog 2022-01-24 15:01:49 +08:00
nodestatus parse ipv6 address before comparison (#107736) 2022-01-26 18:38:49 -08:00
oom generated: Run hack/update-gofmt.sh 2021-08-24 15:47:49 -04:00
pleg avoid klog Info calls without verbosity 2022-01-12 07:48:36 +01:00
pluginmanager Cleanup OWNERS files (No Activity in the last year) 2021-12-15 10:34:02 -05:00
pod Move kubelet secret and configmap manager calls to sync_Pod functions 2022-01-27 10:09:13 -05:00
preemption migrated preemption.go, stateful.go, resource_allocation.go to structured logging 2021-11-08 22:52:47 +05:30
prober Check in OWNERS modified by update-yamlfmt.sh 2021-12-09 21:31:26 -05:00
qos Only system-node-critical pods should be OOM Killed last 2021-03-03 16:34:27 -05:00
runtimeclass hack/update-bazel.sh 2021-02-28 15:17:29 -08:00
secret Migrate to k8s.io/utils/clock in pkg/kubelet 2021-09-10 12:20:09 +02:00
server Merge pull request #106458 from dims/lint-yaml-in-owners-files 2021-12-10 06:39:12 -08:00
stats Clean up logic for deprecated flag --container-runtime in kubelet 2022-02-10 13:26:59 +02:00
status avoid klog Info calls without verbosity 2022-01-12 07:48:36 +01:00
sysctl Upgrade preparation to verify sysctl values containing forward slashes by regex 2021-11-04 11:49:56 +08:00
token Check in OWNERS modified by update-yamlfmt.sh 2021-12-09 21:31:26 -05:00
types Clean up dockershim flags in the kubelet 2022-01-14 16:02:50 +02:00
util Include pod UID in secret/configmap cache key 2022-01-27 22:21:52 -05:00
volumemanager Remove verult from OWNERS files 2022-02-10 18:25:38 -08:00
winstats add more info when failing to call PdhAddEnglishCounter 2021-11-24 13:49:34 +08:00
active_deadline_test.go Migrate to k8s.io/utils/clock in pkg/kubelet 2021-09-10 12:20:09 +02:00
active_deadline.go Migrate to k8s.io/utils/clock in pkg/kubelet 2021-09-10 12:20:09 +02:00
doc.go
errors.go
kubelet_getters_test.go
kubelet_getters.go pkg/kubelet: improve the node informer sync check 2021-04-21 22:46:27 +03:00
kubelet_network_linux.go move IPv6DualStack feature to stable. (#104691) 2021-09-24 16:30:22 -07:00
kubelet_network_others.go generated: Run hack/update-gofmt.sh 2021-08-24 15:47:49 -04:00
kubelet_network_test.go generated: Run hack/update-gofmt.sh 2021-08-24 15:47:49 -04:00
kubelet_network.go Make CRI v1 the default and allow a fallback to v1alpha2 2021-11-17 11:05:05 -08:00
kubelet_node_status_others.go generated: Run hack/update-gofmt.sh 2021-08-24 15:47:49 -04:00
kubelet_node_status_test.go [kubelet]: Sync label periodically 2021-11-05 18:47:43 -04:00
kubelet_node_status_windows.go generated: Run hack/update-gofmt.sh 2021-08-24 15:47:49 -04:00
kubelet_node_status.go migrate --register-with-taints to KubeletConfiguration 2021-11-16 19:10:36 +08:00
kubelet_pods_linux_test.go Clean up logic for deprecated flag --container-runtime in kubelet 2022-02-10 13:26:59 +02:00
kubelet_pods_test.go ignore CRI PodSandboxNetworkStatus for host network pods 2022-02-04 18:41:57 +01:00
kubelet_pods_windows_test.go Clean up logic for deprecated flag --container-runtime in kubelet 2022-02-10 13:26:59 +02:00
kubelet_pods.go Clean up logic for deprecated flag --container-runtime in kubelet 2022-02-10 13:26:59 +02:00
kubelet_resources_test.go
kubelet_resources.go Migrate pkg/kubelet/kubeletconfig to Structured Logging 2021-03-15 15:42:34 -07:00
kubelet_test.go Clean up logic for deprecated flag --container-runtime in kubelet 2022-02-10 13:26:59 +02:00
kubelet_volumes_linux_test.go generated: Run hack/update-gofmt.sh 2021-08-24 15:47:49 -04:00
kubelet_volumes_test.go fixing unit test failures induced by turning on CSIMigrationGCE 2021-11-16 19:26:30 +00:00
kubelet_volumes.go Keep pod worker running until pod is truly complete 2021-07-06 15:55:22 -04:00
kubelet.go Merge pull request #108070 from jsafrane/remove-selinux 2022-02-11 18:19:47 -08:00
OWNERS Check in OWNERS modified by update-yamlfmt.sh 2021-12-09 21:31:26 -05:00
pod_container_deletor_test.go
pod_container_deletor.go Structured Logging migration: modify volume and container part logs of kubelet. 2021-03-17 08:59:03 +08:00
pod_workers_test.go kubelet: Clean up a static pod that has been terminated before starting 2022-02-02 16:05:32 -05:00
pod_workers.go kubelet: Clean up a static pod that has been terminated before starting 2022-02-02 16:05:32 -05:00
reason_cache_test.go
reason_cache.go
runonce_test.go Merge pull request #104933 from vikramcse/automate_mockery 2021-09-30 18:33:21 -07:00
runonce.go Keep pod worker running until pod is truly complete 2021-07-06 15:55:22 -04:00
runtime.go
time_cache_test.go
time_cache.go
volume_host.go use node informer to check volumes attachment status before backoff 2021-12-20 11:57:05 -05:00