The logic has been updated to match the logic of the best-effort policy
except in two places:
1) The hint filtering frunction has been updated to allow "don't care"
hints encoded with a `nil` affinity mask, to pass through the filter in
addition to hints that have just a single NUMA bit set.
2) After calculating the `bestHint` we transform "don't care" affinities
encoded as having all NUMA bits set in their affinity masks into "don't
care" affinities encoded as `nil`.
- Initialize best Hint to TopologyHint{}
- Update checks.
- Move generic unit test case into policy specific tests and updated
expected outcome to reflect changes.
- Restructure function
- Remove bug fix for catching {nil true} - To be fixed in later commit
- Restore unit tests to original state for testing filterHints
This is to keep consistency with the other policies.
This change may be made across all policies in a future PR, but removing it
from the scope of this PR for now.
- Best Effort Policy: Return hint with nil affinity as opposed to
defaultAffinity when provider has no preference for NUMA affinty or no
possible NUMA affinities.
- Single NUMA Node Policy: Remove defaultHint from mergeProvidersHints.
Instead return appropriate TopologyHint where required.
- Update unit tests to reflect changes. Some test cases moved into
individual policy test functions due to differing returned affinties
per policy.
- Remove getHintMatch method.
- Replace with simplified versions of mergePermutation and
iterateAllProviderTopologyHints methods - as used in best-effort.
- Remove getHintMatch unit tests.
- Update filterHints test to reflect changes in previous commit.
- Some common test cases achieve differing expected results based on
policy due to independent merge strategies. These cases are moved into
individual policy based test functions.
- Only append valid preferred-true hints to filtered
- Return true if allResourceHints only consist of
nil-affinity/preferred-true hints: {nil true}, update defaultHint
preference accordingly.
Explanation taken from original commit:
- Change the current method of finding the best hint.
Instead of going over all permutations, sort the hints and find
the narrowest hint common to all resources.
- Break out early when merging to a preferred hint is not possible
- Remove need to pass policy and numaNodes as arguments
- Remove PolicySingleNUMANode special case check in policy_best_effort
- Add mergeProviderHints base to policy_single_numa_node for upcoming
commit
This check is redundant since we protect this call with a call to
`m.sourcesReady.AllReady()` earlier on. Moreover, having this check in
place means that we will leave some stale state around in cases where
there are actually no active pods in the system and this loop hasn't
cleaned them up yet. This can happen, for example, if a pod exits while
the kubelet is down for some reason. We see this exact case being
triggered in our e2e tests, where a test has been failing since October
when this change was first introduced.
This change is to prevent problems when we remove the V1->V2 migration
code in the future. Without this, the checksums of all checkpoints would
be hashed with the name CPUManagerCheckpointV2 embedded inside of them,
which is undesirable. We want the checkpoints to be hashed with the name
CPUManagerCheckpoint instead.
The updated CPUManager from PR #84462 implements logic to migrate the
CPUManager checkpoint file from an old format to a new one. To do so, it
defines the following types:
```
type CPUManagerCheckpoint = CPUManagerCheckpointV2
type CPUManagerCheckpointV1 struct { ... }
type CPUManagerCheckpointV2 struct { ... }
```
This replaces the old definition of just:
```
type CPUManagerCheckpoint struct { ... }
```
Code was put in place to ensure proper migration from checkpoints in V1
format to checkpoints in V2 format. However (and this is a big however),
all of the unit tests were performed on V1 checkpoints that were
generated using the type name `CPUManagerCheckpointV1` and not the
original type name of `CPUManagerCheckpoint`. As such, the checksum in
the checkpoint file uses the `CPUManagerCheckpointV1` type to calculate
its checksum and not the original type name of `CPUManagerCheckpoint`.
This causes problems in the real world since all pre-1.18 checkpoint
files will have been generated with the original type name of
`CPUManagerCheckpoint`. When verifying the checksum of the checkpoint
file across an upgrade to 1.18, the checksum is calculated assuming
a type name of `CPUManagerCheckpointV1` (which is incorrect) and the
file is seen to be corrupt.
This patch ensures that all V1 checksums are verified against a type
name of `CPUManagerCheckpoint` instead of ``CPUManagerCheckpointV1`.
It also locks the algorithm used to calculate the checksum in place,
since it wil never change in the future (for pre-1.18 checkpoint
files at least).
These information associatedd with these containers is used to migrate
the CPUManager state from it's old format to its new (i.e. keyed off of
podUID and containerName instead of containerID).
For now, we just pass 'nil' as the set of 'initialContainers' for
migrating from old state semantics to new ones. In a subsequent commit
will we pull this information from higher layers so that we can pass it
down at this stage properly.
Previously, the state was keyed off of containerID intead of podUID and
containerName. Unfortunately, this is no longer possible as we move to a
to model where we we allocate CPUs to containers at pod adit time rather
than container start time.
This patch is the first step towards full migration to the new
semantics. Only the unit tests in cpumanager/state are passing. In
subsequent commits we will update the CPUManager itself to use these new
semantics.
This patch also includes code to do migration from the old checkpoint format
to the new one, assuming the existence of a ContainerMap with the proper
mapping of (containerID)->(podUID, containerName). A subsequent commit
will update code in higher layers to make sure that this ContainerMap is
made available to this state logic.
This patch removes pkg/util/mount completely, and replaces it with the
mount package now located at k8s.io/utils/mount. The code found at
k8s.io/utils/mount was moved there from pkg/util/mount, so the code is
identical, just no longer in-tree to k/k.
This patch moves fake.go to mount_fake.go, and follows to principle of
always returning a discrete type rather than an Interface. All callers
of "FakeMounter" are changed to instead use "NewFakeMounter()". The
FakeMounter "Log" struct member is changed to not be exported, and
instead only access through a new "GetLog()" method.
cause by kubelet startup be interrupted on setting list of cgroups
In the 'cgroupManagerImpl.Exists' not check&recreate the hugetlb cgroup dir. Then setting the limits in non-exist cgroup dir will cause kubelet start failed.
Signed-off-by: bingshen.wbs <bingshen.wbs@alibaba-inc.com>
This ensures that we have the most up-to-date state when generating
topology hints for a container. Without this, it's possible that some
resources will be seen as allocated, when they are actually free.
This will become especially important as we move to a model where
exclusive CPUs are assigned at pod admission time rather than at pod
creation time.
Having this function will allow us to do garbage collection on these
CPUs anytime we are about to allocate CPUs to a new set of containers,
in addition to reclaiming state periodically in the reconcileState()
loop.
These changes make it so that a set of common test cases can be used for
all merge strategies, with specific test cases being able to be
specified on a policy-by-policy basis.
This is in preparation for removing the special-case of the
SingleNumaNode policy in mergeProvidersHints() in favor of a custom
merging strategy with much less overhead.
This abstraction moves the responsibility of merging topology hints to
the individual policies themselves. As part of this, it removes the
CanAdmitPodResult() API from the policy abstraction, and rolls it into a
second return value from Merge()
These policies only differ on whether they admit the pod or not when a
TopologyHint is preferred or not. As such, the restricted policy should
simply inherit whatever it can from the best effort policy and only
overwrite what is necessary.
This does not matter for now, but will become important when we add a
new 'Merge()' abstraction to a Policy later on.
This patch fixes an issue where best-effort pods were not admitted
to the node if the single-numa-node policy was set.
This was because the Admit policy in single-numa-node policy does
not admit any pod where the hint is anything but single NUMA node. The 'best hint' in this case is {<set bits for num. Numa Nodes on machine>, true}
So on a machine with 2 NUMA nodes the best hint for a best-effort pod is {11,true} as best-effort pods have no Topology preferences.
The single-numa-node policy fails any pod with a not preferred hint OR a hint where > 1 bits are set, thus the above example resulting in termintaed pods with a Topology Affinity Error.
This is a short term fix for the single-numa-node policy, as there will be code refactoring for the 1.17 release.
This patch fixes an issue in the TopologyManager that wouldn't allow
pods to be admitted if pods were launched with the SingleNUMANode policy
and any of the hint providers had no NUMA preferences.
This is due to 2 factors:
1) Any hint provider that passes back a `nil` as its hints, has its hint
automatically transformed into a single {11 true} hint before merging
2) We added a special casing for the SingleNumaNodePolicy() in the
TopologyManager that essentially turns these hints into a
{11 false} anytime a {11 true} is seen.
The current patch reworks this logic so the that TopologyManager can
tell the difference between a "don't care" hint and a true "{11 true}"
hint returned by the hint provider. Only true "{11 true}" hints will be
converted by the special casing for the SingleNumaNodePolicy(), while
"don't care" hints will not.
This is a short term fix for this issue until we do a larger refactoring
of this code for the 1.17 release.
- As discussed in reviews and other public channels,
this abstraction is used to represent numa nodes, not sockets.
- There is nothing inherently related to sockets in this package anyway.
Added one off fix for single-numa-node policy to correctly
reject pod admission on a resource allocation that spans
NUMA nodes
Co-authored-by: Kevin Klues <kklues@nvidia.com>
Previously it only took a bool, which limited the logic it could perform
to determine if a pod should be admitted or not based on the merged hint
from the policy.
As part of this, update the logic to use the NUMA information instead of
the Socket information when generating and consuming TopologyHints in
the CPUManager.
Unfortunately, the NUMA information is not readily available from
cadvisor, so we have to roll the logic to discover it by hand. In the
future, we should remove this custiom code to use the information
provided by cadvisor once it is made available.