Previously, the state was keyed off of containerID intead of podUID and
containerName. Unfortunately, this is no longer possible as we move to a
to model where we we allocate CPUs to containers at pod adit time rather
than container start time.
This patch is the first step towards full migration to the new
semantics. Only the unit tests in cpumanager/state are passing. In
subsequent commits we will update the CPUManager itself to use these new
semantics.
This patch also includes code to do migration from the old checkpoint format
to the new one, assuming the existence of a ContainerMap with the proper
mapping of (containerID)->(podUID, containerName). A subsequent commit
will update code in higher layers to make sure that this ContainerMap is
made available to this state logic.
This patch removes pkg/util/mount completely, and replaces it with the
mount package now located at k8s.io/utils/mount. The code found at
k8s.io/utils/mount was moved there from pkg/util/mount, so the code is
identical, just no longer in-tree to k/k.
This patch moves fake.go to mount_fake.go, and follows to principle of
always returning a discrete type rather than an Interface. All callers
of "FakeMounter" are changed to instead use "NewFakeMounter()". The
FakeMounter "Log" struct member is changed to not be exported, and
instead only access through a new "GetLog()" method.
cause by kubelet startup be interrupted on setting list of cgroups
In the 'cgroupManagerImpl.Exists' not check&recreate the hugetlb cgroup dir. Then setting the limits in non-exist cgroup dir will cause kubelet start failed.
Signed-off-by: bingshen.wbs <bingshen.wbs@alibaba-inc.com>
This ensures that we have the most up-to-date state when generating
topology hints for a container. Without this, it's possible that some
resources will be seen as allocated, when they are actually free.
This will become especially important as we move to a model where
exclusive CPUs are assigned at pod admission time rather than at pod
creation time.
Having this function will allow us to do garbage collection on these
CPUs anytime we are about to allocate CPUs to a new set of containers,
in addition to reclaiming state periodically in the reconcileState()
loop.
These changes make it so that a set of common test cases can be used for
all merge strategies, with specific test cases being able to be
specified on a policy-by-policy basis.
This is in preparation for removing the special-case of the
SingleNumaNode policy in mergeProvidersHints() in favor of a custom
merging strategy with much less overhead.
This abstraction moves the responsibility of merging topology hints to
the individual policies themselves. As part of this, it removes the
CanAdmitPodResult() API from the policy abstraction, and rolls it into a
second return value from Merge()
These policies only differ on whether they admit the pod or not when a
TopologyHint is preferred or not. As such, the restricted policy should
simply inherit whatever it can from the best effort policy and only
overwrite what is necessary.
This does not matter for now, but will become important when we add a
new 'Merge()' abstraction to a Policy later on.
This patch fixes an issue where best-effort pods were not admitted
to the node if the single-numa-node policy was set.
This was because the Admit policy in single-numa-node policy does
not admit any pod where the hint is anything but single NUMA node. The 'best hint' in this case is {<set bits for num. Numa Nodes on machine>, true}
So on a machine with 2 NUMA nodes the best hint for a best-effort pod is {11,true} as best-effort pods have no Topology preferences.
The single-numa-node policy fails any pod with a not preferred hint OR a hint where > 1 bits are set, thus the above example resulting in termintaed pods with a Topology Affinity Error.
This is a short term fix for the single-numa-node policy, as there will be code refactoring for the 1.17 release.
This patch fixes an issue in the TopologyManager that wouldn't allow
pods to be admitted if pods were launched with the SingleNUMANode policy
and any of the hint providers had no NUMA preferences.
This is due to 2 factors:
1) Any hint provider that passes back a `nil` as its hints, has its hint
automatically transformed into a single {11 true} hint before merging
2) We added a special casing for the SingleNumaNodePolicy() in the
TopologyManager that essentially turns these hints into a
{11 false} anytime a {11 true} is seen.
The current patch reworks this logic so the that TopologyManager can
tell the difference between a "don't care" hint and a true "{11 true}"
hint returned by the hint provider. Only true "{11 true}" hints will be
converted by the special casing for the SingleNumaNodePolicy(), while
"don't care" hints will not.
This is a short term fix for this issue until we do a larger refactoring
of this code for the 1.17 release.
- As discussed in reviews and other public channels,
this abstraction is used to represent numa nodes, not sockets.
- There is nothing inherently related to sockets in this package anyway.
Added one off fix for single-numa-node policy to correctly
reject pod admission on a resource allocation that spans
NUMA nodes
Co-authored-by: Kevin Klues <kklues@nvidia.com>
Previously it only took a bool, which limited the logic it could perform
to determine if a pod should be admitted or not based on the merged hint
from the policy.
As part of this, update the logic to use the NUMA information instead of
the Socket information when generating and consuming TopologyHints in
the CPUManager.
Unfortunately, the NUMA information is not readily available from
cadvisor, so we have to roll the logic to discover it by hand. In the
future, we should remove this custiom code to use the information
provided by cadvisor once it is made available.
At present, there is no way for a hint provider to return distinct hints
for different resource types via a call to GetTopologyHints(). This
means that hint providers that govern multiple resource types (e.g. the
devicemanager) must do some sort of "pre-merge" on the hints it
generates for each resource type before passing them back to the
TopologyManager.
This patch changes the GetTopologyHints() interface to allow a hint
provider to pass back raw hints for each resource type, and allow the
TopologyManager to merge them using a single unified strategy.
This change also allows the TopologyManager to recognize which
resource type a set of hints originated from, should this information
become useful in the future.
isKernelPid should explicitly check the error returned from os.Readlink and return true
only if the error value is ENOENT. Without this fix, if Readlink
returned say ENAMETOOLONG or EACESS, we would still count the process as
a kernel process (which is not true).
The container manager used in kubelet checks for docker daemon process either via pidfile
or process name. While the pidfile points to the docker daemon process PID, the
dockerProcessName constant stores a docker cli name ( docker ) instead of docker daemon
name ( dockerd ).
This patch fixes a bug in the CPUManager, whereby it doesn't honor the
"effective requests/limits" of a Pod as defined by:
https://kubernetes.io/docs/concepts/workloads/pods/init-containers/#resources
The rule states that a Pod’s "effective request/limit" for a resource
should be the larger of:
* The highest of any particular resource request or limit
defined on all init Containers
* The sum of all app Containers request/limit for a
resource
Moreover, the rule states that:
* The effective QoS tier is the same for init Containers
and app containers alike
This means that the resource requests of init Containers and app
Containers should be able to overlap, such that the larger of the two
becomes the "effective resource request/limit" for the Pod. Likewise,
if a QoS tier of "Guaranteed" is determined for the Pod, then both init
Containers and app Containers should run in this tier.
In its current implementation, the CPU manager honors the effective QoS
tier for both init and app containers, but doesn't honor the "effective
request/limit" correctly.
Instead, it treats the "effective request/limit" as:
* The sum of all init Containers plus the sum of all app
Containers request/limit for a resource
It does this by not proactively removing the CPUs given to previous init
containers when new containers are being created. In the worst case,
this causes the CPUManager to give non-overlapping CPUs to all
containers (whether init or app) in the "Guaranteed" QoS tier before any
of the containers in the Pod actually start.
This effectively blocks these Pods from running if the total number of
CPUs being requested across init and app Containers goes beyond the
limits of the system.
This patch fixes this problem by updating the CPUManager static policy
so that it proactively removes any guaranteed CPUs it has granted to
init Containers before allocating CPUs to app containers. Since all init
container are run sequentially, it also makes sure this proactive
removal happens for previous init containers when allocating CPUs to
later ones.
Previously, the topologymanager would simply fall back to the None() policy
if an invalid policy was specified. This patch updates this to return an
error when an invalid policy is passed, forcing the kubelet to fail
fast when this occurs.
These semantics should be preferable because an invalid policy likely
indicates operator error in setting the policy flag on the kubelet
correctly (e.g. misspelling 'strict' as 'striict'). In this case it is
better to fail fast so the operator can detect this and correct the
mistake, than to mask the error and essentially disable the
topologymanager unexpectedly.
Previously, the cpumanager would simply fall back to the None() policy
if an invalid policy was specified. This patch updates this to return an
error when an invalid policy is passed, forcing the kubelet to fail
fast when this occurs.
These semantics should be preferable because an invalid policy likely
indicates operator error in setting the policy flag on the kubelet
correctly (e.g. misspelling 'static' as 'statiic'). In this case it is
better to fail fast so the operator can detect this and correct the
mistake, than to mask the error and essentially disable the cpumanager
unexpectedly.
Use the exported list from runc that uses "KB" and not "kB".
This issue breaks kubelet on AArch64 (arm 64).
var HugePageSizeUnitList = []string{"B", "KB", "MB", "GB", "TB", "PB"}
The hugetlb cgroup control files (introduced here in 2012:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=abb8206cb0773)
use "KB" and not "kB"
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/hugetlb_cgroup.c?h=v5.0#n349).
The behavior in the kernel has not changed since the introduction, and
the current code using "kB" will therefore fail on devices with huge
pages smaller than 1MiB. This is the case for AArch64.
As seen from the code in "mem_fmt" inside hugetlb_cgroup.c, only "KB",
"MB" and "GB" are used, so the others may be removed as well.
Here is a real world example of the files inside the
"/sys/kernel/mm/hugepages/" directory:
- "hugepages-64kB"
- "hugepages-2048kB"
- "hugepages-32768kB"
- "hugepages-1048576kB"
And the corresponding cgroup files:
- "hugetlb.64KB._____"
- "hugetlb.2MB._____"
- "hugetlb.32MB._____"
- "hugetlb.1GB._____"
Signed-off-by: Odin Ugedal <odin@ugedal.com>
Fix an issue in which, when trying to specify the `--kube-reserved-cgroup`
(or `--system-reserved-cgroup`) with `--cgroup-driver=systemd`, we will
not properly convert the `systemd` cgroup name into the internal cgroup
name that k8s expects. Without this change, specifying
`--kube-reserved-cgroup=/test.slice --cgroup-driver=systemd` will fail,
and only `--kube-reserved-cgroup=/test --crgroup-driver=systemd` will succeed,
even if the actual cgroup existing on the host is `/test.slice`.
Additionally, add light unit testing of our process from converting to a
systemd cgroup name to kubernetes internal cgroup name.
These updates are based on discussions had about the preferred semantics
of the TopologyManager and will be reflected in changes to an upcoming
PR that adds the actual TopologyManager implementation.
The cpumanager loops through all init Containers and app Containers when
reconciling its state. However, the current implementation of
findContainerIDByName(), which is call by the reconciler, does not
resolve for init Containers.
This patch updates findContainerIDByName() to account for init
Containers and adds a regression test that fails before the change and
succeeds after.