Commit Graph

1030 Commits

Author SHA1 Message Date
Kevin Klues
876dd9b078 Added algorithm to CPUManager to distribute CPUs across NUMA nodes
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-10-16 19:31:02 +00:00
Kevin Klues
462544d079 Split CPUManager takeByTopology() into two different algorithms
The first implements the original algorithm which packs CPUs onto NUMA nodes if
more than one NUMA node is required to satisfy the allocation. The second
disitributes CPUs across NUMA nodes if they can't all fit into one.

The "distributing" algorithm is currently a noop and just returns an error of
"unimplemented". A subsequent commit will add the logic to implement this
algorithm according to KEP 2902:

https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2902-cpumanager-distribute-cpus-policy-option

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-10-16 14:46:19 +00:00
Kevin Klues
0e7928edce Add new CPUManager policy option for "distribute-cpus-across-numa"
This commit only adds the option to the policy options framework. A
subsequent commit will add the logic to utilize it.

The KEP describing this new option can be found here:
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2902-cpumanager-distribute-cpus-policy-option

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-10-16 14:46:19 +00:00
Francesco Romani
4bae656835 cpumanager: test NUMA node support for CPU assign (2)
This batch of tests adds a fake topology on which each numa node
has multiple sockets. We didn't find yet a real HW topology in the wild
like this, but we need one to fully exercise the code.

So, until we find a HW topology, we add a fake one flipping
the NUMA/socket config of the existing xeon dual gold 6320.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-10-15 10:29:21 +00:00
Francesco Romani
547996f3f6 cpumanager: test NUMA node support for CPU assign (1)
This batch of tests adds a real topology on which each physical socket
has multiple NUMA zones. Taken by a real dual xeon 6320 gold.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-10-15 10:29:21 +00:00
Francesco Romani
f6ccc4426a cpumanager: test: use proper subtests
The exisiting unit tests where performing subtests without
actually using the full features of the testing package
(https://pkg.go.dev/testing#hdr-Subtests_and_Sub_benchmarks)

Update them with fairly minimal changes. The patch is deceptively
large because we need to move the code inside a new block.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-10-15 10:29:21 +00:00
Francesco Romani
15caa134b2 cpumanager: topology: use rich cmp package
User the `cmp.Diff` package in the unit tests, moving away from
`reflect.DeepEqual`. This gives us a clearer picture of the differences
when the tests fail.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-10-15 10:29:21 +00:00
Kevin Klues
aff54a0914 Abstract out whether NUMA or Sockets come first in the memory hierarchy
This allows us to get rid of the check for determining which one is higher all
throughout the code. Now we just check once and instantiate an interface of the
appropriate type that makes sure the ordering in the hierarchy is preserved
through the appropriate calls.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-10-15 10:29:15 +00:00
Kevin Klues
17c7e86c6d Add NUMA support to the CPU assignment algorithm in the CPUManager
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-10-15 08:35:59 +00:00
Kubernetes Prow Robot
63f66e6c99
Merge pull request #105012 from fromanirh/cpumanager-policy-options-beta
node: graduate CPUManagerPolicyOptions to beta
2021-10-08 07:32:59 -07:00
Alexey Perevalov
5d9032007a Return only isolated cpus in podresources interface
Co-Authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>
2021-10-07 15:34:08 +01:00
Kubernetes Prow Robot
c4d802b0b5
Merge pull request #103289 from AlexeyPerevalov/DoNotExportEmptyTopology
podresources: do not export empty NUMA topology
2021-10-07 07:11:46 -07:00
Kubernetes Prow Robot
c91f9bdc60
Merge pull request #104689 from cynepco3hahue/memory_manager_restricted_policy_fix
kubelet: memory manager: fix preferred topology hints calculation
2021-10-05 06:47:08 -07:00
Kubernetes Prow Robot
883250145c
Merge pull request #104788 from 249043822/memorymanager-br
Fix initContainersReusableMemory delete bug in MemoryManager
2021-10-01 05:27:22 -07:00
Francesco Romani
077c0aa1be node: graduate CPUManagerPolicyOptions to beta
We graduate the `CPUManagerPolicyOptions` feature to beta
in the 1.23 cycle, and we add new experimental feature gates
to guard new options which are planned in the 1.23 and in the
following cycles.

We introduce additional feature gate called `CPUManagerPolicyAlphaOptions` and
`CPUManagerPolicyBetaOptions`. The basic idea is to avoid the
cumbersome process of adding a feature gate for each option, and to have
feature gates which track the maturity level of _groups_ of options.
Besides this change, the graduation process, and the process in general,
for adding new policy options is still unchanged.

The `full-pcpus-only` option added in the 1.22 cycle is intentionally
moved into the beta policy options

For more details:
- KEP: https://github.com/kubernetes/enhancements/pull/2933
- sig-arch discussion:
  https://groups.google.com/u/1/g/kubernetes-sig-architecture/c/Nxsc7pfe5rw

Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-09-29 11:40:03 +02:00
Kubernetes Prow Robot
2541fcf256
Merge pull request #104123 from fromanirh/podresources-not-report-unhealthy-devices
devicemanager: skip unhealthy devices in GetAllocatable
2021-09-23 05:39:21 -07:00
Francesco Romani
1b6efa5e21 devicemanager: skip unhealthy devs in GetAllocatable
The GetAllocatableDevices, needed to support the podresources
API, doesn't take into account the device health when computing
its output.

In this PR we address this gap and add unit tests along the way
to prevent regressions. This gives us a good initial coverage,
E2E tests to cover this case are much harder to write, because
we would need to inject faults to trigger the unhealthy status.
We will evaluate if adding these tests into later PRs.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-09-22 19:20:04 +02:00
Ricardo Pchevuzinske Katz
37d11bcdaf Move node and networking related helpers from pkg/util to component helpers
Signed-off-by: Ricardo Katz <rkatz@vmware.com>
2021-09-16 17:00:19 -03:00
KeZhang
a629ceeb58 Fix initContainersReusableMemory delete bug 2021-09-15 10:04:49 +08:00
Shiming Zhang
7706d3d281 pkg/kubelet/cm/memorymanager: Fix ErrorS key/value pair 2021-09-06 17:37:04 +08:00
Artyom Lukianov
9ea9798759 kubelet: memory manager: fix topology preferred topology hints calculation
Prevent starting pods with resources satisfied by a single NUMA node on multiple NUMA nodes.
The code returned before it updated the minimal amount of NUMA nodes that can satisfy the container
requests.

Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
2021-08-31 17:46:59 +03:00
Kubernetes Prow Robot
cbd0611d49
Merge pull request #104528 from kolyshkin/runc-1.0.2
vendor: bump runc to 1.0.2
2021-08-25 18:17:23 -07:00
Stephen Augustus
481cf6fbe7
generated: Run hack/update-gofmt.sh
Signed-off-by: Stephen Augustus <foo@auggie.dev>
2021-08-24 15:47:49 -04:00
Alexey Perevalov
bb81101570 podresource: do not export NUMA topology if it's empty
If device plugin returns device without topology, keep it internaly
as NUMA node -1, it helps at podresources level to not export NUMA
topology, otherwise topology is exported with NUMA node id 0,
which is not accurate.

It's imposible to unveile this bug just by tracing json.Marshal(resp)
in podresource client, because NUMANodes field ID has json property
omitempty, in this case when ID=0 shown as emtpy NUMANode.
To reproduce it, better to iterate on devices and just
trace dev.Topology.Nodes[0].ID.

Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>
2021-08-24 15:38:21 +00:00
Kir Kolyshkin
c06a851042 pkg/kubelet/cm: use SkipFreezeOnSet
This is a knob added by runc 1.0.2 specifically for kubernetes,
which tells runc/libcontainer/cgroups/systemd v1 manager to not
freeze the cgroup in Set().

We set this knob here because this code is only used for pods
(rather than containers) management, and in this place we create or
update the pod cgroup with no device limits set, so we can skip the
freeze.

If this knob is not set, libcontainer's cgroup v1 manager tries to
figure out whether the freeze is needed or not, but it's a somewhat
expensive check to perform, thus the knob is a shortcut.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-08-23 13:41:51 -07:00
Kubernetes Prow Robot
a9aad7e034
Merge pull request #103107 from pacoxu/fix-93300
ResourceConfigForPod: check initContainers as other QoS func
2021-08-17 11:41:37 -07:00
Artyom Lukianov
73a5cce3e6 device manager: do not clean admitted pods from the state
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
2021-08-08 16:46:06 +03:00
Artyom Lukianov
93a237abd8 memory manager: do not clean admitted pods from the state
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
2021-08-08 16:46:06 +03:00
Artyom Lukianov
66babd1a90 cpu manager: do not clean admitted pods from the state
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
2021-08-08 16:46:06 +03:00
Wesley Williams
ff165c8823
Replace usage of Whitelist with Allowlist within Kubelet's sysctl package (#102298)
* Change uses of whitelist to allowlist in kubelet sysctl

* Rename whitelist files to allowlist in Kubelet sysctl

* Further renames of whitelist to allowlist in Kubelet

* Rename podsecuritypolicy uses of whitelist to allowlist

* Update pkg/kubelet/kubelet.go

Co-authored-by: Danielle <dani@builds.terrible.systems>

Co-authored-by: Danielle <dani@builds.terrible.systems>
2021-08-04 18:59:35 -07:00
Kir Kolyshkin
e5b434e990 kubelet/cm: don't set Devices
Since runc 1.0.0 it is now sufficient to have SkipDevices: true.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-07-16 12:45:35 -07:00
Francesco Romani
23abdab2b7 smtalign: propagate policy options to policies
Consume in the static policy the cpu manager policy options from
the cpumanager instance.
Validate in the none policy if any option is given, and fail if so -
this is almost surely a configuration mistake.

Add new cpumanager.Options type to hold the options and translate from
user arguments to flags.

Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-07-08 23:15:37 +02:00
Francesco Romani
6dcec345df smtalign: cm: factor out admission response
Introduce a new `admission` subpackage to factor out the responsability
to create `PodAdmitResult` objects. This enables resource manager
to report specific errors in Allocate() and to bubble up them
in the relevant fields of the `PodAdmitResult`.

To demonstrate the approach we refactor TopologyAffinityError as a
proper error.

Co-authored-by: Kevin Klues <kklues@nvidia.com>
Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-07-08 23:15:37 +02:00
Francesco Romani
c5cb263dcf smtalign: propagate policy options to cpumanager
The CPUManagerPolicyOptions received from the kubelet config/command line args
is propogated to the Container Manager.

We defer the consumption of the options to a later patch(set).

Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-07-08 23:15:35 +02:00
Li Bo
c3d9b10ca8 feature: support Memory QoS for cgroups v2 2021-07-08 09:26:46 +08:00
Akihiro Suda
dbe0155139
kubelet/cm: ignore sysctl error when running in userns
Errors during setting the following sysctl values are ignored:
- vm.overcommit_memory
- vm.panic_on_oom
- kernel.panic
- kernel.panic_on_oops
- kernel.keys.root_maxkeys
- kernel.keys.root_maxbytes

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2021-07-07 14:23:29 +09:00
Kubernetes Prow Robot
eae87bfe7e
Merge pull request #103483 from odinuge/revert-102508-runc-1.0
Revert "Update runc to 1.0.0"
2021-07-06 10:42:56 -07:00
Artyom Lukianov
bb6d5b1f95 memory manager: provide unittests for init containers re-use
- provide tests for static policy allocation, when init containers
requested memory bigger than the memory requested by app containers
- provide tests for static policy allocation, when init containers
requested memory smaller than the memory requested by app containers
- provide tests to verify that init containers removed from the state
file once the app container started

Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
2021-07-05 20:52:25 +03:00
Artyom Lukianov
960da7895c memory manager: remove init containers once app container started
Remove init containers from the state file once the app container started,
it will release the memory allocated for the init container and can intense
the density of containers on the NUMA node in cases when the memory allocated
for init containers is bigger than the memory allocated for app containers.

Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
2021-07-05 20:52:25 +03:00
Artyom Lukianov
b965502c49 memory manager: re-use the memory allocated for init containers
The idea that during allocation phase we will:

- during call to `Allocate` and `GetTopologyHints`  we will take into account the init containers reusable memory,
which means that we will re-use the memory and update container memory blocks accordingly.
For example for the pod with two init containers that requested: 1Gi and 2Gi,
and app container that requested 4Gi, we can re-use 2Gi of memory.

Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
2021-07-05 20:52:25 +03:00
Odin Ugedal
61d88af9e4
Revert "Update runc to 1.0.0" 2021-07-05 14:03:04 +02:00
Kir Kolyshkin
ab5b77944e kubelet/cm: don't set Devices
Since runc 1.0.0 it is now sufficient to have SkipDevices: true.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-30 16:17:35 -07:00
pacoxu
f2eec0a816 ResourceConfigForPod: check initContainers as other QoS func
Signed-off-by: pacoxu <paco.xu@daocloud.io>
2021-06-28 19:22:42 +08:00
Kubernetes Prow Robot
07358f1663
Merge pull request #103146 from tech-geek29/fix-95380
Change log level to Debug
2021-06-25 07:44:45 -07:00
Rishabh Jain
8f08db9164 Change log level to Debug 2021-06-24 14:23:06 +05:30
Kenta Tada
89a4d4b071 kubelet: modify the function of getCgroupSubsystemsV2 to use libcontainer API 2021-06-24 16:58:05 +09:00
Kubernetes Prow Robot
985ac8ae50
Merge pull request #101030 from cynepco3hahue/pod_resources_memory_interface
Extend pod resource API response to return the information from memory manager
2021-06-22 06:35:58 -07:00
Artyom Lukianov
03830db82d Implement all necessary methods to provide memory manager data under pod resources metrics
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
2021-06-22 13:06:32 +03:00
Kubernetes Prow Robot
3bd29bc53d
Merge pull request #102829 from snowplayfire/update-devicemanager
Add resource capacity to ListAndWatch grpc logging
2021-06-21 16:28:09 -07:00
jingxueli
45d18acbcc add info for possible failed listAndWatch grpc call 2021-06-17 16:25:20 +08:00