Commit Graph

1127 Commits

Kubernetes Prow Robot
127f33f63d Merge pull request #111221 from inosato/remove-ioutil-from-kubelet
Remove ioutil in kubelet/kubeadm and its tests
2022-09-17 21:56:28 -07:00
Kubernetes Prow Robot
c45ca46cdb Merge pull request #112387 from mythi/kubelet-devicemanager-topologyinfo
devicemanager: do not leak empty TopologyInfo to TopologyManager
2022-09-14 07:17:00 -07:00
Mikko Ylinen
68bb0935bd devicemanager: do not leak empty TopologyInfo to TopologyManager
Device Plugins that wish to leverage the Topology Manager can send back a populated
TopologyInfo struct as part of the device registration, along with the device IDs
and the health of the device. TopologyInfo is converted to TopologyHints and
used by TopologyManager to find the optimal/desired resource allocation for a Pod.

If a plugin sends an empty but non-nil instance of TopologyInfo for a resource,
devicemanager passes it on as an empty instance of TopologyHint which is
currently interpreted as "Hint Provider has no possible NUMA affinities
for resource" which further means that pods requesting that resource will fail.

To avoid blocking device resources that pass TopologyInfo{Nodes: []*NUMANode{}}
from being used, interpret that as a nil set of hints rather than a
[]TopologyHint{}.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2022-09-14 16:13:31 +03:00
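
The nil-versus-empty distinction the fix relies on can be shown in a minimal Go sketch; the type and helper below are illustrative stand-ins for the kubelet's topologymanager types, not the actual code:

```go
package main

import "fmt"

// TopologyHint loosely mirrors the kubelet's topologymanager hint type
// (illustrative stand-in for this sketch).
type TopologyHint struct {
	NUMANodeAffinity []int
	Preferred        bool
}

// hintsForTopology sketches the fixed behavior: an empty (but non-nil)
// list of NUMA nodes is treated as "no preference" (nil hints), not as
// "no possible affinity" ([]TopologyHint{}), which would fail admission.
func hintsForTopology(numaNodes []int) []TopologyHint {
	if len(numaNodes) == 0 {
		return nil // no preference; the TopologyManager stays unrestricted
	}
	hints := make([]TopologyHint, 0, len(numaNodes))
	for _, n := range numaNodes {
		hints = append(hints, TopologyHint{NUMANodeAffinity: []int{n}, Preferred: true})
	}
	return hints
}

func main() {
	fmt.Println(hintsForTopology([]int{}) == nil)   // true: no preference
	fmt.Println(len(hintsForTopology([]int{0, 1}))) // 2: one hint per node
}
```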
Dmitry Verkhoturov
d0f9e6dc36 clarify CPUCFSQuotaPeriod values, set the minimum to 1ms
cpu.cfs_period_us is measured in microseconds in the kernel but
provided as a time.Duration by the user; this change clarifies the code
to make that evident to the reader.

Also, the minimum value for that feature is 1ms and not 1μs, and this
change alters the validation to reject values smaller than 1ms.
2022-09-08 23:29:13 +02:00
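
A minimal sketch of the described semantics, under illustrative names (this is not the kubelet's actual validation code): the user supplies a time.Duration, the kernel consumes microseconds, and values below 1ms are rejected.

```go
package main

import (
	"fmt"
	"time"
)

// validateCPUCFSQuotaPeriod converts the user-facing duration into the
// microseconds the kernel expects, rejecting anything below the 1ms
// minimum described above. Hypothetical helper for illustration.
func validateCPUCFSQuotaPeriod(period time.Duration) (int64, error) {
	if period < time.Millisecond {
		return 0, fmt.Errorf("cpuCFSQuotaPeriod %v is below the 1ms minimum", period)
	}
	return period.Microseconds(), nil // cpu.cfs_period_us is in microseconds
}

func main() {
	us, _ := validateCPUCFSQuotaPeriod(100 * time.Millisecond)
	fmt.Println(us) // 100000

	_, err := validateCPUCFSQuotaPeriod(500 * time.Microsecond)
	fmt.Println(err) // rejected: below the 1ms minimum
}
```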
Kubernetes Prow Robot
bc9f48b841 Merge pull request #112024 from cndoit18/remove-redundant-judgment
style: remove redundant judgment
2022-08-25 07:28:18 -07:00
Kubernetes Prow Robot
2b5475b3fa Merge pull request #111554 from paskal/paskal/clarify_default_cfs_period
Clarify cpu.cfs_period_us default value
2022-08-25 07:28:07 -07:00
cndoit18
ec43037d0f style: remove redundant judgment
Signed-off-by: cndoit18 <cndoit18@outlook.com>
2022-08-25 12:07:36 +08:00
Kubernetes Prow Robot
442574f3a7 Merge pull request #111513 from jingxu97/july/localstorage
Promote Local storage capacity isolation feature to GA
2022-08-03 13:05:59 -07:00
jinxu
0064010cdd Promote Local storage capacity isolation feature to GA
This change promotes the local storage capacity isolation feature to GA.

At the same time, to allow rootless systems to disable this feature (since
they are unable to get the root filesystem), this change introduces a new
kubelet config field, "localStorageCapacityIsolation". By default it is set
to true; rootless systems can set it to false to disable the feature. Once
it is set to false, users cannot set ephemeral-storage requests/limits
because capacity and allocatable will not be set.

Change-Id: I48a52e737c6a09e9131454db6ad31247b56c000a
2022-08-02 23:45:48 -07:00
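
A sketch of how such a knob can be resolved, with illustrative names (the real field lives in the kubelet's KubeletConfiguration type):

```go
package config

// KubeletConfiguration is an illustrative stand-in holding only the
// field described above.
type KubeletConfiguration struct {
	// LocalStorageCapacityIsolation defaults to true; rootless systems
	// that cannot read root filesystem stats may set it to false.
	LocalStorageCapacityIsolation *bool
}

// localStorageEnabled resolves the pointer with the documented default.
// When this returns false, ephemeral-storage capacity/allocatable are not
// reported, so ephemeral-storage requests/limits cannot be honored.
func localStorageEnabled(cfg *KubeletConfiguration) bool {
	if cfg.LocalStorageCapacityIsolation == nil {
		return true // unset means enabled (the default)
	}
	return *cfg.LocalStorageCapacityIsolation
}
```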
Vinay Kulkarni
0ef263c3b0 CRI changes to support implementation of in-place pod resize.
KEP: /enhancements/keps/sig-node/1287-in-place-update-pod-resources
2022-08-02 15:08:25 -07:00
Arpit Singh
d92fd8392d Adding unit test for align-by-socket policy option
Also addressed review comments as part of the same commit.
2022-08-02 11:02:07 -07:00
Arpit Singh
06f347f645 Adding validity checks for topology manager align-by-socket
2022-08-02 11:02:07 -07:00
Arpit Singh
35849bf7fb KEP-3327: Add CPUManager policy option to align CPUs by Socket instead of by NUMA node
2022-08-02 11:02:07 -07:00
inosato
3b95d3b076 Remove ioutil in kubelet and its tests
Signed-off-by: inosato <si17_21@yahoo.co.jp>
2022-07-30 12:35:26 +09:00
Kubernetes Prow Robot
5d446b205e Merge pull request #106244 from cncal/fix-state-checkpoint-testcase
fix test for CheckpointStateRestore
2022-07-29 15:41:14 -07:00
Dmitry Verkhoturov
5126192548 clarify cpu.cfs_period_us default value
cpu.cfs_period_us is 100μs by default despite having an "ms" unit
for some unfortunate reason. Documentation:
https://www.kernel.org/doc/html/latest/scheduler/sched-bwc.html#management

The desired effect of this change is more clarity on the default value,
so that users are aware that a custom value of 10ms is not 0.1x of the
default but 100x of it.
2022-07-29 23:02:35 +02:00
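
Taking the commit's stated default at face value, the ratio works out as follows:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	def := 100 * time.Microsecond // default as stated in the message above
	custom := 10 * time.Millisecond
	fmt.Println(int64(custom / def)) // 100: 10ms is 100x the default, not 0.1x
}
```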
Kubernetes Prow Robot
3ffdfbe286 Merge pull request #111254 from dims/update-to-golang-1.19-rc2
[golang] Update to 1.19rc2 (from 1.18.3)
2022-07-26 14:25:09 -07:00
Kubernetes Prow Robot
631a5a849a Merge pull request #109778 from mythi/grpc-go-update
grpc: move to use grpc.WithTransportCredentials()
2022-07-26 12:45:09 -07:00
Davanum Srinivas
a9593d634c Generate and format files
- Run hack/update-codegen.sh
- Run hack/update-generated-device-plugin.sh
- Run hack/update-generated-protobuf.sh
- Run hack/update-generated-runtime.sh
- Run hack/update-generated-swagger-docs.sh
- Run hack/update-openapi-spec.sh
- Run hack/update-gofmt.sh

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2022-07-26 13:14:05 -04:00
Kubernetes Prow Robot
0f3bf88a91 Merge pull request #108682 from chymy/nilpointer
Method call 'err.Error()' might lead to a nil pointer dereference for pkg/kubelet/cm/cpumanager/cpu_assignment_test.go
2022-06-27 19:15:56 -07:00
Kubernetes Prow Robot
60902b7caf Merge pull request #109692 from yxxhero/remove_ioutil_in_kubelet
remove ioutil in kubelet
2022-06-03 09:30:51 -07:00
Mikko Ylinen
2c8bfad910 grpc: move to use grpc.WithTransportCredentials()
grpc-go v1.43.0 marked grpc.WithInsecure() as deprecated, so this commit
moves to the recommended replacement:

grpc.WithTransportCredentials(insecure.NewCredentials())

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2022-05-30 21:41:47 +03:00
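
A minimal sketch of the replacement; the behavior is unchanged (plaintext transport), and the socket target below is illustrative:

```go
package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// dial replaces the deprecated grpc.WithInsecure() with the recommended
// grpc.WithTransportCredentials(insecure.NewCredentials()).
func dial() (*grpc.ClientConn, error) {
	return grpc.Dial("unix:///var/lib/kubelet/device-plugins/kubelet.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
}
```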
Paco Xu
0ec7e38ef0 fix data race in device manager plugin handler
2022-05-07 11:18:23 +08:00
Kubernetes Prow Robot
dbf2f1d833 Merge pull request #109103 from Dingshujie/fix_memory_leak
cpu/memory manager containerMap memory leak
2022-05-03 18:24:43 -07:00
Kubernetes Prow Robot
05e3919b45 Merge pull request #109016 from klueska/refactor-devicemanager
Refactor all device-plugin logic into separate 'plugin' package under the devicemanager
2022-05-03 18:24:12 -07:00
Kevin Klues
57f8b31b42 Update tests to accommodate devicemanager refactoring
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2022-04-29 10:52:37 +00:00
Kevin Klues
f6eaa25b71 Move DevicePluginStub implementation into new plugin package
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2022-04-29 10:52:37 +00:00
Kevin Klues
db88676c20 Refactor all device plugin logic into separate 'plugin' package
This is the first step towards being able to support a new plugin API version
in parallel with the existing one.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2022-04-29 10:52:37 +00:00
yxxhero
4fac7486d4 remove ioutil in kubelet
Signed-off-by: yxxhero <aiopsclub@163.com>
2022-04-27 21:08:42 +08:00
cncal
ab945d21ad reorder the import packages
2022-04-09 11:30:26 +08:00
cncal
fa1d1edbef use require to simplify testcases
2022-04-09 11:30:26 +08:00
cncal
a64b9cee21 fix test for CheckpointStateRestore
2022-04-09 11:30:26 +08:00
DingShujie
fb3636da40 cpu manager policy set to none: nothing removes container IDs from the container map, leading to a memory leak
2022-03-30 23:25:05 +08:00
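
The leak's shape in miniature, with stand-in types (not the kubelet's actual containerMap): the map must be pruned on container removal even when the policy itself has nothing to undo.

```go
package main

import "fmt"

// containerMap is an illustrative stand-in: containerID -> pod/container ref.
type containerMap map[string]string

func (cm containerMap) Remove(containerID string) {
	delete(cm, containerID)
}

func main() {
	cm := containerMap{"abc123": "pod-a/ctr-1"}
	// Before the fix: with the "none" policy, nothing called Remove, so
	// entries like "abc123" lived for the kubelet's lifetime.
	cm.Remove("abc123")
	fmt.Println(len(cm)) // 0
}
```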
Jack Francis
ab14cba2cf kubelet: more resilient node allocatable ephemeral-storage data getter
2022-03-29 18:13:57 -07:00
Kir Kolyshkin
37761a329e pkg/kubelet: changes to update runc to 1.1.0
The changes (mostly in pkg/kubelet/cm) adopt the changed
runc 1.1 API and simplify things a bit. In particular:

1. simplify cgroup manager instantiation, using the new, easier
   libcontainer/cgroups/manager.New;

2. replace libcontainerAdapter with a boolean variable (all it did
   was pass on whether the systemd manager should be used);

3. trivial change due to the removed cgroupfs.HugePageSizes and the
   added cgroups.HugePageSizes();

4. do not calculate cgroup paths in update / destroy, since libcontainer
   cgroup managers now calculate the paths upon creation (previously,
   they were doing that only in Apply, so using e.g. Set or Destroy right
   after creation was impossible without specifying paths).

We currently still calculate cgroup paths in Exists -- this is to be
addressed separately.

Co-Authored-By: Elana Hashman <ehashman@redhat.com>
2022-03-28 16:23:20 -07:00
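
A sketch of points 1 and 4 against runc 1.1's API, as an assumed usage pattern rather than the kubelet's actual code:

```go
package cm

import (
	"github.com/opencontainers/runc/libcontainer/cgroups/manager"
	"github.com/opencontainers/runc/libcontainer/configs"
)

// newAndSet sketches the simplified instantiation: manager.New picks the
// right implementation (cgroupfs or systemd, v1 or v2) from the config,
// replacing hand-rolled selection logic.
func newAndSet(useSystemd bool) error {
	cg := &configs.Cgroup{
		Name:      "kubepods", // illustrative cgroup name
		Parent:    "/",
		Systemd:   useSystemd, // replaces the old libcontainerAdapter boolean
		Resources: &configs.Resources{},
	}
	mgr, err := manager.New(cg)
	if err != nil {
		return err
	}
	// Per point 4, paths are computed at creation, so Set (or Destroy)
	// works right away, without a prior Apply.
	return mgr.Set(cg.Resources)
}
```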
Kubernetes Prow Robot
dbd37cb8a8 Merge pull request #108831 from waynepeking348/skip_re_allocate_logic_if_pod_id_already_removed
skip re-allocate logic if pod is already removed to avoid panic
2022-03-27 11:37:21 -07:00
waynepeking348
6157d3cc4a skip deleted activePods and return nil
2022-03-27 20:35:09 +08:00
Kubernetes Prow Robot
75b19b242c Merge pull request #108597 from kolyshkin/prepare-for-runc-1.1
kubelet/cm: refactor, prepare for runc 1.1 bump
2022-03-23 11:20:30 -07:00
waynepeking348
35a456b0c6 skip reallocate logic if pod is already removed
2022-03-20 21:09:47 +08:00
chymy
5374f6fad8 Fix comment typo
Signed-off-by: chymy <chang.min1@zte.com.cn>
2022-03-14 16:53:29 +08:00
chymy
7ed6fa7b2e Method call 'err.Error()' might lead to a nil pointer dereference for pkg/kubelet/cm/cpumanager/cpu_assignment_test.go
Signed-off-by: chymy <chang.min1@zte.com.cn>
2022-03-14 16:35:11 +08:00
Steve Kuznetsov
8f2bc39f72 kubelet: cgroups: be verbose about validation
Previously, callers of `Exists()` would not know why the cgroup did or
did not exist. At one call site in particular, the `kubelet` would
entirely fail to start if the cgroup validation did not succeed. In
these cases we MUST explain what went wrong and pass that information
clearly to the caller. Previously, some but not all of the reasons for
invalidation were logged at a low log level instead, which led to poor
UX.

The original method was retained on the interface so as to make this
diff small.

Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
2022-03-10 07:25:33 -08:00
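
A sketch of the shape of the change, under assumed names and signatures (the actual kubelet interface may differ): Validate reports *why* a cgroup is invalid, and Exists stays as a thin wrapper.

```go
package cm

import (
	"fmt"
	"os"
)

type CgroupName []string

type cgroupManager struct {
	// paths is a hypothetical helper returning the per-controller
	// filesystem paths for a cgroup.
	paths func(name CgroupName) []string
}

// Validate surfaces the reason for invalidation instead of collapsing
// it into a bool (or a low-level log line).
func (m *cgroupManager) Validate(name CgroupName) error {
	for _, p := range m.paths(name) {
		if _, err := os.Stat(p); err != nil {
			return fmt.Errorf("cgroup %v invalid: controller path %q: %w", name, p, err)
		}
	}
	return nil
}

// Exists is retained so existing callers keep working.
func (m *cgroupManager) Exists(name CgroupName) bool {
	return m.Validate(name) == nil
}
```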
Kir Kolyshkin
de5a69d847 pkg/kubelet/cm: fix potential nil dereference in enforceExistingCgroup
Move the rl == nil check to before we dereference it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-03-08 17:05:46 -08:00
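
The bug class in miniature, with stand-in types: the nil check must come before the first dereference.

```go
package main

import "fmt"

type ResourceList struct{ Memory int64 }

// enforceExisting checks rl == nil *before* using it, as in the fix;
// dereferencing first would panic whenever rl is nil.
func enforceExisting(rl *ResourceList) error {
	if rl == nil {
		return fmt.Errorf("resource list is nil")
	}
	fmt.Println("enforcing memory limit:", rl.Memory) // safe: rl != nil
	return nil
}

func main() {
	fmt.Println(enforceExisting(nil)) // returns an error, no panic
	_ = enforceExisting(&ResourceList{Memory: 64})
}
```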
Kir Kolyshkin
9652d0cedc pkg/kubelet/cm: move common code to libctCgroupConfig
Instead of doing (almost) the same thing from the three different
methods (Create, Update, Destroy), move the functionality to
libctCgroupConfig, replacing updateSystemdCgroupInfo.

The needResources bool is needed because we do not need resources
during Destroy, so we skip the unneeded resource conversion.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-03-08 17:05:46 -08:00
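
A sketch of the consolidated helper with stand-in types (the real code builds a libcontainer configs.Cgroup):

```go
package cm

// Illustrative stand-ins for the kubelet/libcontainer types involved.
type ResourceConfig struct{ Memory *int64 }
type Resources struct{ Memory int64 }
type Cgroup struct {
	Name      string
	Resources *Resources
}

func toResources(rc *ResourceConfig) *Resources {
	r := &Resources{}
	if rc != nil && rc.Memory != nil {
		r.Memory = *rc.Memory
	}
	return r
}

// libctCgroupConfig is the shared helper: Create, Update, and Destroy all
// build their libcontainer config here. Destroy passes needResources=false
// to skip the resource conversion it does not need.
func libctCgroupConfig(name string, rc *ResourceConfig, needResources bool) *Cgroup {
	cg := &Cgroup{Name: name}
	if needResources {
		cg.Resources = toResources(rc)
	}
	return cg
}
```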
Kir Kolyshkin
11b0d57c93 pkg/kubelet/cm/cgroup_manager: simplify setting hugetlb
Commit 79be8be10e made hugetlb settings optional if cgroup v2 is used and
hugetlb is not available, fixing issue 92933. Note that at the time this
was only needed for v2, because for v1 the resources were set one by one,
and only for supported resources.

Commit d312ef7eb6 switched the code to using Set from runc/libcontainer
cgroups manager, and expanded the check to cgroup v1 as well.

Move this check earlier, to inside m.toResources, so that instead of
converting all hugetlb resources from ResourceConfig to libcontainer's
Resources.HugetlbLimit and then setting it to nil, we can skip the
conversion entirely when hugetlb is not supported, avoiding work that
is not needed.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-03-08 17:05:46 -08:00
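
The skip in sketch form, with stand-in types: when hugetlb is unsupported, the conversion never runs, instead of running and being discarded.

```go
package cm

// Illustrative stand-ins for the types involved.
type ResourceConfig struct {
	HugePageLimits map[string]int64 // pagesize -> bytes
}

type HugetlbLimit struct {
	Pagesize string
	Limit    int64
}

type Resources struct {
	HugetlbLimit []*HugetlbLimit
}

// toResources skips the per-pagesize conversion entirely when hugetlb is
// not supported, rather than converting and then discarding the result.
func toResources(rc *ResourceConfig, hugetlbSupported bool) *Resources {
	r := &Resources{}
	if !hugetlbSupported {
		return r
	}
	for pageSize, limit := range rc.HugePageLimits {
		r.HugetlbLimit = append(r.HugetlbLimit, &HugetlbLimit{Pagesize: pageSize, Limit: limit})
	}
	return r
}
```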
Kir Kolyshkin
59148e22d0 pkg/kubelet/cm: rm dup code
Commit ecd6361f added setting PidsLimit to Create and Update.

Commit bce9d5f2 added setting PidsLimit to m.toResources.

Now, PidsLimit is assigned twice.

Remove the duplicate.

Fixes: bce9d5f2
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-03-08 17:05:46 -08:00
Kir Kolyshkin
a673b64864 kubelet/cm: speed up cgroup creation
There's no need to call m.Update (which would create another instance of
the libcontainer cgroup manager, convert all the resources, and then set
them). All of this is already done here, except for Set().

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-03-08 17:05:46 -08:00
Kubernetes Prow Robot
422001df8b Merge pull request #108154 from klueska/fix-topology-manager
Update TopologyManager algorithm for selecting "best" non-preferred hint
2022-03-02 04:13:13 -08:00
Kevin Klues
e370b7335c Add extensive unit testing for TopologyManager hint generation algorithm
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2022-03-01 17:30:24 +00:00
Kevin Klues
99c57828ce Update TopologyManager algorithm for selecting "best" non-preferred hint
For the 'single-numa' and 'restricted' TopologyManager policies, pods are only
admitted if all of their containers have perfect alignment across the set of
resources they are requesting. The best-effort policy, on the other hand, will
prefer allocations that have perfect alignment, but fall back to a non-preferred
alignment if perfect alignment can't be achieved.

The existing algorithm of how to choose the best hint from the set of
"non-preferred" hints is fairly naive and often results in choosing a
sub-optimal hint. It works fine in cases where each resource would end up
coming from a single NUMA node (even if it's not the same NUMA node for
every resource), but breaks down as soon as multiple NUMA nodes are
required for the "best" alignment. We will never be able to achieve perfect
alignment with these non-preferred hints, but we should try to do something
more intelligent than simply choosing the hint with the narrowest mask.

In an ideal world, we would have the TopologyManager return a set of
"resources-relative" hints (as opposed to a common hint for all resources as is
done today). Each resource-relative hint would indicate how many other
resources could be aligned to it on a given NUMA node, and a hint provider
would use this information to allocate its resources in the most aligned way
possible. There are likely some edge cases to consider here, but such an
algorithm would allow us to do partial-perfect-alignment of "some" resources,
even if all resources could not be perfectly aligned.

Unfortunately, supporting something like this would require a major redesign to
how the TopologyManager interacts with its hint providers (as well as how those
hint providers make decisions based on the hints they get back).

That said, we can still do better than the naive algorithm we have today, and
this patch provides a mechanism to do so.

We start by looking at the set of hints passed into the TopologyManager for
each resource and generate a list of the minimum number of NUMA nodes required
to satisfy an allocation for a given resource. Each entry in this list then
contains the 'minNUMAAffinity.Count()' for a given resource. Once we have this
list, we find the *maximum* 'minNUMAAffinity.Count()' from the list and mark
that as the 'bestNonPreferredAffinityCount' that we would like to have
associated with whatever "bestHint" we ultimately generate. The intuition being
that we would like to (at the very least) get alignment for those resources
that *require* multiple NUMA nodes to satisfy their allocation. If we can't
quite get there, then we should try to come as close to it as possible.

Once we have this 'bestNonPreferredAffinityCount', the algorithm proceeds as
follows:

If the mergedHint and bestHint are both non-preferred, then try to find a hint
whose affinity count is as close to (but not higher than) the
bestNonPreferredAffinityCount as possible. To do this we need to consider the
following cases and react accordingly:

  1. bestHint.NUMANodeAffinity.Count() >  bestNonPreferredAffinityCount
  2. bestHint.NUMANodeAffinity.Count() == bestNonPreferredAffinityCount
  3. bestHint.NUMANodeAffinity.Count() <  bestNonPreferredAffinityCount

For case (1), the current bestHint is larger than the
bestNonPreferredAffinityCount, so updating to any narrower mergedHint is
preferred over staying where we are.

For case (2), the current bestHint is equal to the
bestNonPreferredAffinityCount, so we would like to stick with what we have
*unless* the current mergedHint is also equal to bestNonPreferredAffinityCount
and it is narrower.

For case (3), the current bestHint is less than bestNonPreferredAffinityCount,
so we would like to creep back up to bestNonPreferredAffinityCount as close as
we can. There are three cases to consider here:

  3a. mergedHint.NUMANodeAffinity.Count() >  bestNonPreferredAffinityCount
  3b. mergedHint.NUMANodeAffinity.Count() == bestNonPreferredAffinityCount
  3c. mergedHint.NUMANodeAffinity.Count() <  bestNonPreferredAffinityCount

For case (3a), we just want to stick with the current bestHint because choosing
a new hint that is greater than bestNonPreferredAffinityCount would be
counter-productive.

For case (3b), we want to immediately update bestHint to the current
mergedHint, making it now equal to bestNonPreferredAffinityCount.

For case (3c), we know that *both* the current bestHint and the current
mergedHint are less than bestNonPreferredAffinityCount, so we want to choose
one that brings us back up as close to bestNonPreferredAffinityCount as
possible. There are three cases to consider here:

  3ca. mergedHint.NUMANodeAffinity.Count() >  bestHint.NUMANodeAffinity.Count()
  3cb. mergedHint.NUMANodeAffinity.Count() <  bestHint.NUMANodeAffinity.Count()
  3cc. mergedHint.NUMANodeAffinity.Count() == bestHint.NUMANodeAffinity.Count()

For case (3ca), we want to immediately update bestHint to mergedHint because
that will bring us closer to the (higher) value of
bestNonPreferredAffinityCount.

For case (3cb), we want to stick with the current bestHint because choosing the
current mergedHint would strictly move us further away from the
bestNonPreferredAffinityCount.

Finally, for case (3cc), we know that the current bestHint and the current
mergedHint are equal, so we simply choose the narrower of the two.

This patch implements this algorithm for the case where we must choose from a
set of non-preferred hints and provides a set of unit-tests to verify its
correctness.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2022-03-01 14:38:26 +00:00
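
The case analysis above, transcribed into a Go sketch reduced to affinity bit counts (names are illustrative; the real implementation also compares the masks themselves for narrowness, which counts alone cannot express):

```go
package topologymanager

// betterNonPreferred reports whether a merged hint with mergedCount set
// bits should replace the current best hint with bestCount set bits,
// given the target bestNonPreferredAffinityCount. It applies only when
// both hints are non-preferred.
func betterNonPreferred(mergedCount, bestCount, target int) bool {
	switch {
	case bestCount > target:
		// Case 1: best overshoots the target, so any narrower merged
		// hint is preferred over staying put.
		return mergedCount < bestCount
	case bestCount == target:
		// Case 2: keep best unless merged also equals the target and has
		// a narrower mask; that mask comparison is invisible in the
		// counts alone, so this sketch conservatively keeps best.
		return false
	default:
		// Case 3: best undershoots the target; creep back up toward it.
		switch {
		case mergedCount > target:
			return false // 3a: overshooting would be counter-productive
		case mergedCount == target:
			return true // 3b: merged lands exactly on the target
		default:
			// 3c: both undershoot; take the larger count (3ca/3cb); on a
			// tie (3cc) the real code keeps the narrower of the two masks.
			return mergedCount > bestCount
		}
	}
}
```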