Commit Graph

162 Commits

Author SHA1 Message Date
Kevin Klues
347d5f57ac Move CPUManager ContainerMap to its own package 2019-12-11 22:59:00 +01:00
Kubernetes Prow Robot
73b2c82b28 Merge pull request #83592 from jianzzha/opt-reserved-cpus
added --reserved-cpus kubelet command option
2019-11-06 22:14:42 -08:00
Kubernetes Prow Robot
08e5781b41 Merge pull request #84525 from klueska/upstream-fix-hint-generation-after-kubelet-restart
Fix bug in TopologyManager hint generation after kubelet restart
2019-11-06 15:33:50 -08:00
Jianzhu Zhang
89dfd24483 added --reserved-cpus kubelet command option 2019-11-06 07:33:52 -05:00
yuxiaobo
81e9f21f83 Correct spelling mistakes
Signed-off-by: yuxiaobo <yuxiaobogo@163.com>
2019-11-06 20:25:19 +08:00
Kevin Klues
9dc116eb08 Ensure CPUManager TopologyHints are regenerated after kubelet restart
This patch also includes test to make sure the newly added logic works
as expected.
2019-11-05 15:48:51 +00:00
Kevin Klues
58f3554ebe Sync all CPU and device state before generating TopologyHints for them
This ensures that we have the most up-to-date state when generating
topology hints for a container. Without this, it's possible that some
resources will be seen as allocated, when they are actually free.
2019-11-05 13:00:20 +00:00
Kevin Klues
d9adf20360 Abstract removeStaleState from reconcileState in CPUManager
This will become especially important as we move to a model where
exclusive CPUs are assigned at pod admission time rather than at pod
creation time.

Having this function will allow us to do garbage collection on these
CPUs anytime we are about to allocate CPUs to a new set of containers,
in addition to reclaiming state periodically in the reconcileState()
loop.
2019-11-05 12:45:11 +00:00
Adrian Chiris
b17706b149 Added LessThan() and IsEqual() methods for TopologyHints 2019-11-04 18:43:07 +01:00
Kubernetes Prow Robot
17a57f99d5 Merge pull request #81344 from zouyee/cpm
fix cpumanager reconcileState without sourceready
2019-10-30 23:33:36 -07:00
Kubernetes Prow Robot
d60bda1971 Merge pull request #83043 from ConnorDoyle/cleanup-cpumanger-topo-hints
Delegate topology hint gen to CPU manager policy
2019-10-05 00:59:39 -07:00
Kevin Klues
d2b53af7d7 Add klueska as reviewer for CPUManager and devicemanager 2019-10-03 13:01:41 -07:00
Connor Doyle
389853894d Delegate topology hint gen to CPU manager policy
- The previous implementation depended on a fixed set of policies.
2019-09-27 22:29:02 -07:00
zouyee
b1f6974f7b using online instead to fix kubelet service failed with wrong number of possible NUMA nodes
Signed-off-by: Zou Nengren <zouyee1989@gmail.com>
2019-09-26 21:48:50 +08:00
zouyee
594fc0f4b9 fix cpumanager reconcileState without sourceready
Signed-off-by: Zou Nengren <zouyee1989@gmail.com>
2019-09-25 10:39:06 +08:00
Connor Doyle
e35301c19f Rename package socketmask to bitmask.
- As discussed in reviews and other public channels,
  this abstraction is used to represent numa nodes, not sockets.
- There is nothing inherently related to sockets in this package anyway.
2019-09-23 17:08:45 -07:00
Kubernetes Prow Robot
9165f7bf56 Merge pull request #82104 from klueska/upstream-fix-cpu-manager-topology-bug
Fix bug in CPUManager with setting topology for policies
2019-08-30 08:00:44 -07:00
Kevin Klues
eb0216e54e Update semantics to set Preferred field in TopologyHint generation
We now only set Preferred to true if resources can be allocated with a
size equal to the minimimum _possible_ mask when all resources are
available.
2019-08-29 14:32:10 -05:00
Kevin Klues
e0e8b3e4fd Update CPUManager topology helpers to accept multiple ids 2019-08-29 13:22:54 -05:00
Kevin Klues
ddfd9ac0ca Fix bug in CPUManager with setting topology for policies
Also add a check in the unit tests to avoid regressions
2019-08-28 17:32:25 -05:00
Kevin Klues
f4dbd29cdb Rename TopologyHint.SocketAffinity to TopologyHint.NUMANodeAffinity
As part of this, update the logic to use the NUMA information instead of
the Socket information when generating and consuming TopologyHints in
the CPUManager.
2019-08-27 16:51:05 -05:00
Kevin Klues
ecc14fe661 Update CPUManager to include NUMANodeID in CPUTopology
Unfortunately, the NUMA information is not readily available from
cadvisor, so we have to roll the logic to discover it by hand. In the
future, we should remove this custiom code to use the information
provided by cadvisor once it is made available.
2019-08-27 16:51:05 -05:00
Kevin Klues
869962fa48 Cache the discovered topology in the CPUManager instead of MachineInfo 2019-08-27 16:23:07 -05:00
Tim Allclair
a2c51674cf Cleanup more static check issues (S1*,ST*) 2019-08-21 10:40:21 -07:00
Tim Allclair
8a495cb5e4 Clean up error messages (ST1005) 2019-08-21 10:40:21 -07:00
Tim Allclair
6510d26b6a Fix misc static check issues 2019-08-21 10:40:21 -07:00
Tim Allclair
3f510c69f6 Remove dead code from pkg/kubelet/... 2019-08-21 10:40:21 -07:00
Kevin Klues
4fdd52b058 Update GetTopologyHints() API to return a map
At present, there is no way for a hint provider to return distinct hints
for different resource types via a call to GetTopologyHints(). This
means that hint providers that govern multiple resource types (e.g. the
devicemanager) must do some sort of "pre-merge" on the hints it
generates for each resource type before passing them back to the
TopologyManager.

This patch changes the GetTopologyHints() interface to allow a hint
provider to pass back raw hints for each resource type, and allow the
TopologyManager to merge them using a single unified strategy.

This change also allows the TopologyManager to recognize which
resource type a set of hints originated from, should this information
become useful in the future.
2019-08-16 08:06:12 +02:00
Kevin Klues
b3f4bed97f Add CPUManager tests for TopologyHint consumption 2019-08-14 06:22:56 +02:00
Kevin Klues
8278d1134c Consume TopologyHints in the CPUManager
Co-Authored-By: Conor Nolan <conor.nolan@intel.com>
2019-08-14 06:22:56 +02:00
Sreemanti Ghosh
7c626a2a00 Add CPUManager tests for TopologyHint generation
Co-Authored-By: Conor Nolan <conor.nolan@intel.com>
Co-Authored-By: Kevin Klues <kklues@nvidia.com>
2019-08-14 06:22:56 +02:00
Kevin Klues
156b3f6af8 Generate TopologyHints from the CPUManager 2019-08-14 06:22:56 +02:00
Conor Nolan
e33af11add Add stub support for TopologyManager to CPUManager
Co-Authored-By: Louise Daly <louise.m.daly@intel.com>
2019-08-07 15:56:05 +02:00
Kevin Klues
9f36f1a173 Add tests for proactive init Container removal in the CPUManager static policy 2019-07-26 14:34:51 +02:00
Kevin Klues
6a7db380de Add tests for new containertMap type in the CPUManager 2019-07-26 14:34:51 +02:00
Kevin Klues
c6d9bbcb74 Proactively remove init Containers in CPUManager static policy
This patch fixes a bug in the CPUManager, whereby it doesn't honor the
"effective requests/limits" of a Pod as defined by:

    https://kubernetes.io/docs/concepts/workloads/pods/init-containers/#resources

The rule states that a Pod’s "effective request/limit" for a resource
should be the larger of:
    * The highest of any particular resource request or limit
      defined on all init Containers
    * The sum of all app Containers request/limit for a
      resource

Moreover, the rule states that:
    * The effective QoS tier is the same for init Containers
      and app containers alike

This means that the resource requests of init Containers and app
Containers should be able to overlap, such that the larger of the two
becomes the "effective resource request/limit" for the Pod. Likewise,
if a QoS tier of "Guaranteed" is determined for the Pod, then both init
Containers and app Containers should run in this tier.

In its current implementation, the CPU manager honors the effective QoS
tier for both init and app containers, but doesn't honor the "effective
request/limit" correctly.

Instead, it treats the "effective request/limit" as:
    * The sum of all init Containers plus the sum of all app
      Containers request/limit for a resource

It does this by not proactively removing the CPUs given to previous init
containers when new containers are being created. In the worst case,
this causes the CPUManager to give non-overlapping CPUs to all
containers (whether init or app) in the "Guaranteed" QoS tier before any
of the containers in the Pod actually start.

This effectively blocks these Pods from running if the total number of
CPUs being requested across init and app Containers goes beyond the
limits of the system.

This patch fixes this problem by updating the CPUManager static policy
so that it proactively removes any guaranteed CPUs it has granted to
init Containers before allocating CPUs to app containers. Since all init
container are run sequentially, it also makes sure this proactive
removal happens for previous init containers when allocating CPUs to
later ones.
2019-07-26 14:34:51 +02:00
Kevin Klues
5dc5f1de06 Update the cpumanager to error out if an invalid policy is given
Previously, the cpumanager would simply fall back to the None() policy
if an invalid policy was specified. This patch updates this to return an
error when an invalid policy is passed, forcing the kubelet to fail
fast when this occurs.

These semantics should be preferable because an invalid policy likely
indicates operator error in setting the policy flag on the kubelet
correctly (e.g. misspelling 'static' as 'statiic'). In this case it is
better to fail fast so the operator can detect this and correct the
mistake, than to mask the error and essentially disable the cpumanager
unexpectedly.
2019-07-18 13:24:09 +02:00
Kubernetes Prow Robot
b276043051 Merge pull request #77421 from tedyu/cpu-free-no-sort
Obtain unsorted slice in cpuAccumulator#freeCores
2019-05-16 16:26:53 -07:00
Kubernetes Prow Robot
b4211dea98 Merge pull request #77422 from tedyu/policy-set-union
Union all CPUSets in one round
2019-05-06 14:02:05 -07:00
Ted Yu
e967c37068 Union all CPUSets in one round 2019-05-03 14:40:33 -07:00
Ted Yu
f83bac61a4 Obtain unsorted slice in cpuAccumulator#freeCores 2019-05-03 14:07:47 -07:00
Kubernetes Prow Robot
98c4c1e2d8 Merge pull request #77291 from tedyu/cpu-pod-stat
Query pod status outside loop over containers
2019-05-01 23:28:56 -07:00
Kubernetes Prow Robot
a5a70b4de3 Merge pull request #74859 from ahadas/static_policy
kubelet/cm: code optimization for the static policy
2019-05-01 23:28:19 -07:00
Ted Yu
3fc16a7e82 Log pod name when pod status cannot be queried 2019-05-01 15:01:56 -07:00
Ted Yu
66ce52578a Query pod status outside loop over containers 2019-04-30 19:35:32 -07:00
Kevin Klues
ef27f5f1a5 Add ability to find init Container IDs in cpumanager reconcileState()
The cpumanager loops through all init Containers and app Containers when
reconciling its state. However, the current implementation of
findContainerIDByName(), which is call by the reconciler, does not
resolve for init Containers.

This patch updates findContainerIDByName() to account for init
Containers and adds a regression test that fails before the change and
succeeds after.
2019-04-27 06:18:55 -07:00
Davanum Srinivas
33081c1f07 New staging repository for cri-api
Change-Id: I2160b0b0ec4b9870a2d4452b428e395bbe12afbb
2019-03-26 18:21:04 -04:00
Arik Hadas
4a47148afe kubelet/cm: fix test description
Signed-off-by: Arik Hadas <ahadas@redhat.com>
2019-03-07 21:23:15 +02:00
Arik Hadas
26e1c1cee7 kubelet/cm: code optimization for the static policy
Minor optimization in the code that attempts to assign whole
sockets/cores in case the number of CPUs requested is higher
than CPUs-per-socket/core: check if the number of requested
CPUs is higher than CPUs-per-socket/core before retrieving
and iterating the free sockets/cores, and break the loops
when that is no longer the case.

Signed-off-by: Arik Hadas <ahadas@redhat.com>
2019-03-07 21:23:15 +02:00
Arik Hadas
c3a533e5b2 Cleanup in topology.go
1. Find the minimal thread number within a core using a
single loop rather than by sorting the thread numbers.

2. Inline getUniqueCoreID#err and Discover#numCPUs variables.

3. Narrow the scope of Discover#coreID and Discover#err variables.

Signed-off-by: Arik Hadas <ahadas@redhat.com>
2019-02-14 16:55:37 +02:00