Commit Graph

89 Commits

Author SHA1 Message Date
Tim Allclair
a2c51674cf Cleanup more static check issues (S1*,ST*) 2019-08-21 10:40:21 -07:00
Tim Allclair
8a495cb5e4 Clean up error messages (ST1005) 2019-08-21 10:40:21 -07:00
Tim Allclair
6510d26b6a Fix misc static check issues 2019-08-21 10:40:21 -07:00
Tim Allclair
3f510c69f6 Remove dead code from pkg/kubelet/... 2019-08-21 10:40:21 -07:00
Kevin Klues
4fdd52b058 Update GetTopologyHints() API to return a map
At present, there is no way for a hint provider to return distinct hints
for different resource types via a call to GetTopologyHints(). This
means that hint providers that govern multiple resource types (e.g. the
devicemanager) must do some sort of "pre-merge" on the hints it
generates for each resource type before passing them back to the
TopologyManager.

This patch changes the GetTopologyHints() interface to allow a hint
provider to pass back raw hints for each resource type, and allow the
TopologyManager to merge them using a single unified strategy.

This change also allows the TopologyManager to recognize which
resource type a set of hints originated from, should this information
become useful in the future.
2019-08-16 08:06:12 +02:00
Kevin Klues
b3f4bed97f Add CPUManager tests for TopologyHint consumption 2019-08-14 06:22:56 +02:00
Kevin Klues
8278d1134c Consume TopologyHints in the CPUManager
Co-Authored-By: Conor Nolan <conor.nolan@intel.com>
2019-08-14 06:22:56 +02:00
Sreemanti Ghosh
7c626a2a00 Add CPUManager tests for TopologyHint generation
Co-Authored-By: Conor Nolan <conor.nolan@intel.com>
Co-Authored-By: Kevin Klues <kklues@nvidia.com>
2019-08-14 06:22:56 +02:00
Kevin Klues
156b3f6af8 Generate TopologyHints from the CPUManager 2019-08-14 06:22:56 +02:00
Conor Nolan
e33af11add Add stub support for TopologyManager to CPUManager
Co-Authored-By: Louise Daly <louise.m.daly@intel.com>
2019-08-07 15:56:05 +02:00
Kevin Klues
9f36f1a173 Add tests for proactive init Container removal in the CPUManager static policy 2019-07-26 14:34:51 +02:00
Kevin Klues
6a7db380de Add tests for new containertMap type in the CPUManager 2019-07-26 14:34:51 +02:00
Kevin Klues
c6d9bbcb74 Proactively remove init Containers in CPUManager static policy
This patch fixes a bug in the CPUManager, whereby it doesn't honor the
"effective requests/limits" of a Pod as defined by:

    https://kubernetes.io/docs/concepts/workloads/pods/init-containers/#resources

The rule states that a Pod’s "effective request/limit" for a resource
should be the larger of:
    * The highest of any particular resource request or limit
      defined on all init Containers
    * The sum of all app Containers request/limit for a
      resource

Moreover, the rule states that:
    * The effective QoS tier is the same for init Containers
      and app containers alike

This means that the resource requests of init Containers and app
Containers should be able to overlap, such that the larger of the two
becomes the "effective resource request/limit" for the Pod. Likewise,
if a QoS tier of "Guaranteed" is determined for the Pod, then both init
Containers and app Containers should run in this tier.

In its current implementation, the CPU manager honors the effective QoS
tier for both init and app containers, but doesn't honor the "effective
request/limit" correctly.

Instead, it treats the "effective request/limit" as:
    * The sum of all init Containers plus the sum of all app
      Containers request/limit for a resource

It does this by not proactively removing the CPUs given to previous init
containers when new containers are being created. In the worst case,
this causes the CPUManager to give non-overlapping CPUs to all
containers (whether init or app) in the "Guaranteed" QoS tier before any
of the containers in the Pod actually start.

This effectively blocks these Pods from running if the total number of
CPUs being requested across init and app Containers goes beyond the
limits of the system.

This patch fixes this problem by updating the CPUManager static policy
so that it proactively removes any guaranteed CPUs it has granted to
init Containers before allocating CPUs to app containers. Since all init
container are run sequentially, it also makes sure this proactive
removal happens for previous init containers when allocating CPUs to
later ones.
2019-07-26 14:34:51 +02:00
Kevin Klues
5dc5f1de06 Update the cpumanager to error out if an invalid policy is given
Previously, the cpumanager would simply fall back to the None() policy
if an invalid policy was specified. This patch updates this to return an
error when an invalid policy is passed, forcing the kubelet to fail
fast when this occurs.

These semantics should be preferable because an invalid policy likely
indicates operator error in setting the policy flag on the kubelet
correctly (e.g. misspelling 'static' as 'statiic'). In this case it is
better to fail fast so the operator can detect this and correct the
mistake, than to mask the error and essentially disable the cpumanager
unexpectedly.
2019-07-18 13:24:09 +02:00
Kubernetes Prow Robot
b276043051 Merge pull request #77421 from tedyu/cpu-free-no-sort
Obtain unsorted slice in cpuAccumulator#freeCores
2019-05-16 16:26:53 -07:00
Kubernetes Prow Robot
b4211dea98 Merge pull request #77422 from tedyu/policy-set-union
Union all CPUSets in one round
2019-05-06 14:02:05 -07:00
Ted Yu
e967c37068 Union all CPUSets in one round 2019-05-03 14:40:33 -07:00
Ted Yu
f83bac61a4 Obtain unsorted slice in cpuAccumulator#freeCores 2019-05-03 14:07:47 -07:00
Kubernetes Prow Robot
98c4c1e2d8 Merge pull request #77291 from tedyu/cpu-pod-stat
Query pod status outside loop over containers
2019-05-01 23:28:56 -07:00
Kubernetes Prow Robot
a5a70b4de3 Merge pull request #74859 from ahadas/static_policy
kubelet/cm: code optimization for the static policy
2019-05-01 23:28:19 -07:00
Ted Yu
3fc16a7e82 Log pod name when pod status cannot be queried 2019-05-01 15:01:56 -07:00
Ted Yu
66ce52578a Query pod status outside loop over containers 2019-04-30 19:35:32 -07:00
Kevin Klues
ef27f5f1a5 Add ability to find init Container IDs in cpumanager reconcileState()
The cpumanager loops through all init Containers and app Containers when
reconciling its state. However, the current implementation of
findContainerIDByName(), which is call by the reconciler, does not
resolve for init Containers.

This patch updates findContainerIDByName() to account for init
Containers and adds a regression test that fails before the change and
succeeds after.
2019-04-27 06:18:55 -07:00
Davanum Srinivas
33081c1f07 New staging repository for cri-api
Change-Id: I2160b0b0ec4b9870a2d4452b428e395bbe12afbb
2019-03-26 18:21:04 -04:00
Arik Hadas
4a47148afe kubelet/cm: fix test description
Signed-off-by: Arik Hadas <ahadas@redhat.com>
2019-03-07 21:23:15 +02:00
Arik Hadas
26e1c1cee7 kubelet/cm: code optimization for the static policy
Minor optimization in the code that attempts to assign whole
sockets/cores in case the number of CPUs requested is higher
than CPUs-per-socket/core: check if the number of requested
CPUs is higher than CPUs-per-socket/core before retrieving
and iterating the free sockets/cores, and break the loops
when that is no longer the case.

Signed-off-by: Arik Hadas <ahadas@redhat.com>
2019-03-07 21:23:15 +02:00
Arik Hadas
c3a533e5b2 Cleanup in topology.go
1. Find the minimal thread number within a core using a
single loop rather than by sorting the thread numbers.

2. Inline getUniqueCoreID#err and Discover#numCPUs variables.

3. Narrow the scope of Discover#coreID and Discover#err variables.

Signed-off-by: Arik Hadas <ahadas@redhat.com>
2019-02-14 16:55:37 +02:00
Roy Lenferink
b43c04452f Updated OWNERS files to include link to docs 2019-02-04 22:33:12 +01:00
ailusazh
10995f661d clean containers in reconcileState of cpuManager 2019-01-15 16:09:28 +08:00
yanghaichao12
982d1778f8 Fix comment error of 'cpuManagerStateFileName' 2018-11-19 08:07:04 -05:00
Davanum Srinivas
954996e231 Move from glog to klog
- Move from the old github.com/golang/glog to k8s.io/klog
- klog as explicit InitFlags() so we add them as necessary
- we update the other repositories that we vendor that made a similar
change from glog to klog
  * github.com/kubernetes/repo-infra
  * k8s.io/gengo/
  * k8s.io/kube-openapi/
  * github.com/google/cadvisor
- Entirely remove all references to glog
- Fix some tests by explicit InitFlags in their init() methods

Change-Id: I92db545ff36fcec83afe98f550c9e630098b3135
2018-11-10 07:50:31 -05:00
choury
36b92b9b29 cpumanager: rollback state if updateContainerCPUSet failed 2018-08-17 18:08:58 +08:00
Lin Yang
b7e1f0bf17 kubelet/cm/cpumanager: Fix unused variable "skipIfPermissionsError"
The variable "skipIfPermissionsError" is not needed even when
permission error happened.
2018-08-02 17:24:33 -07:00
Ismo Puustinen
3bb5ca9257 cpumanager: add test for available CPUs in static policy.
Test the cases where the number of CPUs available in the system is
smaller or larger than the number of CPUs known in the state, which
should lead to a panic. This covers both CPU onlining and offlining. The
case where the number of CPUs matches is already covered by the
"non-corrupted state" test.
2018-07-31 10:20:37 +03:00
Ismo Puustinen
4f604eb73c cpumanager: validate topology in static policy.
This patch adds a check for the static policy state validation. The
check fails if the CPU topology obtained from cadvisor doesn't match
with the current topology in the state file.

If the CPU topology has changed in a node, cpu manager static policy
might try to assign non-present cores to containers.

For example in my test case, static policy had the default CPU set of
0-1,4-7. Then kubelet was shut down and CPU 7 was offlined. After
restarting the kubelet, CPU manager tries to assign the non-existent CPU
7 to containers which don't have exclusive allocations assigned to them:

 Error response from daemon: Requested CPUs are not available - requested 0-1,4-7, available: 0-6)

This breaks the exclusivity, since the CPUs from the shared pool don't
get assigned to non-exclusive containers, meaning that they can execute
on the exclusive CPUs.
2018-07-30 08:49:13 +03:00
choury
8e4b62a74b Remove duplicate check line
There is a same [line](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/policy_static.go#L81).
2018-07-05 11:07:56 +08:00
Kubernetes Submit Queue
991a84758f Merge pull request #59214 from kdembler/cpumanager-checkpointing
Automatic merge from submit-queue (batch tested with PRs 59214, 65330). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Migrate cpumanager to use checkpointing manager

**What this PR does / why we need it**:
This PR migrates `cpumanager` to use new kubelet level node checkpointing feature (#56040) to decrease code redundancy and improve consistency.

**Which issue(s) this PR fixes**:
Fixes #58339

**Notes**:
At point of submitting PR the most straightforward approach was used - `state_checkpoint` implementation of `State` interface was added. However, with checkpointing implementation there might be no point to keep `State` interface and just use single implementation with checkpoint backend and in case of different backend than filestore needed just supply `cpumanager` with custom `CheckpointManager` implementation.

/kind feature
/sig node
cc @flyingcougar @ConnorDoyle
2018-06-25 18:19:00 -07:00
Jeff Grafton
23ceebac22 Run hack/update-bazel.sh 2018-06-22 16:22:57 -07:00
Klaudiusz Dembler
a9df2acc4b Typo fix 2018-06-07 12:08:48 +02:00
Klaudiusz Dembler
9384937f2f Update bazel 2018-05-21 17:39:51 +02:00
Klaudiusz Dembler
de1063bc7d Add compatibility tests 2018-05-21 14:50:31 +02:00
Klaudiusz Dembler
3d09101b6f Add docstrings 2018-05-21 11:40:04 +02:00
Klaudiusz Dembler
aa325ec2d9 Change JSON letter case in tests 2018-05-15 18:43:48 +02:00
Klaudiusz Dembler
7bb047ec75 Rebase and backward compatibility 2018-05-15 18:34:53 +02:00
Klaudiusz Dembler
ba8d82c96a Update error indicating unexistent checkpoint 2018-05-14 09:51:27 +02:00
Klaudiusz Dembler
0b1a73e94b Make cpuManagerCheckpoint exported 2018-05-14 09:51:27 +02:00
Klaudiusz Dembler
cc3fa67bda Add comments to MockCheckpoint functions and gofmt 2018-05-14 09:51:27 +02:00
Klaudiusz Dembler
0fbd19bc06 Tweaks 2018-05-14 09:51:26 +02:00
Klaudiusz Dembler
3991ed5d2f Add tests 2018-05-14 09:51:26 +02:00
Klaudiusz Dembler
6bfceed4ab Migrate cpumanager to use checkpointing manager 2018-05-14 09:45:58 +02:00