791 Commits

Author SHA1 Message Date
Kubernetes Prow Robot
c64f81d082
Merge pull request #78653 from sjenning/add-sjenning-owners
kubelet: add sjenning to kubelet subdirectory owners files
2019-06-25 14:47:15 -07:00
nolancon
2d7ac702d6 Add Policy None for Topology Manager
Update naming of test functions.
2019-06-25 03:24:31 +01:00
rafatio
08c258add9 Ignore cgroup pid support if related feature gates are disabled 2019-06-15 18:45:27 -03:00
Kubernetes Prow Robot
d30fbab4b8
Merge pull request #77915 from SataQiu/fix-golint-util-20190515
Fix golint failures of pkg/util/parsers pkg/util/sysctl pkg/util/system
2019-06-14 00:29:00 -07:00
mattjmcnaughton
5539e61032
Fix reserved cgroup systemd
Fix an issue in which, when trying to specify the `--kube-reserved-cgroup`
(or `--system-reserved-cgroup`) with `--cgroup-driver=systemd`, we will
not properly convert the `systemd` cgroup name into the internal cgroup
name that k8s expects. Without this change, specifying
`--kube-reserved-cgroup=/test.slice --cgroup-driver=systemd` will fail,
and only `--kube-reserved-cgroup=/test --cgroup-driver=systemd` will succeed,
even if the actual cgroup existing on the host is `/test.slice`.

Additionally, add light unit testing of our process for converting from a
systemd cgroup name to the Kubernetes internal cgroup name.
2019-06-07 10:48:42 -04:00
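The name conversion that commit 5539e61032 describes can be illustrated with a minimal sketch. The helper `systemdToInternal` is hypothetical and is not the kubelet's own function; it assumes that the internal name is a list of path components, that the `.slice` suffix is dropped, and that `-` encodes nesting in a systemd slice name.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// systemdToInternal is a hypothetical helper (not the kubelet API).
// It keeps the last path element, drops the ".slice" suffix, and splits
// on "-", which systemd uses to encode nested slices.
func systemdToInternal(name string) []string {
	base := strings.TrimSuffix(path.Base(name), ".slice")
	switch base {
	case "", ".", "/", "-":
		return nil // treat these as the root cgroup
	}
	return strings.Split(base, "-")
}

func main() {
	fmt.Println(systemdToInternal("/test.slice"))
	fmt.Println(systemdToInternal("/kubepods.slice/kubepods-burstable.slice"))
}
```

Under these assumptions `/test.slice` maps to the single component `test`, which is the case the commit message says used to fail.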
Seth Jennings
89dc2c65e4 kubelet: add sjenning to kubelet subdirectory owners files 2019-06-03 08:26:24 -05:00
Alexander Kanevskiy
89481f8c27 Use go standard library for common bit operations
PR#72913 introduced its own versions of the bit operations that are
less efficient than the ones from the standard library.
2019-06-01 19:54:38 +03:00
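As an illustration of the point made in commit 89481f8c27, the sketch below compares a hand-rolled bit count with the standard library's `math/bits.OnesCount64`. The function names `countBitsLoop` and `countBitsStd` are made up for this example and do not come from the SocketMask code.

```go
package main

import (
	"fmt"
	"math/bits"
)

// countBitsLoop is a hand-rolled population count, shown only for comparison.
func countBitsLoop(mask uint64) int {
	n := 0
	for mask != 0 {
		n += int(mask & 1)
		mask >>= 1
	}
	return n
}

// countBitsStd delegates to the standard library.
func countBitsStd(mask uint64) int {
	return bits.OnesCount64(mask)
}

func main() {
	var mask uint64 = 0xB // bits 0, 1 and 3 are set
	fmt.Println(countBitsLoop(mask), countBitsStd(mask))
}
```

Both functions return the same count; the standard library version is the one that can be compiled down to a dedicated CPU instruction where available.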
Kubernetes Prow Robot
9ac58bae56
Merge pull request #78515 from klueska/upstream-socketmask-updates
Updates to the SocketMask abstraction for the TopologyManager
2019-06-01 09:50:16 -07:00
Kubernetes Prow Robot
46c74629cf
Merge pull request #78516 from klueska/upstream-topology-manager-interface-updates
Update the TopologyManager interfaces
2019-06-01 08:00:19 -07:00
Kubernetes Prow Robot
fe37733a12
Merge pull request #73891 from taragu/plugin-manager
Add kubelet plugin manager
2019-05-31 07:12:29 -07:00
Kubernetes Prow Robot
f49fe2a750
Merge pull request #72787 from dashpole/cadvisor_prefix_whitelist
Only collect metrics for cgroups required by the summary API
2019-05-31 00:28:26 -07:00
Ted Yu
1a755d13a6 Remove unnecessary string() 2019-05-30 19:48:26 -07:00
Tara Gu
5e18554442 Implement plugin manager - a controller that manages plugin registration/unregistration 2019-05-30 19:00:59 -04:00
Kevin Klues
0a43d21c26 Add IsNarrowerThan() function to socketmask abstraction 2019-05-30 06:00:22 -07:00
Kevin Klues
617a1fa394 Update the TopologyManager interfaces
These updates are based on discussions had about the preferred semantics
of the TopologyManager and will be reflected in changes to an upcoming
PR that adds the actual TopologyManager implementation.
2019-05-30 05:52:11 -07:00
Kevin Klues
cdb59d3c7a Fix incorrect names for tests in socketmask 2019-05-30 04:16:53 -07:00
nolancon
0244c0e658 remove dependency on implementation from policy preferred and strict
update build
2019-05-30 05:57:39 +01:00
nolancon
ef9baf313d Update unit tests for TopologyHints - Topology Manager Policies 2019-05-30 05:44:01 +01:00
nolancon
e82fa41fb2 More Intuitive TopologyHints - topology manager policies 2019-05-30 05:44:01 +01:00
Sreemanti Ghosh
4e503597b8 Unit test for Topology Manager policy_strict and policy_preferred 2019-05-30 05:44:01 +01:00
nolancon
eff568e496 Add Policies Strict and Preferred for Topology Manager 2019-05-30 05:44:01 +01:00
Ted Yu
c46ec66a1f Avoid unnecessary concatenation of errors 2019-05-29 17:25:53 -07:00
lmdaly
c1a4457573 Update Bazel files to include SocketMask 2019-05-29 02:21:51 +01:00
Conor Nolan
d99bac12e6 Update Remove/AddPod to Container (#26)
More intuitive TopologyHints
2019-05-29 02:11:15 +01:00
lmdaly
e64c558a11 Added BUILD files and updates to Boilerplates 2019-05-29 02:11:15 +01:00
lmdaly
71bbc6d538 Add Topology Manager Interfaces
*Topology Manager
*Policy
2019-05-29 02:10:46 +01:00
Kubernetes Prow Robot
3b4473f45a
Merge pull request #72913 from nolancon/topology-manager-socket-mask
Add Socket Mask for Topology Manager
2019-05-28 10:58:41 -07:00
nolancon
b7f6b8f8f1 Updated unit test for socketmask 2019-05-28 05:00:04 +01:00
nolancon
283dff9335 Update SocketMask based on feedback
TODO: Unit tests to be updated
2019-05-27 07:19:03 +01:00
Richard Chen
c9f1b57b5b Reset extended resources only when node is recreated. 2019-05-21 14:16:54 -07:00
Kubernetes Prow Robot
e476a60ccb
Merge pull request #73241 from vikaschoudhary16/selinux-label
Add correct selinux label at plugin socket directory
2019-05-20 11:07:17 -07:00
vikaschoudhary16
58d1b4d564 Add correct selinux label at plugin socket directory 2019-05-18 12:35:17 +05:30
Kubernetes Prow Robot
3c02a38fdc
Merge pull request #77609 from tedyu/union-all-test
Add test for CPUSet#UnionAll
2019-05-16 20:39:26 -07:00
Kubernetes Prow Robot
b276043051
Merge pull request #77421 from tedyu/cpu-free-no-sort
Obtain unsorted slice in cpuAccumulator#freeCores
2019-05-16 16:26:53 -07:00
Ted Yu
52f797188f Add test for CPUSet#UnionAll
Signed-off-by: Ted Yu <yute@vmware.com>
2019-05-16 12:13:33 -07:00
SataQiu
b36d8d431f fix golint failures of pkg/util/parsers pkg/util/sysctl pkg/util/system 2019-05-15 23:19:47 +08:00
nolancon
e8566caa3f Update to unit test and comment bug fixed 2019-05-13 06:41:44 +01:00
David Ashpole
f8dff6bd5b only collect metrics for cgroups required by the summary API 2019-05-10 12:12:41 -07:00
Andrew Kim
c919139245 update import of generic featuregate code from k8s.io/apiserver/pkg/util/feature -> k8s.io/component-base/featuregate 2019-05-08 10:01:50 -04:00
nolancon
7c525ffaa8 More intuitive TopologyHints - socketmask.go 2019-05-08 04:22:39 +01:00
Kubernetes Prow Robot
b4211dea98
Merge pull request #77422 from tedyu/policy-set-union
Union all CPUSets in one round
2019-05-06 14:02:05 -07:00
Ted Yu
e967c37068 Union all CPUSets in one round 2019-05-03 14:40:33 -07:00
Ted Yu
f83bac61a4 Obtain unsorted slice in cpuAccumulator#freeCores 2019-05-03 14:07:47 -07:00
Ted Yu
89c8a91c0f Check error return from Update
Signed-off-by: Ted Yu <yute@vmware.com>
2019-05-02 09:56:40 -07:00
Kubernetes Prow Robot
98c4c1e2d8
Merge pull request #77291 from tedyu/cpu-pod-stat
Query pod status outside loop over containers
2019-05-01 23:28:56 -07:00
Kubernetes Prow Robot
a5a70b4de3
Merge pull request #74859 from ahadas/static_policy
kubelet/cm: code optimization for the static policy
2019-05-01 23:28:19 -07:00
Ted Yu
3fc16a7e82 Log pod name when pod status cannot be queried 2019-05-01 15:01:56 -07:00
Ted Yu
66ce52578a Query pod status outside loop over containers 2019-04-30 19:35:32 -07:00
Kevin Klues
ef27f5f1a5 Add ability to find init Container IDs in cpumanager reconcileState()
The cpumanager loops through all init Containers and app Containers when
reconciling its state. However, the current implementation of
findContainerIDByName(), which is called by the reconciler, does not
resolve for init Containers.

This patch updates findContainerIDByName() to account for init
Containers and adds a regression test that fails before the change and
succeeds after.
2019-04-27 06:18:55 -07:00
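The lookup change in commit ef27f5f1a5 might look roughly like the sketch below. The types `containerStatus` and `podStatus` and the function `findContainerID` are simplified stand-ins invented for this example; the real kubelet code works on the Kubernetes API types.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the pod status types.
type containerStatus struct {
	Name        string
	ContainerID string
}

type podStatus struct {
	InitContainerStatuses []containerStatus
	ContainerStatuses     []containerStatus
}

// findContainerID searches init containers as well as app containers.
func findContainerID(status podStatus, name string) (string, error) {
	all := append([]containerStatus{}, status.InitContainerStatuses...)
	all = append(all, status.ContainerStatuses...)
	for _, c := range all {
		if c.Name == name {
			return c.ContainerID, nil
		}
	}
	return "", fmt.Errorf("unable to find ID for container %q", name)
}

func main() {
	status := podStatus{
		InitContainerStatuses: []containerStatus{{Name: "setup", ContainerID: "id-setup"}},
		ContainerStatuses:     []containerStatus{{Name: "app", ContainerID: "id-app"}},
	}
	fmt.Println(findContainerID(status, "setup"))
}
```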
WanLinghao
62d8081eda Fix a log info error 2019-03-29 13:27:10 +08:00
Davanum Srinivas
33081c1f07
New staging repository for cri-api
Change-Id: I2160b0b0ec4b9870a2d4452b428e395bbe12afbb
2019-03-26 18:21:04 -04:00
Arik Hadas
4a47148afe kubelet/cm: fix test description
Signed-off-by: Arik Hadas <ahadas@redhat.com>
2019-03-07 21:23:15 +02:00
Arik Hadas
26e1c1cee7 kubelet/cm: code optimization for the static policy
Minor optimization in the code that attempts to assign whole
sockets/cores in case the number of CPUs requested is higher
than CPUs-per-socket/core: check if the number of requested
CPUs is higher than CPUs-per-socket/core before retrieving
and iterating the free sockets/cores, and break the loops
when that is no longer the case.

Signed-off-by: Arik Hadas <ahadas@redhat.com>
2019-03-07 21:23:15 +02:00
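The optimization described in commit 26e1c1cee7 can be sketched as follows. `takeWholeSockets` and the `listFree` callback are hypothetical names; the callback stands in for the lookup of free sockets, and the `>=` style comparison is an assumption of this sketch rather than something the commit message states.

```go
package main

import "fmt"

// takeWholeSockets is a hypothetical helper. listFree stands in for the
// lookup of free sockets that the optimization tries to avoid.
func takeWholeSockets(needed, cpusPerSocket int, listFree func() []int) ([]int, int) {
	// Check the request size first so listFree is never called needlessly.
	if needed < cpusPerSocket {
		return nil, needed
	}
	var taken []int
	for _, socket := range listFree() {
		taken = append(taken, socket)
		needed -= cpusPerSocket
		// Stop as soon as a whole socket no longer fits the remaining request.
		if needed < cpusPerSocket {
			break
		}
	}
	return taken, needed
}

func main() {
	free := func() []int { return []int{0, 1} }
	taken, rest := takeWholeSockets(5, 4, free)
	fmt.Println(taken, rest)
}
```

The same pattern would apply one level down, with cores in place of sockets.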
Sreemanti-Ghosh
ce56956409 Socket mask unit test (#4) 2019-03-05 08:00:04 +00:00
nolancon
a273333f1f Add BUILD files and Boilerplates
Updates based on comments
* Export comments added
* glog changed to klog
* Other small edits
2019-03-05 07:59:51 +00:00
nolancon
f10e76962f Add Socket Mask for Topology Manager 2019-03-01 07:20:47 +00:00
Kubernetes Prow Robot
4b1282d925
Merge pull request #74016 from ahadas/topology_cleanup
Cleanup in topology.go
2019-02-27 22:49:24 -08:00
danielqsj
79a3eb816c rename latency to duration in metrics 2019-02-18 17:40:04 +08:00
danielqsj
9fd99a48f5 Change kubelet metrics to conform guideline 2019-02-18 14:01:58 +08:00
Kubernetes Prow Robot
c88dcee3e9
Merge pull request #73824 from jiayingz/reallocate
Checks whether we have cached runtime state before starting a container
2019-02-15 20:35:30 -08:00
Arik Hadas
c3a533e5b2 Cleanup in topology.go
1. Find the minimal thread number within a core using a
single loop rather than by sorting the thread numbers.

2. Inline getUniqueCoreID#err and Discover#numCPUs variables.

3. Narrow the scope of Discover#coreID and Discover#err variables.

Signed-off-by: Arik Hadas <ahadas@redhat.com>
2019-02-14 16:55:37 +02:00
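Item 1 of commit c3a533e5b2 replaces a sort with a single pass. A minimal sketch of that idea, using the hypothetical function name `lowestThread`:

```go
package main

import (
	"errors"
	"fmt"
)

// lowestThread returns the smallest thread number in one pass, without
// sorting. The name is hypothetical, chosen for this example.
func lowestThread(threads []int) (int, error) {
	if len(threads) == 0 {
		return 0, errors.New("no threads given")
	}
	lowest := threads[0]
	for _, t := range threads[1:] {
		if t < lowest {
			lowest = t
		}
	}
	return lowest, nil
}

func main() {
	fmt.Println(lowestThread([]int{6, 2, 10}))
}
```

A single loop is linear in the number of threads and leaves the input slice untouched, whereas sorting does more work and reorders the slice in place.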
Kubernetes Prow Robot
888ff4097a
Merge pull request #73651 from RobertKrawitz/node_pids_limit
Support total process ID limiting for nodes
2019-02-13 17:31:18 -08:00
Robert Krawitz
2597a1d97e Implement SupportNodePidsLimit, hand-tested 2019-02-13 14:56:17 -05:00
Kubernetes Prow Robot
b50c643be0
Merge pull request #73540 from rlenferink/patch-5
Updated OWNERS files to include link to docs
2019-02-08 09:05:56 -08:00
Jiaying Zhang
00b88c14b0 Checks whether we have cached runtime state before starting a container
that requests any device plugin resource. If not, re-issue Allocate
grpc calls. This allows us to handle the edge case that a pod got
assigned to a node even before it populates its extended resource
capacity.
2019-02-07 11:12:36 -08:00
Kubernetes Prow Robot
dc1244c6cd
Merge pull request #72785 from derekwaynecarr/hugepages-ga
Graduate HugePages feature to GA
2019-02-05 13:56:51 -08:00
Roy Lenferink
b43c04452f Updated OWNERS files to include link to docs 2019-02-04 22:33:12 +01:00
Kubernetes Prow Robot
03b434c9d4
Merge pull request #58122 from tianshapjq/nit-int-is-enough
Len() is already int
2019-02-03 12:02:24 -08:00
Derek Carr
deae071d78 Graduate HugePages feature to GA 2019-02-02 00:21:10 -05:00
Andrew Kim
84191eb99b replace pkg/util/file with k8s.io/utils/path 2019-01-29 15:20:13 -05:00
Bernhard Altendorfer
736f35ec29 Fix golint failures 2019-01-24 00:14:25 +01:00
David Ashpole
2b8bc85f75 fix panic in NodeAllocatable node e2e test 2019-01-17 10:57:09 -08:00
ailusazh
10995f661d clean containers in reconcileState of cpuManager 2019-01-15 16:09:28 +08:00
Kubernetes Prow Robot
0dbc99719a
Merge pull request #72076 from derekwaynecarr/pid-limiting
SupportPodPidsLimit feature beta with tests
2019-01-10 01:18:30 -08:00
Kubernetes Prow Robot
d88994cf9f
Merge pull request #71306 from ping035627/k8s-181121
fix some typos
2019-01-09 09:06:31 -08:00
Derek Carr
bce9d5f204 SupportPodPidsLimit feature beta with tests 2019-01-09 10:50:59 -05:00
Kubernetes Prow Robot
4e8bea4bb7
Merge pull request #71194 from yanghaichao12/dev1119-1
Fix comment error of 'cpuManagerStateFileName'
2018-12-17 20:28:19 -08:00
yuexiao-wang
7b6f60f085 modify BUILD
Signed-off-by: yuexiao-wang <wang.yuexiao@zte.com.cn>
2018-12-11 11:22:06 +08:00
yuexiao-wang
f3353c358d [scheduler cleanup phase 2]: Rename to
Signed-off-by: yuexiao-wang <wang.yuexiao@zte.com.cn>
2018-12-11 11:21:12 +08:00
k8s-ci-robot
79e5cb2cb7
Merge pull request #71302 from liggitt/verify-unit-test-feature-gates
Split mutable and read-only access to feature gates, limit tests to readonly access
2018-11-29 21:45:12 -08:00
saad-ali
a7c5582bba Permit use of deprecated dir in device plugin. 2018-11-21 18:37:31 -08:00
saad-ali
8f666d9e41 Modify kubelet watcher to support old versions
Modify kubelet plugin watcher to support older CSI drivers that use
the old plugins directory for socket registration.
Also modify CSI plugin registration to support multiple versions of CSI
registering with the same name.
2018-11-21 18:37:31 -08:00
PingWang
9d541911bb fix some typos
Signed-off-by: PingWang <wang.ping5@zte.com.cn>

fix typo

Signed-off-by: PingWang <wang.ping5@zte.com.cn>
2018-11-22 08:27:14 +08:00
Jordan Liggitt
70ad4dff48 Fix unit tests calling SetFeatureGateDuringTest incorrectly 2018-11-21 11:51:33 -05:00
yanghaichao12
982d1778f8 Fix comment error of 'cpuManagerStateFileName' 2018-11-19 08:07:04 -05:00
Vladimir Vivien
b195396154 Kubelet Plugin Registration v1 update fix 2018-11-15 17:40:35 -05:00
David Ashpole
630cb53f82 add kubelet grpc server for pod-resources service 2018-11-15 09:43:20 -08:00
Davanum Srinivas
954996e231
Move from glog to klog
- Move from the old github.com/golang/glog to k8s.io/klog
- klog has an explicit InitFlags(), so we add the flags as necessary
- we update the other repositories that we vendor that made a similar
change from glog to klog
  * github.com/kubernetes/repo-infra
  * k8s.io/gengo/
  * k8s.io/kube-openapi/
  * github.com/google/cadvisor
- Entirely remove all references to glog
- Fix some tests by explicit InitFlags in their init() methods

Change-Id: I92db545ff36fcec83afe98f550c9e630098b3135
2018-11-10 07:50:31 -05:00
David Ashpole
d4f6ae3615 fix slice sharing bug in cgroup manager 2018-11-05 17:42:42 -08:00
Pengfei Ni
856c83e637 Enable allocatable support for Windows nodes 2018-10-30 11:17:23 +08:00
Christoph Blecker
97b2992dc1
Update gofmt for go1.11 2018-10-05 12:59:38 -07:00
k8s-ci-robot
3fe21e5433
Merge pull request #68922 from BenTheElder/version-staging
move pkg/util/version to staging
2018-09-26 22:59:42 -07:00
k8s-ci-robot
0ca25b8db7
Merge pull request #68816 from FengyunPan2/cgroup-info
Add helpful log for checking cgroup path
2018-09-26 18:10:46 -07:00
FengyunPan2
34a8b1fd9f Add helpful log for checking cgroup path
Currently I just get 'xxx cgroup does not exist', but I don't know
which path is missing. Let's add a log for it.
2018-09-25 10:10:12 +08:00
k8s-ci-robot
8346631860
Merge pull request #68053 from Pingan2017/rmifblock
clean up unneeded else block
2018-09-24 17:17:29 -07:00
Benjamin Elder
8b56eb8588 hack/update-gofmt.sh 2018-09-24 12:21:29 -07:00
Benjamin Elder
f828c6f662 hack/update-bazel.sh 2018-09-24 12:03:24 -07:00
Benjamin Elder
088cf3c37b find & replace version import 2018-09-24 12:03:24 -07:00
Renaud Gaubert
8dd1d27c03 Updated the device manager pluginwatcher handler 2018-09-06 15:34:46 +02:00
Sandor Szücs
588d2808b7
Fix #51135: make the CFS quota period configurable. Adds a CLI flag and config option to the kubelet to set cpu.cfs_period; the default stays at 100ms as before.
This requires enabling the feature gate CustomCPUCFSQuotaPeriod.

Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
2018-09-01 20:19:59 +02:00
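To show why a configurable period matters, here is a rough sketch of how a CPU limit in millicores could translate into a CFS quota for a given period. The function `milliCPUToQuota` and the 1000 microsecond lower bound are assumptions made for this example; the commit message itself only states that the period becomes configurable and defaults to 100ms.

```go
package main

import "fmt"

// milliCPUToQuota is a hypothetical helper: it scales the period by the
// requested fraction of a CPU. Both values are in microseconds.
func milliCPUToQuota(milliCPU, periodUs int64) int64 {
	if milliCPU == 0 {
		return 0 // no limit requested
	}
	quota := milliCPU * periodUs / 1000
	// Assumed lower bound so that very small limits still get a usable quota.
	if quota < 1000 {
		quota = 1000
	}
	return quota
}

func main() {
	// 500m CPU with the default 100ms period and with a 10ms period.
	fmt.Println(milliCPUToQuota(500, 100000), milliCPUToQuota(500, 10000))
}
```

With a shorter period the container receives the same share of CPU time, but in smaller and more frequent slices.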
Pingan2017
2f1284bc34 cleanup unneeded if block 2018-08-30 17:18:56 +08:00
Kubernetes Submit Queue
c491d48cde
Merge pull request #67430 from choury/cpumanager
Automatic merge from submit-queue (batch tested with PRs 67430, 67550). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

cpumanager: rollback state if updateContainerCPUSet failed

**What this PR does / why we need it**:

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #63018

If `updateContainerCPUSet` fails, the container will fail to start. We should roll back the state to avoid leaking CPUs.
**Special notes for your reviewer**:

**Release note**:

```release-note
cpumanager: rollback state if updateContainerCPUSet failed
```
2018-08-21 23:20:58 -07:00
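The rollback behaviour that PR #67430 describes can be sketched as below. `cpuState`, `addContainer`, and the `update` callback are hypothetical names for this example; the callback plays the role of `updateContainerCPUSet`.

```go
package main

import (
	"errors"
	"fmt"
)

// cpuState is a hypothetical stand-in for the CPU manager state:
// container ID -> assigned CPUs.
type cpuState map[string][]int

var errUpdate = errors.New("runtime rejected the cpuset update")

// addContainer records the assignment, then undoes it if the runtime
// update fails, so the CPUs are not leaked.
func addContainer(state cpuState, id string, cpus []int, update func(string, []int) error) error {
	state[id] = cpus
	if err := update(id, cpus); err != nil {
		delete(state, id) // roll back the assignment
		return err
	}
	return nil
}

func main() {
	state := cpuState{}
	failing := func(string, []int) error { return errUpdate }
	err := addContainer(state, "c1", []int{2, 3}, failing)
	fmt.Println(err, len(state))
}
```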
Ismo Puustinen
dd3eeb3f46 device manager: don't do operations on nil pointer.
If grpc.DialContext() fails, a nil connection is returned. Check the
error before calling conn.Close().
2018-08-21 15:20:36 +03:00
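The pattern behind commit dd3eeb3f46 is to check the dial error before touching the connection. The sketch below uses a hypothetical `fakeConn` type and `probe` function instead of gRPC so that it stays self-contained.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeConn is a hypothetical connection type; Close writes to the
// receiver, so calling it on a nil pointer would panic.
type fakeConn struct{ closed bool }

func (c *fakeConn) Close() error {
	c.closed = true
	return nil
}

var errDial = errors.New("dial failed")

// probe checks the error first and only then schedules Close.
func probe(dial func() (*fakeConn, error)) error {
	conn, err := dial()
	if err != nil {
		return err // conn is nil here, so it must not be closed
	}
	defer conn.Close()
	return nil
}

func main() {
	err := probe(func() (*fakeConn, error) { return nil, errDial })
	fmt.Println(err)
}
```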
Kubernetes Submit Queue
d017bebf6b
Merge pull request #67145 from jiayingz/reboot-fix
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Fail container start if its requested device plugin resource is unknown.

With the change, Kubelet device manager now checks whether it has cached option state for the requested device plugin resource to make sure the resource is in ready state when we start the container.



**What this PR does / why we need it**:

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes https://github.com/kubernetes/kubernetes/issues/67107

**Special notes for your reviewer**:

**Release note**:

```release-note
Fail container start if its requested device plugin resource hasn't registered after Kubelet restart.
```
2018-08-21 01:48:54 -07:00
choury
36b92b9b29 cpumanager: rollback state if updateContainerCPUSet failed 2018-08-17 18:08:58 +08:00
tianshapjq
81081dc9e7 nits in manager.go 2018-08-15 08:16:04 +08:00
Jiaying Zhang
7b1ae66432 Fail container start if its requested device plugin resource doesn't
have cached option state to make sure the device plugin resource is
in ready state when we start the container.
2018-08-08 13:11:36 -07:00
Kubernetes Submit Queue
60ac433922
Merge pull request #66946 from LinEricYang/unused-variable
Automatic merge from submit-queue (batch tested with PRs 66512, 66946, 66083). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

kubelet/cm/cpumanager: Fix unused variable "skipIfPermissionsError"

The variable "skipIfPermissionsError" is not needed, even when a
permission error occurs.
2018-08-06 19:44:04 -07:00
Kubernetes Submit Queue
d114692a58
Merge pull request #58058 from tianshapjq/cleanup-useless-var-deviceplugin/types.go
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

clean up useless variables in deviceplugin/types.go

**What this PR does / why we need it**:
Some variables are useless for various reasons; I think we need a cleanup.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #

**Special notes for your reviewer**:

**Release note**:

```release-note
NONE
```
2018-08-06 16:33:54 -07:00
Lin Yang
b7e1f0bf17 kubelet/cm/cpumanager: Fix unused variable "skipIfPermissionsError"
The variable "skipIfPermissionsError" is not needed, even when a
permission error occurs.
2018-08-02 17:24:33 -07:00
Kubernetes Submit Queue
266cf70ac0
Merge pull request #66617 from pravisankar/fix-pod-cgroup-parent
Automatic merge from submit-queue (batch tested with PRs 66190, 66871, 66617, 66293, 66891). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Do not set cgroup parent when --cgroups-per-qos is disabled

When --cgroups-per-qos=false (default is true), kubelet sets pod
container management to podContainerManagerNoop implementation and
GetPodContainerName() returns '/' as cgroup parent (default cgroup root).

(1) In case of 'systemd' cgroup driver, '/' is invalid parent as
docker daemon expects '.slice' suffix and throws this error:
'cgroup-parent for systemd cgroup should be a valid slice named as \"xxx.slice\"'
(5fc12449d8/daemon/daemon_unix.go (L618))
'/' corresponds to '-.slice' (root slice) in systemd but I don't think
we want to assign root slice instead of runtime specific default value.
In case of docker runtime, this will be 'system.slice'
(e2593239d9/daemon/oci_linux.go (L698))

(2) In case of 'cgroupfs' cgroup driver, '/' is valid parent but I don't
think we want to assign root instead of runtime specific default value.
In case of docker runtime, this will be '/docker'
(e2593239d9/daemon/oci_linux.go (L695))

Current fix will not set the cgroup parent when --cgroups-per-qos is disabled.

```release-note
Fix pod launch by kubelet when --cgroups-per-qos=false and --cgroup-driver="systemd"
```
2018-08-02 15:42:16 -07:00
Kubernetes Submit Queue
2f21394859
Merge pull request #66190 from linyouchong/issue-66189
Automatic merge from submit-queue (batch tested with PRs 66190, 66871, 66617, 66293, 66891). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

fix nil pointer dereference in node_container_manager#enforceExisting

**What this PR does / why we need it**:
fix nil pointer dereference in node_container_manager#enforceExisting

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #66189

**Special notes for your reviewer**:
NONE

**Release note**:
```release-note
kubelet: fix nil pointer dereference when the enforce-node-allocatable flag is not configured properly
```
2018-08-02 15:42:09 -07:00
Kubernetes Submit Queue
c2536e2b0d
Merge pull request #61159 from linyouchong/linyouchong-20180314
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Skip checking when failSwapOn=false

**What this PR does / why we need it**:
Skip checking when failSwapOn=false

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #

**Special notes for your reviewer**:
NONE
**Release note**:
```
NONE
```
2018-08-02 14:09:39 -07:00
Kubernetes Submit Queue
f2c6473e25
Merge pull request #66718 from ipuustin/cpu-manager-validate-offline
Automatic merge from submit-queue (batch tested with PRs 66623, 66718). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

cpumanager: validate topology in static policy

**What this PR does / why we need it**:

This patch adds a check for the static policy state validation. The check fails if the CPU topology obtained from cadvisor doesn't match with the current topology in the state file.

If the CPU topology has changed in a node, cpumanager static policy might try to assign non-present cores to containers.

For example in my test case, static policy had the default CPU set of `0-1,4-7`. Then kubelet was shut down and CPU 7 was offlined. After restarting the kubelet, CPU manager tries to assign the non-existent CPU 7 to containers which don't have exclusive allocations assigned to them:

    Error response from daemon: Requested CPUs are not available - requested 0-1,4-7, available: 0-6)

This breaks the exclusivity, since the CPUs from the shared pool don't get assigned to non-exclusive containers, meaning that they can execute on the exclusive CPUs.

**Release note**:

```release-note
Added CPU Manager state validation in case of changed CPU topology.
```
2018-07-31 08:05:06 -07:00
Ismo Puustinen
3bb5ca9257 cpumanager: add test for available CPUs in static policy.
Test the cases where the number of CPUs available in the system is
smaller or larger than the number of CPUs known in the state, which
should lead to a panic. This covers both CPU onlining and offlining. The
case where the number of CPUs matches is already covered by the
"non-corrupted state" test.
2018-07-31 10:20:37 +03:00
Ismo Puustinen
4f604eb73c cpumanager: validate topology in static policy.
This patch adds a check for the static policy state validation. The
check fails if the CPU topology obtained from cadvisor doesn't match
with the current topology in the state file.

If the CPU topology has changed in a node, cpu manager static policy
might try to assign non-present cores to containers.

For example in my test case, static policy had the default CPU set of
0-1,4-7. Then kubelet was shut down and CPU 7 was offlined. After
restarting the kubelet, CPU manager tries to assign the non-existent CPU
7 to containers which don't have exclusive allocations assigned to them:

 Error response from daemon: Requested CPUs are not available - requested 0-1,4-7, available: 0-6)

This breaks the exclusivity, since the CPUs from the shared pool don't
get assigned to non-exclusive containers, meaning that they can execute
on the exclusive CPUs.
2018-07-30 08:49:13 +03:00
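The validation described in commit 4f604eb73c can be illustrated with the sketch below. `validateCPUs` is a hypothetical function that compares plain integer slices; it returns an error on a mismatch, which is a simplification of whatever the kubelet does with the result.

```go
package main

import "fmt"

// validateCPUs is a hypothetical check: it reports an error unless the
// CPUs recorded in the state file are exactly the CPUs the machine reports.
func validateCPUs(stateCPUs, availableCPUs []int) error {
	available := make(map[int]bool, len(availableCPUs))
	for _, cpu := range availableCPUs {
		available[cpu] = true
	}
	seen := make(map[int]bool, len(stateCPUs))
	for _, cpu := range stateCPUs {
		if !available[cpu] {
			return fmt.Errorf("CPU %d is in the state file but not available", cpu)
		}
		seen[cpu] = true
	}
	if len(seen) != len(available) {
		return fmt.Errorf("state knows %d CPUs, system has %d", len(seen), len(available))
	}
	return nil
}

func main() {
	state := []int{0, 1, 4, 5, 6, 7}
	// CPU 7 was taken offline while the kubelet was stopped.
	fmt.Println(validateCPUs(state, []int{0, 1, 4, 5, 6}))
	fmt.Println(validateCPUs(state, state))
}
```

The first call mirrors the offlining example from the commit message; the check also fails in the opposite direction, when the system reports CPUs the state file does not know about.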
hui luo
7101c17498 While reviewing the devicemanager code, I found that
the caching layer on the endpoint is redundant.

Here are the 3 related objects in the picture:
devicemanager <-> endpoint <-> plugin

The plugin is the source of truth for devices
and device health status.

devicemanager maintains healthyDevices,
unhealthyDevices, and allocatedDevices based on updates
from the plugin.

So there is no point in the endpoint caching devices;
this patch removes this caching layer on the endpoint.

It also removes Manager.Devices(), since I didn't
find any caller of it other than tests, and adds a
notification channel to facilitate testing.

If we need to get all devices from the manager in the future,
it just needs to return healthyDevices + unhealthyDevices;
we don't have to call the endpoint after all.

This patch makes the code more readable and simplifies the data model.
2018-07-29 21:07:14 -07:00
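The data model that commit 7101c17498 argues for can be sketched as follows: the manager keeps the healthy and unhealthy sets itself, so the full device list is simply their union. The type `deviceManager` and its methods are hypothetical and heavily simplified.

```go
package main

import (
	"fmt"
	"sort"
)

// deviceManager is a hypothetical, simplified manager that tracks device
// health directly from plugin updates, without an endpoint-side cache.
type deviceManager struct {
	healthy   map[string]bool
	unhealthy map[string]bool
}

func newDeviceManager() *deviceManager {
	return &deviceManager{healthy: map[string]bool{}, unhealthy: map[string]bool{}}
}

// update applies one health report coming from the plugin.
func (m *deviceManager) update(id string, isHealthy bool) {
	delete(m.healthy, id)
	delete(m.unhealthy, id)
	if isHealthy {
		m.healthy[id] = true
	} else {
		m.unhealthy[id] = true
	}
}

// allDevices returns healthy + unhealthy devices in sorted order.
func (m *deviceManager) allDevices() []string {
	var ids []string
	for id := range m.healthy {
		ids = append(ids, id)
	}
	for id := range m.unhealthy {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	m := newDeviceManager()
	m.update("gpu-0", true)
	m.update("gpu-1", false)
	fmt.Println(m.allDevices())
}
```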
Kubernetes Submit Queue
32e38b6659
Merge pull request #58755 from vikaschoudhary16/probing-mode
Automatic merge from submit-queue (batch tested with PRs 58755, 66414). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Use probe based plugin watcher mechanism in Device Manager

**What this PR does / why we need it**:
Uses this probe based utility in the device plugin manager.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #56944 

**Notes For Reviewers**:
Changes are backward compatible and existing device plugins will continue to work. At the same time, any new plugins that have the required support for the probing model (Identity service implementation) will also work.


**Release note**
```release-note
Add support for the kubelet plugin watcher in the device manager.
```
/sig node
/area hw-accelerators
/cc @jiayingz @RenaudWasTaken @vishh @ScorpioCPH @sjenning @derekwaynecarr @jeremyeder @lichuqiang @tengqm @saad-ali @chakri-nelluri @ConnorDoyle
2018-07-27 15:20:06 -07:00
bingshen.wbs
b1bdd043c4 fix kubelet npe on device plugin return zero container
Signed-off-by: bingshen.wbs <bingshen.wbs@alibaba-inc.com>
2018-07-25 10:15:30 +08:00
Ravi Sankar Penta
0282720e29 Do not set cgroup parent when --cgroups-per-qos is disabled
When --cgroups-per-qos=false (default is true), kubelet sets pod
container management to podContainerManagerNoop implementation and
GetPodContainerName() returns '/' as cgroup parent (default cgroup root).

(1) In case of 'systemd' cgroup driver, '/' is invalid parent as
docker daemon expects '.slice' suffix and throws this error:
'cgroup-parent for systemd cgroup should be a valid slice named as \"xxx.slice\"'
(5fc12449d8/daemon/daemon_unix.go (L618))
'/' corresponds to '-.slice' (root slice) in systemd but I don't think
we want to assign root slice instead of runtime specific default value.
In case of docker runtime, this will be 'system.slice'
(e2593239d9/daemon/oci_linux.go (L698))

(2) In case of 'cgroupfs' cgroup driver, '/' is valid parent but I don't
think we want to assign root instead of runtime specific default value.
In case of docker runtime, this will be '/docker'
(e2593239d9/daemon/oci_linux.go (L695))

Current fix will not set the cgroup parent when --cgroups-per-qos is disabled.
2018-07-20 10:25:50 -07:00
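The decision described in commit 0282720e29 can be reduced to a small sketch. `cgroupParentFor` is a hypothetical function; an empty string stands for "leave the cgroup parent unset so that the runtime applies its own default".

```go
package main

import "fmt"

// cgroupParentFor is a hypothetical helper. When per-QoS cgroups are
// disabled it returns "" instead of "/", so the container runtime falls
// back to its own default parent.
func cgroupParentFor(cgroupsPerQOS bool, podCgroup string) string {
	if !cgroupsPerQOS {
		return ""
	}
	return podCgroup
}

func main() {
	fmt.Printf("%q\n", cgroupParentFor(false, "/"))
	fmt.Printf("%q\n", cgroupParentFor(true, "/kubepods/burstable/pod1234"))
}
```

The pod cgroup path in the second call is an invented example value.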
vikaschoudhary16
a5842503eb Use probe based plugin discovery mechanism in device manager 2018-07-17 04:02:31 -04:00
linyouchong
6ff285bce3 fix nil pointer dereference in node_container_manager#enforceExistingCgroup 2018-07-14 10:42:42 +08:00
choury
8e4b62a74b
Remove duplicate check line
The same check already exists on this [line](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/policy_static.go#L81).
2018-07-05 11:07:56 +08:00
Seth Jennings
3234b0fa5b feature gate LSI capacity calculation 2018-06-28 14:01:08 -05:00
Kubernetes Submit Queue
991a84758f
Merge pull request #59214 from kdembler/cpumanager-checkpointing
Automatic merge from submit-queue (batch tested with PRs 59214, 65330). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Migrate cpumanager to use checkpointing manager

**What this PR does / why we need it**:
This PR migrates `cpumanager` to use the new kubelet-level node checkpointing feature (#56040) to decrease code redundancy and improve consistency.

**Which issue(s) this PR fixes**:
Fixes #58339

**Notes**:
At the time of submitting this PR, the most straightforward approach was used: a `state_checkpoint` implementation of the `State` interface was added. However, with the checkpointing implementation there might be no point in keeping the `State` interface; we could use a single implementation with a checkpoint backend and, if a backend other than the filestore is needed, supply `cpumanager` with a custom `CheckpointManager` implementation.

/kind feature
/sig node
cc @flyingcougar @ConnorDoyle
2018-06-25 18:19:00 -07:00
Jeff Grafton
23ceebac22 Run hack/update-bazel.sh 2018-06-22 16:22:57 -07:00
Jeff Grafton
a725660640 Update to gazelle 0.12.0 and run hack/update-bazel.sh 2018-06-22 16:22:18 -07:00
Kubernetes Submit Queue
148350d3c4
Merge pull request #64426 from cofyc/remove_unnecessary_fakemounters
Automatic merge from submit-queue (batch tested with PRs 64142, 64426, 62910, 63942, 64548). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Clean up fake mounters.

**What this PR does / why we need it**:

Fixes https://github.com/kubernetes/kubernetes/issues/61502

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #

**Special notes for your reviewer**:

list of fake mounters:

- (keep) pkg/util/mount.FakeMounter
- (removed) pkg/kubelet/cm.fakeMountInterface:
- (inherit from mount.FakeMounter) pkg/util/mount.fakeMounter
- (inherit from mount.FakeMounter) pkg/util/removeall.fakeMounter
- (removed) pkg/volume/host_path.fakeFileTypeChecker

**Release note**:

```release-note
NONE
```
2018-06-20 00:05:10 -07:00
Kubernetes Submit Queue
c399c306e2
Merge pull request #59174 from tianshapjq/todo-already-done
Automatic merge from submit-queue (batch tested with PRs 65230, 57355, 59174, 63698, 63659). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

TODO has already been implemented

**What this PR does / why we need it**:
TODO has already been implemented, remove the TODO tag.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #

**Special notes for your reviewer**:

**Release note**:

```release-note
NONE
```
2018-06-19 20:19:17 -07:00
Klaudiusz Dembler
a9df2acc4b Typo fix 2018-06-07 12:08:48 +02:00
Yecheng Fu
40c3937320 Clean up fake mounters. 2018-06-02 15:55:19 +08:00
Kubernetes Submit Queue
d2495b8329
Merge pull request #63143 from jsafrane/containerized-subpath
Automatic merge from submit-queue (batch tested with PRs 63348, 63839, 63143, 64447, 64567). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Containerized subpath

**What this PR does / why we need it**:
Containerized kubelet needs a different implementation of `PrepareSafeSubpath` than kubelet running directly on the host.

On the host we safely open the subpath and then bind-mount `/proc/<pidof kubelet>/fd/<descriptor of opened subpath>`.

With kubelet running in a container, `/proc/xxx/fd/yy` on the host contains a path that works only inside the container, i.e. `/rootfs/path/to/subpath`, and thus any bind-mount on the host fails.

Solution:
- safely open the subpath and get its device ID and inode number
- blindly bind-mount the subpath to `/var/lib/kubelet/pods/<uid>/volume-subpaths/<name of container>/<id of mount>`. This is potentially unsafe, because a user can change the subpath source to a link to a bad place (say `/run/docker.sock`) just before the bind-mount.
- get device ID and inode number of the destination. Typical users can't modify this file, as it lies on /var/lib/kubelet on the host.
- compare these device IDs and inode numbers.

**Which issue(s) this PR fixes**
Fixes #61456

**Special notes for your reviewer**:

The PR contains some refactoring of `doBindSubPath` to extract the common code. New `doNsEnterBindSubPath` is added for the nsenter related parts.

**Release note**:

```release-note
NONE
```
2018-06-01 12:12:19 -07:00
Guoliang Wang
761cf41427 Move pkg/scheduler/schedulercache -> pkg/scheduler/cache 2018-05-31 22:55:34 +08:00
Jan Safranek
74ba0878a1 Enhance ExistsPath check
It should return an error when the check fails (e.g. no permissions, symlink
loop etc.)
2018-05-23 10:21:20 +02:00
Jan Safranek
97b5299cd7 Add GetMode to mounter interface.
Kubelet must not call os.Lstat on raw volume paths when it runs in a container.
Mounter knows where the file really is.
2018-05-23 10:17:59 +02:00
Klaudiusz Dembler
9384937f2f Update bazel 2018-05-21 17:39:51 +02:00
Klaudiusz Dembler
de1063bc7d Add compatibility tests 2018-05-21 14:50:31 +02:00
Klaudiusz Dembler
3d09101b6f Add docstrings 2018-05-21 11:40:04 +02:00
Jan Safranek
598ca5accc Add GetSELinuxSupport to mounter. 2018-05-17 13:36:37 +02:00
Klaudiusz Dembler
aa325ec2d9 Change JSON letter case in tests 2018-05-15 18:43:48 +02:00
Klaudiusz Dembler
7bb047ec75 Rebase and backward compatibility 2018-05-15 18:34:53 +02:00
Klaudiusz Dembler
ba8d82c96a
Update error indicating nonexistent checkpoint 2018-05-14 09:51:27 +02:00
Klaudiusz Dembler
0b1a73e94b
Make cpuManagerCheckpoint exported 2018-05-14 09:51:27 +02:00
Klaudiusz Dembler
cc3fa67bda
Add comments to MockCheckpoint functions and gofmt 2018-05-14 09:51:27 +02:00
Klaudiusz Dembler
0fbd19bc06
Tweaks 2018-05-14 09:51:26 +02:00
Klaudiusz Dembler
3991ed5d2f
Add tests 2018-05-14 09:51:26 +02:00
Klaudiusz Dembler
6bfceed4ab
Migrate cpumanager to use checkpointing manager 2018-05-14 09:45:58 +02:00
Kubernetes Submit Queue
204520b029
Merge pull request #63344 from RobertKrawitz/fix-process-kill-algorithm
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Correct kill logic for pod processes

Correct the kill logic for processes in the pod's cgroup.  os.FindProcess() does not check whether the process exists on POSIX systems.
2018-05-11 11:41:19 -07:00
Kubernetes Submit Queue
321201f672
Merge pull request #63406 from derekwaynecarr/label-pod-cgroups
Automatic merge from submit-queue (batch tested with PRs 60200, 63623, 63406). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Apply pod name and namespace labels for pod cgroup for cadvisor metrics

**What this PR does / why we need it**:
1. Enable Prometheus users to determine usage by pod name and namespace for pod cgroup sandbox.
1. Label cAdvisor metrics for pod cgroups by pod name and namespace.
1. Align with the kubelet stats summary endpoint's pod CPU and memory stats.

**Special notes for your reviewer**:
This provides parity with the summary API enhancements done here:
https://github.com/kubernetes/kubernetes/pull/55969

**Release note**:
```release-note
Apply pod name and namespace labels to pod cgroup in cAdvisor metrics
```
2018-05-10 08:33:11 -07:00
Derek Carr
a09990cd43 Apply pod name and namespace labels for pod cgroup for cadvisor metrics 2018-05-07 14:51:12 -04:00
Kubernetes Submit Queue
1929e0d86d
Merge pull request #63298 from dims/kubelet-remove-unused-code
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

kubelet - Remove unused code

**What this PR does / why we need it**:

Looks like we have a bunch of unused methods. Let's clean them up

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #

**Special notes for your reviewer**:

**Release note**:

```release-note
NONE
```
2018-05-04 04:20:06 -07:00
Kubernetes Submit Queue
592c39bccc
Merge pull request #62541 from filbranden/cgroupname1
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Use a []string for CgroupName, which is a more accurate internal representation

**What this PR does / why we need it**:

This is purely a refactoring and should bring no essential change in behavior.

It does clarify the cgroup handling code quite a bit.

It is preparation for further changes we might want to do in the cgroup hierarchy. (But it's useful on its own, so even if we don't do any, it should still be considered.)

**Special notes for your reviewer**:

The slice of strings more precisely captures the hierarchic nature of the cgroup paths we use to represent pods and their groupings.

It also ensures we're reducing the chances of passing an incorrect path format to a cgroup driver that requires a different path naming, since now explicit conversions are always needed.

The new constructor `NewCgroupName` starts from an existing `CgroupName`, which enforces a hierarchy where a root is always needed. It also performs checking on the component names to ensure invalid characters ("/" and "_") are not in use.

A `RootCgroupName` for the top of the cgroup hierarchy tree is introduced.
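
A minimal sketch of the slice-based representation described above. It is a simplification and not the code from this PR: it returns an error for invalid components (an assumption made for the example), and `cgroupfsPath` is a hypothetical stand-in for the explicit driver conversions.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// CgroupName holds one element per level of the cgroup hierarchy.
type CgroupName []string

// RootCgroupName is the top of the hierarchy (no components).
var RootCgroupName = CgroupName([]string{})

// NewCgroupName extends an existing name, so every name is rooted somewhere.
// Components containing "/" or "_" are rejected. Returning an error here is
// an assumption of this sketch.
func NewCgroupName(base CgroupName, components ...string) (CgroupName, error) {
	for _, c := range components {
		if strings.ContainsAny(c, "/_") {
			return nil, fmt.Errorf("invalid character in component %q", c)
		}
	}
	// Copy so that the result never aliases the base slice.
	name := make(CgroupName, 0, len(base)+len(components))
	name = append(name, base...)
	name = append(name, components...)
	return name, nil
}

// cgroupfsPath (hypothetical) is the explicit conversion to a cgroupfs-style path.
func (n CgroupName) cgroupfsPath() string {
	return "/" + path.Join(n...)
}

func main() {
	name, err := NewCgroupName(RootCgroupName, "kubepods", "burstable")
	if err != nil {
		panic(err)
	}
	fmt.Println(name.cgroupfsPath()) // prints "/kubepods/burstable"
}
```

Because a `CgroupName` is not a string, it cannot be handed to a driver by accident: a caller has to pick a conversion first.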

This refactor results in a net reduction of around 30 lines of code,
mainly with the demise of ConvertCgroupNameToSystemd which had fairly
complicated logic in it and was doing just too many things.

There's a small TODO in a helper `updateSystemdCgroupInfo` that was introduced to make this commit possible. That logic really belongs in libcontainer, I'm planning to send a PR there to include it there. (The API already takes a field with that information, only that field is only processed in cgroupfs and not systemd driver, we should fix that.)

Tested: By running the e2e-node tests on both Ubuntu 16.04 (with cgroupfs driver) and CentOS 7 (with systemd driver.)

**NOTE**: I only tested this with dockershim, we should double-check that this works with the CRI endpoints too, both in cgroupfs and systemd modes.

/assign @derekwaynecarr 
/assign @dashpole 
/assign @Random-Liu 

**Release note**:

```release-note
NONE
```
2018-05-03 08:16:45 -07:00
Kubernetes Submit Queue
4f56127582
Merge pull request #63073 from andyxning/refactor_grpc_dial_with_dialcontext
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

refactor device plugin grpc dial with dialcontext

**What this PR does / why we need it**:
Refactor grpc `dial` with `dialContext` as `grpc.WithTimeout` has been deprecated by:
> use DialContext and context.WithTimeout instead.

**Special notes for your reviewer**:

**Release note**:

```release-note
NONE
```
2018-05-03 01:16:34 -07:00
Kubernetes Submit Queue
186dd7beb1
Merge pull request #62903 from cofyc/fixfsgroupcheckinlocal
Automatic merge from submit-queue (batch tested with PRs 62657, 63278, 62903, 63375). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Add more volume types in e2e and fix part of them.

**What this PR does / why we need it**:

- Add dir-link/dir-bindmounted/dir-link-bindmounted/blockfs volume types for e2e tests.
- Fix fsGroup related e2e tests partially.
- Return error if we cannot resolve volume path.
  - We should not fall back to the volume path: if it's a symbolic link, we may get wrong results.

To safely set fsGroup on local volume, we need to implement these two methods correctly for all volume types both on the host and in container:

- get volume path kubelet can access
  - paths on the host and in container are different
- get mount references
  - for directories, we cannot use its mount source (device field) to identify mount references, because directories on same filesystem have same mount source (e.g. tmpfs), we need to check filesystem's major:minor and directory root path on it

Here is current status:

| | (A) volume-path (host) | (B) volume-path (container) | (C) mount-refs (host) | (D) mount-refs (container) |
| --- | --- | --- | --- | --- |
| (1) dir | OK | FAIL | FAIL | FAIL |
| (2) dir-link | OK | FAIL | FAIL | FAIL |
| (3) dir-bindmounted | OK | FAIL | FAIL | FAIL |
| (4) dir-link-bindmounted | OK | FAIL | FAIL | FAIL |
| (5) tmpfs| OK | FAIL | FAIL | FAIL |
| (6) blockfs| OK | FAIL | OK | FAIL |
| (7) block| NOTNEEDED | NOTNEEDED | NOTNEEDED | NOTNEEDED |
| (8) gce-localssd-scsi-fs| NOTTESTED | NOTTESTED | NOTTESTED | NOTTESTED |

- This PR uses `nsenter ... readlink` to resolve the path in the container, as @msau42 and @jsafrane [suggested](https://github.com/kubernetes/kubernetes/pull/61489#pullrequestreview-110032850). This fixes B1:B6 and D6; the rest will be addressed in https://github.com/kubernetes/kubernetes/pull/62102.
- C5:D5 are marked `FAIL` because `tmpfs` filesystems can share the same mount source, so we cannot rely on it to check mount references. The e2e tests pass because we use a unique mount source string in tests.
- A7:D7 marked `NOTNEEDED` because we don't set fsGroup on block devices in local plugin. (TODO: Should we set fsGroup on block device?)
- A8:D8 marked `NOTTESTED` because I didn't test it, I leave it to `pull-kubernetes-e2e-gce`. I think it should be same as `blockfs`.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #

**Special notes for your reviewer**:

**Release note**:

```release-note
NONE
```
2018-05-02 20:13:11 -07:00
Yecheng Fu
3748197876 Add more volume types in e2e and fix part of them.
- Add dir-link/dir-bindmounted/dir-link-bindmounted/blockfs volume types for e2e
tests.
- Return error if we cannot resolve volume path.
- Add GetFSGroup/GetMountRefs methods for mount.Interface.
- Fix fsGroup related e2e tests partially.
2018-05-02 10:31:42 +08:00
Robert Krawitz
3f3c04d722 WIP: Correct kill logic for cgroup processes 2018-05-01 19:38:12 -04:00
Filipe Brandenburger
b230fb8ac4 Use a []string for CgroupName, which is a more accurate internal representation
The slice of strings more precisely captures the hierarchic nature of
the cgroup paths we use to represent pods and their groupings.

It also ensures we're reducing the chances of passing an incorrect path
format to a cgroup driver that requires a different path naming, since
now explicit conversions are always needed.

The new constructor NewCgroupName starts from an existing CgroupName,
which enforces a hierarchy where a root is always needed. It also
performs checking on the component names to ensure invalid characters
("/" and "_") are not in use.

A RootCgroupName for the top of the cgroup hierarchy tree is introduced.

This refactor results in a net reduction of around 30 lines of code,
mainly with the demise of ConvertCgroupNameToSystemd which had fairly
complicated logic in it and was doing just too many things.

There's a small TODO in a helper updateSystemdCgroupInfo that was
introduced to make this commit possible. That logic really belongs in
libcontainer, I'm planning to send a PR there to include it there.
(The API already takes a field with that information, only that field is
only processed in cgroupfs and not systemd driver, we should fix that.)

Tested by running the e2e-node tests on both Ubuntu 16.04 (with cgroupfs
driver) and CentOS 7 (with systemd driver.)
2018-05-01 08:29:06 -07:00
Kubernetes Submit Queue
15cc20630d
Merge pull request #60034 from pohly/device-manager-goroutine
Automatic merge from submit-queue (batch tested with PRs 58474, 60034, 62101, 63198). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

avoid race condition in device manager and plugin startup/shutdown: wait for goroutines

**What this PR does / why we need it**:

Commit 1325c2f worked around issue #59488, but it is still worthwhile to fix the underlying root cause properly.

**Which issue(s) this PR fixes**:
Fixes #59488

**Special notes for your reviewer**:

This is an alternative to PR #59861, which used a different approach. Personally I tend to prefer this one now.

**Release note**:
```release-note
NONE
```

/sig node
/area hw-accelerators
/assign vikaschoudhary16
2018-04-30 13:24:08 -07:00
Davanum Srinivas
4bacd77321 Remove unused code 2018-04-30 14:57:26 -04:00
Andy Xie
b01657d0c7 refactor device plugin grpc dial with dialcontext 2018-04-25 18:40:23 +08:00
vikaschoudhary16
c846d5fe63 Fix race between stopping old and starting new endpoint 2018-04-24 22:22:39 -04:00
choury
c1b19fce90 avoid double RLock() in cpumanager 2018-04-23 10:33:40 +08:00
Kubernetes Submit Queue
4d6a6ced8c
Merge pull request #56525 from tianshapjq/testcase-helpers_linux.go
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

new testcase to helpers_linux.go

new testcase to helpers_linux.go, PTAL.

```release-note
NONE
```
2018-04-20 18:55:13 -07:00
tianshapjq
cbab51c15d remove useless variables in deviceplugin/types.go 2018-04-20 09:26:59 +08:00
Kubernetes Submit Queue
e9374411d5
Merge pull request #62509 from sjenning/qos-reserved-feature-gate
Automatic merge from submit-queue (batch tested with PRs 61962, 58972, 62509, 62606). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

kubelet: move QOSReserved from experimental to alpha feature gate

Fixes https://github.com/kubernetes/kubernetes/issues/61665

**Release note**:
```release-note
The --experimental-qos-reserve kubelet flag is replaced by the alpha level --qos-reserved flag or QOSReserved field in the kubeletconfig and requires the QOSReserved feature gate to be enabled.
```

/sig node
/assign  @derekwaynecarr 
/cc @mtaufen
2018-04-19 16:47:21 -07:00
Kubernetes Submit Queue
f3599ba3c9
Merge pull request #61962 from liggitt/flag-race
Automatic merge from submit-queue (batch tested with PRs 61962, 58972, 62509, 62606). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Avoid data races in unit tests

Setting global flags in unit tests leads to data races like this:

```
==================
WARNING: DATA RACE
Write at 0x0000028f5241 by goroutine 47:
  flag.(*boolValue).Set()
      /home/jliggitt/.gvm/gos/go1.9.5/src/flag/flag.go:91 +0x7b
  flag.(*FlagSet).Set()
      /home/jliggitt/.gvm/gos/go1.9.5/src/flag/flag.go:366 +0x10c
  flag.Set()
      /home/jliggitt/.gvm/gos/go1.9.5/src/flag/flag.go:379 +0x76
  k8s.io/kubernetes/pkg/kubelet/cm/devicemanager.TestPodContainerDeviceAllocation()
      /home/jliggitt/go/src/k8s.io/kubernetes/pkg/kubelet/cm/devicemanager/manager_test.go:549 +0x126
  testing.tRunner()
      /home/jliggitt/.gvm/gos/go1.9.5/src/testing/testing.go:746 +0x16c

Previous read at 0x0000028f5241 by goroutine 34:
  k8s.io/kubernetes/vendor/github.com/golang/glog.(*loggingT).output()
      /home/jliggitt/go/src/k8s.io/kubernetes/vendor/github.com/golang/glog/glog.go:682 +0x730
  k8s.io/kubernetes/vendor/github.com/golang/glog.(*loggingT).printf()
      /home/jliggitt/go/src/k8s.io/kubernetes/vendor/github.com/golang/glog/glog.go:655 +0x259
  k8s.io/kubernetes/vendor/github.com/golang/glog.Errorf()
      /home/jliggitt/go/src/k8s.io/kubernetes/vendor/github.com/golang/glog/glog.go:1118 +0x74
  k8s.io/kubernetes/pkg/kubelet/cm/devicemanager.(*endpointImpl).run()
      /home/jliggitt/go/src/k8s.io/kubernetes/pkg/kubelet/cm/devicemanager/endpoint.go:132 +0x1c7e
  k8s.io/kubernetes/pkg/kubelet/cm/devicemanager.(*ManagerImpl).addEndpoint.func1()
      /home/jliggitt/go/src/k8s.io/kubernetes/pkg/kubelet/cm/devicemanager/manager.go:378 +0x3f

Goroutine 47 (running) created at:
  testing.(*T).Run()
      /home/jliggitt/.gvm/gos/go1.9.5/src/testing/testing.go:789 +0x568
  testing.runTests.func1()
      /home/jliggitt/.gvm/gos/go1.9.5/src/testing/testing.go:1004 +0xa7
  testing.tRunner()
      /home/jliggitt/.gvm/gos/go1.9.5/src/testing/testing.go:746 +0x16c
  testing.runTests()
      /home/jliggitt/.gvm/gos/go1.9.5/src/testing/testing.go:1002 +0x521
  testing.(*M).Run()
      /home/jliggitt/.gvm/gos/go1.9.5/src/testing/testing.go:921 +0x206
  main.main()
      k8s.io/kubernetes/pkg/kubelet/cm/devicemanager/_test/_testmain.go:68 +0x1d3

Goroutine 34 (finished) created at:
  k8s.io/kubernetes/pkg/kubelet/cm/devicemanager.(*ManagerImpl).addEndpoint()
      /home/jliggitt/go/src/k8s.io/kubernetes/pkg/kubelet/cm/devicemanager/manager.go:377 +0x9d6
==================
--- FAIL: TestPodContainerDeviceAllocation (0.00s)
	testing.go:699: race detected during execution of test
FAIL
FAIL	k8s.io/kubernetes/pkg/kubelet/cm/devicemanager	0.124s

```
2018-04-19 16:47:14 -07:00
Seth Jennings
9bcd986b23 kubelet: move QOSReserved from experimental to alpha feature gate 2018-04-16 13:08:40 -05:00
vikaschoudhary16
cedbd93255 Make 'pod' package to use unified checkpointManager
Signed-off-by: vikaschoudhary16 <choudharyvikas16@gmail.com>
2018-04-16 01:30:20 -04:00
vikaschoudhary16
d62bd9ef65 Node-level Checkpointing manager 2018-04-16 00:19:42 -04:00
Patrick Ohly
fcbb64b93d avoid race condition in device manager and plugin startup/shutdown
A flaky test exposed a race condition where shutting down one server
instance broke the startup of the next instance when using the same
socket path. Commit 1325c2f8be removed the reuse of the same socket
path and thus avoided the issue.

But the real fix is to ensure that the listening socket is really
closed once Stop returns. Two solutions were proposed in
https://github.com/grpc/grpc-go/issues/1861:
- waiting for the goroutine to complete
- closing the socket

The former is done here because it's cleaner to not keep lingering
goroutines. While at it, the Stop methods are made idempotent (similar
to e.g. Close on a socket) and no longer crash when called without
prior Start.
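
The "wait for the goroutine" approach with an idempotent Stop can be sketched like this. It is an illustration under simplifying assumptions, not the code from this commit: the `server` type is hypothetical and the goroutine only blocks on a channel where a real server would be serving on its socket.

```go
package main

import (
	"fmt"
	"sync"
)

// server is a hypothetical stand-in for a component with Start and Stop.
type server struct {
	mu   sync.Mutex
	stop chan struct{} // closed to request shutdown; nil when not running
	done chan struct{} // closed by the goroutine once it has exited
}

// Start launches the serving goroutine unless one is already running.
func (s *server) Start() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.stop != nil {
		return // already running
	}
	stop, done := make(chan struct{}), make(chan struct{})
	s.stop, s.done = stop, done
	go func() {
		defer close(done)
		<-stop // stand-in for the serving loop
	}()
}

// Stop requests shutdown and returns only after the goroutine has exited.
// It is safe to call repeatedly, and before Start.
func (s *server) Stop() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.stop == nil {
		return // never started, or already stopped
	}
	close(s.stop)
	<-s.done // wait for the goroutine so nothing lingers after Stop
	s.stop, s.done = nil, nil
}

func main() {
	s := &server{}
	s.Stop() // no-op without a prior Start
	s.Start()
	s.Stop()
	s.Stop() // second call is a no-op
	fmt.Println("stopped cleanly")
}
```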

Fixes https://github.com/kubernetes/kubernetes/issues/59488
2018-04-12 17:59:10 +02:00
Jordan Liggitt
b562263427
Avoid data races in unit tests 2018-03-30 17:19:40 -04:00
jianglingxia
583e4b61f5 fix format and typo of NodeAllocatableCgroups 2018-03-28 17:29:23 +08:00
Kubernetes Submit Queue
0022bec3a2
Merge pull request #61525 from tianshapjq/place-consts-together
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

move the const to the place it should be

**What this PR does / why we need it**:
move the const to the place it should be

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #

**Special notes for your reviewer**:

**Release note**:

```release-note

```
2018-03-25 09:51:42 -07:00
hzxuzhonghu
70e45eccf2 Replace "golang.org/x/net/context" with "context" 2018-03-22 20:57:14 +08:00
tianshapjq
55921d0827 move the const to the place it should be 2018-03-22 14:20:15 +08:00
Derek Carr
f68f3ff783 Fix cpu cfs quota flag with pod cgroups 2018-03-16 15:27:11 -04:00
linyouchong
32c265f60c Skip checking when failSwapOn=false 2018-03-14 14:36:30 +08:00
Kubernetes Submit Queue
3d1331f297
Merge pull request #61044 from liggitt/subpath-master
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

subpath fixes

fixes #60813 for master / 1.10

```release-note
Fixes CVE-2017-1002101 - See https://issue.k8s.io/60813 for details
```
2018-03-12 11:51:59 -07:00
Kubernetes Submit Queue
a3f40dd8df
Merge pull request #60856 from jiayingz/race-fix
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Fixes the races around devicemanager Allocate() and endpoint deletion.

There is a race in predicateAdmitHandler Admit() that getNodeAnyWayFunc()
could get Node with non-zero deviceplugin resource allocatable for a
non-existing endpoint. That race can happen when a device plugin fails,
but is more likely when kubelet restarts as with the current registration
model, there is a time gap between kubelet restart and device plugin
re-registration. During this time window, even though devicemanager could
have removed the resource initially during GetCapacity() call, Kubelet
may overwrite the device plugin resource capacity/allocatable with the
old value when node update from the API server comes in later. This
could cause a pod to be started without proper device runtime config set.

To solve this problem, introduce endpointStopGracePeriod. When a device
plugin fails, don't immediately remove the endpoint but set stopTime in
its endpoint. During kubelet restart, create endpoints with stopTime set
for any checkpointed registered resource. The endpoint is considered to be
in stopGracePeriod if its stopTime is set. This allows us to track what
resources should be handled by devicemanager during the time gap.
When an endpoint's stopGracePeriod expires, we remove the endpoint and
its resource. This allows the resource to be exported through other channels
(e.g., by directly updating node status through API server) if there is such a
use case. Currently endpointStopGracePeriod is set to 5 minutes.

Given that an endpoint is no longer immediately removed upon disconnection,
mark all its devices unhealthy so that we can signal the resource allocatable
change to the scheduler to avoid scheduling more pods to the node.
When a device plugin endpoint is in stopGracePeriod, pods requesting the
corresponding resource will fail admission handler.

Tested:
Ran GPUDevicePlugin e2e_node test 100 times and all passed now.



**What this PR does / why we need it**:

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes https://github.com/kubernetes/kubernetes/issues/60176

**Special notes for your reviewer**:

**Release note**:

```release-note
Fixes the races around devicemanager Allocate() and endpoint deletion.
```
2018-03-12 02:50:13 -07:00
Jiaying Zhang
5514a1f4dd Fixes the races around devicemanager Allocate() and endpoint deletion.
There is a race in predicateAdmitHandler Admit() that getNodeAnyWayFunc()
could get Node with non-zero deviceplugin resource allocatable for a
non-existing endpoint. That race can happen when a device plugin fails,
but is more likely when kubelet restarts as with the current registration
model, there is a time gap between kubelet restart and device plugin
re-registration. During this time window, even though devicemanager could
have removed the resource initially during GetCapacity() call, Kubelet
may overwrite the device plugin resource capacity/allocatable with the
old value when node update from the API server comes in later. This
could cause a pod to be started without proper device runtime config set.

To solve this problem, introduce endpointStopGracePeriod. When a device
plugin fails, don't immediately remove the endpoint but set stopTime in
its endpoint. During kubelet restart, create endpoints with stopTime set
for any checkpointed registered resource. The endpoint is considered to be
in stopGracePeriod if its stopTime is set. This allows us to track what
resources should be handled by devicemanager during the time gap.
When an endpoint's stopGracePeriod expires, we remove the endpoint and
its resource. This allows the resource to be exported through other channels
(e.g., by directly updating node status through API server) if there is such a
use case. Currently endpointStopGracePeriod is set to 5 minutes.

Given that an endpoint is no longer immediately removed upon disconnection,
mark all its devices unhealthy so that we can signal the resource allocatable
change to the scheduler to avoid scheduling more pods to the node.
When a device plugin endpoint is in stopGracePeriod, pods requesting the
corresponding resource will fail admission handler.
2018-03-09 17:00:57 -08:00
Jan Safranek
5110db5087 Lock subPath volumes
Users must not be allowed to step outside the volume with subPath.
Therefore the final subPath directory must be "locked" somehow
and checked if it's inside volume.

On Windows, we lock the directories. On Linux, we bind-mount the final
subPath into /var/lib/kubelet/pods/<uid>/volume-subpaths/<container name>/<subPathName>,
so the user can't change it to a symlink once it's bind-mounted.
2018-03-05 09:14:44 +01:00
Jing Xu
b2e744c620 Promote LocalStorageCapacityIsolation feature to beta
The LocalStorageCapacityIsolation feature added a new resource type
ResourceEphemeralStorage "ephemeral-storage" so that this resource can
be allocated, limited, and consumed in the same way as CPU/memory. All
the features related to resource management (resource request/limit, quota, limitrange) are available for local ephemeral storage.

This local ephemeral storage represents the storage for the root file system, which will be consumed by containers' writable layers and logs. Some volumes such as emptyDir might also consume this storage.
2018-03-02 15:10:08 -08:00
Kubernetes Submit Queue
e31c8a2252
Merge pull request #60318 from jiayingz/api-change
Automatic merge from submit-queue (batch tested with PRs 59159, 60318, 60079, 59371, 57415). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Made a couple API changes to deviceplugin/v1beta1 to avoid future

incompatible API changes:
- Add GetDevicePluginOptions rpc call. This is needed when we switch
  from Registration service to probe-based plugin watcher.
- Change AllocateRequest and AllocateResponse to allow device requests
  from multiple containers in a pod. Currently only made mechanical
  change on the devicemanager and test code to cope with the API but
  still issues an Allocate call per container. We can modify the
  devicemanager in 1.11 to issue a single Allocate call per pod.
  The change will also facilitate incremental API change to communicate
  pod level information through Allocate rpc if there is such future
  need.



**What this PR does / why we need it**:
Made a couple API changes to deviceplugin/v1beta1 to avoid future incompatible API changes.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes https://github.com/kubernetes/kubernetes/issues/59370

**Special notes for your reviewer**:

**Release note**:

```release-note

```
2018-02-24 21:19:33 -08:00
Jiaying Zhang
07beac6004 Made a couple API changes to deviceplugin/v1beta1 to avoid future
incompatible changes:
- Add GetDevicePluginOptions rpc call. This is needed when we switch
  from Registration service to probe-based plugin watcher.
- Change AllocateRequest and AllocateResponse to allow device requests
  from multiple containers in a pod. Currently only made mechanical
  change on the devicemanager and test code to cope with the API but
  still issues an Allocate call per container. We can modify the
  devicemanager in 1.11 to issue a single Allocate call per pod.
  The change will also facilitate incremental API change to communicate
  pod level information through Allocate rpc if there is such future
  need.
2018-02-23 16:15:09 -08:00
Kubernetes Submit Queue
d5aba0c6ca
Merge pull request #59088 from YuxiJin-tobeyjin/codeClean-merge-logfAndFailnow-to-fatalf
Automatic merge from submit-queue (batch tested with PRs 60106, 59510, 60263, 60063, 59088). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

CodeClean, merge Logf And FailNow to Fatalf

**What this PR does / why we need it**:
Trivial changes to clean code, merge Logf And FailNow to Fatalf.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #

**Special notes for your reviewer**:

**Release note**:

```release-note
"NONE"
```
2018-02-23 02:59:55 -08:00
Kubernetes Submit Queue
e8dd75f37d
Merge pull request #58282 from vikaschoudhary16/per-container-allocate
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Invoke preStart RPC call before container start, if desired by plugin

**What this PR does / why we need it**:
1. Adds a new RPC `preStart` to device plugin API
2. Update `Register` RPC handling to receive a flag from the Device plugins as an indicator if kubelet should invoke `preStart` RPC before starting container.
3. Changes in device manager to invoke `preStart` before container start
4. Test case updates


**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #56943 #56307 


**Special notes for your reviewer**:

**Release note**:

```release-note
None
```
/sig node

/area hw-accelerators
/cc @jiayingz @RenaudWasTaken @vishh @ScorpioCPH @sjenning @derekwaynecarr @jeremyeder @lichuqiang @tengqm
2018-02-21 13:07:26 -08:00
vikaschoudhary16
e64517cd74 Migrate deviceplugin api from v1alpha to v1beta1 2018-02-21 01:26:20 -05:00
vikaschoudhary16
defcab81d5 Invoke PreStart RPC call before container start, if desired by plugin
Signed-off-by: vikaschoudhary16 <vichoudh@redhat.com>
2018-02-21 01:25:24 -05:00
ravisantoshgudimetla
a9a724d500 Test cases fix after path expansion 2018-02-20 14:23:09 -05:00
Kubernetes Submit Queue
96ec318718
Merge pull request #59842 from ixdy/update-rules_go-02-2018
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

 Update bazelbuild/rules_go, kubernetes/repo-infra, and gazelle dependencies

**What this PR does / why we need it**: updates our bazelbuild/rules_go dependency in order to bump everything to go1.9.4. I'm separating this effort into two separate PRs, since updating rules_go requires a large cleanup, removing an attribute from most build rules.

**Release note**:

```release-note
NONE
```
2018-02-19 22:23:05 -08:00
David Ashpole
960856f4e8 collect metrics on the /kubepods cgroup on-demand 2018-02-17 12:32:40 -08:00
Jeff Grafton
ef56a8d6bb Autogenerated: hack/update-bazel.sh 2018-02-16 13:43:01 -08:00
David Ashpole
b259543985 collect ephemeral storage capacity on initialization 2018-02-15 17:33:22 -08:00
Kubernetes Submit Queue
58dcf3c533
Merge pull request #59489 from pohly/master-tmpdir
Automatic merge from submit-queue (batch tested with PRs 59489, 59716). If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

devicemanager testing: dynamically choose tmp dir

This avoids the test issue #59488 that I was running into.

I believe I have a reasonable explanation for the race condition in that issue (TLDR: it's probably part of the gRPC API and k8s can only avoid the issue until a proper solution gets worked out together with gRPC). I therefore suggest merging this PR now, both because it avoids the issue and because using fixed tmp directories is something that should be avoided anyway.

/assign @jiayingz
2018-02-14 00:14:31 -08:00
Kubernetes Submit Queue
317853c90c
Merge pull request #59464 from dixudx/fix_all_typos
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

fix all the typos across the project

**What this PR does / why we need it**:
There are lots of typos across the project. We should avoid small PRs that each fix a few of those typos, which is time-consuming and inefficient.

This PR fixes all the typos currently in the project. With #59463, typos can be avoided when new PRs get merged.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #

**Special notes for your reviewer**:
/sig testing
/area test-infra
/sig release
/cc @ixdy 
/assign @fejta 

**Release note**:

```release-note
None
```
2018-02-10 22:12:45 -08:00
Di Xu
48388fec7e fix all the typos across the project 2018-02-11 11:04:14 +08:00
Patrick Ohly
0d828e061b devicemanager testing: time out sooner
Each individual step should not take longer than a second.
Suggested by Vikas Choudhary (https://github.com/kubernetes/kubernetes/pull/59489#discussion_r167205672).
2018-02-09 20:51:54 +01:00
Kubernetes Submit Queue
76e6da25fa
Merge pull request #59481 from rojkov/dm-unittests
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

devicemanager: increase code coverage of endpoint's unit test

Particularly cover the code path when an unhealthy device
becomes healthy.
2018-02-09 10:35:22 -08:00
Patrick Ohly
1325c2f8be devicemanager testing: dynamically choose tmp dir
Hard-coding the tests to use /tmp/device_plugin for sockets is
problematic because it prevents running tests in parallel on the same
machine (perhaps because there are multiple developers, perhaps
because testing is done independently on different code checkouts).
/tmp/device_plugin also was not removed after testing.

This is probably not that relevant. But more importantly, this change
also fixes https://github.com/kubernetes/kubernetes/issues/59488.
"make test" failed in TestDevicePluginReRegistration because something
removed /tmp/device_plugin/device-plugin.sock while something else
tried to connect to it:

2018/02/07 14:34:39 Starting to serve on /tmp/device_plugin/device-plugin.sock
[pid 29568] connect(14, {sa_family=AF_UNIX, sun_path="/tmp/device_plugin/server.sock"}, 33) = 0
[pid 29568] unlinkat(AT_FDCWD, "/tmp/device_plugin/server.sock", 0) = 0
[pid 29568] unlinkat(AT_FDCWD, "/tmp/device_plugin/device-plugin.sock", 0) = 0
[pid 29568] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=29568, si_uid=1000} ---
[pid 29568] connect(6, {sa_family=AF_UNIX, sun_path="/tmp/device_plugin/device-plugin.sock"}, 40) = -1 ENOENT (No such file or directory)
E0207 14:34:39.961321   29568 endpoint.go:117] listAndWatch ended unexpectedly for device plugin mock with error rpc error: code = Unavailable desc = transport is closing
strace: Process 29623 attached
[pid 29574] connect(3, {sa_family=AF_UNIX, sun_path="/tmp/device_plugin/device-plugin.sock"}, 40) = -1 ENOENT (No such file or directory)
[pid 29623] connect(3, {sa_family=AF_UNIX, sun_path="/tmp/device_plugin/device-plugin.sock"}, 40) = -1 ENOENT (No such file or directory)
[pid 29574] connect(3, {sa_family=AF_UNIX, sun_path="/tmp/device_plugin/device-plugin.sock"}, 40) = -1 ENOENT (No such file or directory)
E0207 14:34:49.961324   29568 endpoint.go:60] Can't create new endpoint with path /tmp/device_plugin/device-plugin.sock err failed to dial device plugin: context deadline exceeded
E0207 14:34:49.961390   29568 manager.go:340] Failed to dial device plugin with request &RegisterRequest{Version:v1alpha2,Endpoint:device-plugin.sock,ResourceName:fake-domain/resource,}: failed to dial device plugin: context deadline exceeded
panic: test timed out after 2m0s

It's not entirely certain which code was to blame for these unlinkat()
calls (perhaps some cleanup code from a previous test running in a
goroutine?) but this no longer happened after switching to per-test
socket directories.
2018-02-09 14:01:13 +01:00
Dmitry Rozhkov
3175a687a0 devicemanager: increase code coverage of endpoint's unit test
Particularly cover the code path when an unhealthy device
becomes healthy.
2018-02-07 12:29:48 +02:00
Lee Verberne
e10042d22f Increment CRI version from v1alpha1 to v1alpha2
This also incorporates the version string into the package name so
that incompatible versions will fail to connect.

Arbitrary choices:
- The proto3 package name is runtime.v1alpha2. The proto compiler
  normally translates this to a go package of "runtime_v1alpha2", but
  I renamed it to "v1alpha2" for consistency with existing packages.
- kubelet/apis/cri is used as "internalapi". I left it alone and put the
  public "runtimeapi" in kubelet/apis/cri/runtime.
2018-02-07 09:06:26 +01:00
Kubernetes Submit Queue
056e9ecc43
Merge pull request #58941 from vikaschoudhary16/test-allocate
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Add unit test for endpoint allocate

**What this PR does / why we need it**:
Adds a unit test for covering `allocate` function at endpoint.


**Release note**:

```release-note
None
```

/kind testing
/area hw-accelerators
/cc @jiayingz @vishh @derekwaynecarr @RenaudWasTaken @resouer @ConnorDoyle
2018-02-06 17:19:41 -08:00
Kubernetes Submit Queue
c02b784b76
Merge pull request #58172 from NVIDIA/annotations
Automatic merge from submit-queue (batch tested with PRs 58184, 59307, 58172). If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Add annotations to the device plugin API

**What this PR does / why we need it**:

**Which issue(s) this PR fixes** : Related to #56649 but does not fix it

This adds the ability for the device plugins to annotate containers.
Product wise, this allows the NVIDIA device plugin to support CRI-O (which allows hooks through container annotations).

**Special notes for your reviewer**:
/area hw-accelerators
/cc @vishh @jiayingz @vikaschoudhary16 

I'm wondering if it would make sense to fire a blank call to `newContainerAnnotations` at the start of the deviceplugin to get Annotations that are forbidden.
Current behavior is that any Annotations that conflict with Kubelet will be overwritten by Kubelet.

**Release note**:
```release-note
NONE
```
2018-02-05 13:50:35 -08:00
Derek Carr
4afc0c8052 kubelet ignores hugepages if hugetlb is not enabled 2018-02-05 13:07:59 -05:00
vikaschoudhary16
abfb99645b Add unit test for endpoint allocate 2018-02-05 00:53:07 -05:00
Renaud Gaubert
db537e5954 Add Annotations from the deviceplugin to the runtime 2018-02-03 19:53:20 +01:00
tianshapjq
21702e3c39 TODO has already been implemented 2018-02-01 14:38:29 +08:00
Kubernetes Submit Queue
c817765b0e
Merge pull request #58445 from hanxiaoshuai/typo
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

fix some typos in comments

**What this PR does / why we need it**:

Fixes # fix some typos in comments
2018-01-30 19:44:44 -08:00
YuxiJin-tobeyjin
af6b4e39c2 codeClean-merge-logfAndFailnow-to-fatalf 2018-01-31 11:39:31 +08:00
tianshapjq
e0f15bf5bf Len() is already int 2018-01-29 09:01:23 +08:00
Kubernetes Submit Queue
bf111161b7
Merge pull request #57973 from dims/set-pids-limit-at-pod-level
Automatic merge from submit-queue (batch tested with PRs 57973, 57990). If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Set pids limit at pod level

**What this PR does / why we need it**:

Add a new Alpha Feature to set a maximum number of pids per Pod.
This is to allow the use case where cluster administrators wish
to limit the pids consumed per pod (for example, when running a CI system).

By default, we do not set any maximum limit. If an administrator wants
to enable this, they should enable `SupportPodPidsLimit=true` in the
`--feature-gates=` parameter to kubelet and specify the limit using the
`--pod-max-pids` parameter.

The limit set is the total count of all processes running in all
containers in the pod.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #43783

**Special notes for your reviewer**:

**Release note**:

```release-note
New alpha feature to limit the number of processes running in a pod. Cluster administrators will be able to place limits by using the new kubelet command line parameter --pod-max-pids. Note that since this is an alpha feature they will need to enable the "SupportPodPidsLimit" feature.
```
2018-01-25 18:29:31 -08:00
Connor Doyle
e5667cf426 Rename package deviceplugin => devicemanager. 2018-01-24 22:32:43 -08:00
Kubernetes Submit Queue
62616d79ad
Merge pull request #58053 from tianshapjq/nit-errUnsupportedVersion
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

typo of errUnsuportedVersion

**What this PR does / why we need it**:
typo of errUnsuportedVersion in pkg/kubelet/cm/deviceplugin/types.go

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #

**Special notes for your reviewer**:

**Release note**:

```release-note
NONE
```
2018-01-19 03:26:34 -08:00
Kubernetes Submit Queue
44d0ba29d3
Merge pull request #56960 from islinwb/remove_unused_code_ut_pkg
Automatic merge from submit-queue (batch tested with PRs 53631, 56960). If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Remove unused code in UT files in pkg/

**What this PR does / why we need it**:
Remove unused code in UT files in pkg/ .

**Release note**:

```release-note
NONE
```
2018-01-18 02:41:29 -08:00
hangaoshuai
005f8c4926 fix some typos in comments 2018-01-18 17:07:51 +08:00
vikaschoudhary16
9c847fc4d6 Call Dial in blocking mode 2018-01-16 10:50:17 -05:00
linweibin
fa8afc1d39 Remove unused code in UT files in pkg/ 2018-01-15 16:02:35 +08:00
Kubernetes Submit Queue
9007df35b9
Merge pull request #55921 from ScorpioCPH/fix-endpoint-ut
Automatic merge from submit-queue (batch tested with PRs 58216, 58193, 53033, 58219, 55921). If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Fix device plugin endpoint UT

**What this PR does / why we need it**:
Fix some issues in device plugin endpoint UT.

**Which issue(s) this PR fixes**:
Fixes #55920

**Special notes for your reviewer**:

@jiayingz @RenaudWasTaken @lichuqiang PTAL.

/sig node

**Release note**:

```release-note
None
```
2018-01-13 03:34:57 -08:00
Kubernetes Submit Queue
f2e46a2147
Merge pull request #57266 from vikaschoudhary16/unhealthy_device
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Handle Unhealthy devices

Update node capacity with sum of both healthy and unhealthy devices.
Node allocatable reflects only healthy devices.



**What this PR does / why we need it**:
Currently node capacity only reflects healthy devices. Unhealthy devices are ignored totally while updating node status. This PR handles unhealthy devices while updating node status. 
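
The accounting described above can be sketched as follows. This is only an illustration under assumed names: the `device` type and the `summarize` function are hypothetical and are not the device manager's actual types.

```go
package main

import "fmt"

// device is a hypothetical, simplified stand-in for a plugin-reported device.
type device struct {
	ID      string
	Healthy bool
}

// summarize returns the node capacity (healthy plus unhealthy devices) and
// the node allocatable (healthy devices only) for one resource.
func summarize(devs []device) (capacity, allocatable int) {
	for _, d := range devs {
		capacity++ // every known device counts toward capacity
		if d.Healthy {
			allocatable++ // only healthy devices can be handed to pods
		}
	}
	return capacity, allocatable
}

func main() {
	devs := []device{{"dev0", true}, {"dev1", false}, {"dev2", true}}
	c, a := summarize(devs)
	fmt.Printf("capacity=%d allocatable=%d\n", c, a) // capacity=3 allocatable=2
}
```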

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #57241

**Special notes for your reviewer**:

**Release note**:
<!--  Write your release note:
Handle Unhealthy devices

```release-note
Handle Unhealthy devices
```
/cc @tengqm @ConnorDoyle @jiayingz @vishh @jeremyeder @sjenning @resouer @ScorpioCPH @lichuqiang @RenaudWasTaken @balajismaniam 

/sig node
2018-01-12 19:55:54 -08:00
Penghao Cen
b96c383ef7 Check grpc server ready properly 2018-01-13 05:47:49 +08:00
Penghao Cen
90bc1265cf Fix endpoint not work issue 2018-01-12 20:09:07 +08:00
Davanum Srinivas
ecd6361ff0 Set pids limit at pod level
Add a new Alpha Feature to set a maximum number of pids per Pod.
This is to allow the use case where cluster administrators wish
to limit the pids consumed per pod (for example, when running a CI system).

By default, we do not set any maximum limit. If an administrator wants
to enable this, they should enable `SupportPodPidsLimit=true` in the
`--feature-gates=` parameter to kubelet and specify the limit using the
`--pod-max-pids` parameter.

The limit set is the total count of all processes running in all
containers in the pod.
2018-01-11 21:22:38 -05:00
Penghao Cen
671c4eb2b7 Add e2e test logic for device plugin 2018-01-11 14:41:45 +08:00
Penghao Cen
dc5384a139 Don't rewrite device health 2018-01-11 14:18:13 +08:00
tianshapjq
e8005face7 typo of errUnsuportedVersion 2018-01-10 15:47:11 +08:00
vikaschoudhary16
e9cf3f1ac4 Handle Unhealthy devices
Update node capacity with sum of both healthy and unhealthy devices.
Node allocatable reflects only healthy devices.
2018-01-09 11:38:48 -05:00
Jonathan Basseri
85c5862552 Fix scheduler refs in BUILD files.
Update references to moved scheduler code.
2018-01-05 15:05:01 -08:00
Jonathan Basseri
30b89d830b Move scheduler code out of plugin directory.
This moves plugin/pkg/scheduler to pkg/scheduler and
plugin/cmd/kube-scheduler to cmd/kube-scheduler.

Bulk of the work was done with gomvpkg, except for kube-scheduler main
package.
2018-01-05 15:05:01 -08:00
Kubernetes Submit Queue
4d215fd235
Merge pull request #56611 from tianshapjq/testcase-cgroup_manager_linux.go
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

new testcase to cgroup_manager_linux.go

a new test case to adaptName(), for testing "cgroupManagerType != libcontainerSystemd"
2017-12-28 11:11:47 -08:00
Kubernetes Submit Queue
a4eb2f96d0
Merge pull request #57610 from vikaschoudhary16/remove-redundant-sleep
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Remove redundant sleep from ReRegistration unit test case

/kind cleanup
/sig node

**What this PR does / why we need it**:
Once upon a time, there was a race in the device plugin registration logic.  At that time, [list()](5cac9fc984/pkg/kubelet/deviceplugin/manager.go (L206)) and [listAndWatch()](5cac9fc984/pkg/kubelet/deviceplugin/manager.go (L224)) used to be separate functions. Race was there for taking manager.mutex lock from two places. [One, from within the m.addEndpoint()](5cac9fc984/pkg/kubelet/deviceplugin/manager.go (L214)) and the [second, from within m.Devices()](5cac9fc984/pkg/kubelet/deviceplugin/manager.go (L137)).  This race was making `TestDevicePluginReRegistration` flaky as explained below.
 	
```
1.     p1.Register(socketName, testResourceName)
2.  	// Wait for the first callback to be issued.
3.  	<-callbackChan
4.        devices := m.Devices()  
```
* L#1 eventually leads to an **asynchronous** invocation of m.addEndpoint(); call it **thread1**.
* L#3 holds the test case execution till the [callback gets invoked](5cac9fc984/pkg/kubelet/deviceplugin/endpoint.go (L108)). This means test case execution waits on channel till the **thread1**  reaches the point where [e.list() call completes in the addEndpoint.](5cac9fc984/pkg/kubelet/deviceplugin/manager.go (L206)) 
* L#4 triggers a new thread. thread1 and this new thread both race for m.mutex.Lock(): the former in addEndpoint() and the latter in m.Devices(). If m.Devices() wins the race, the test case fails, because the endpoint gets added to the manager only after addEndpoint() takes mutex.Lock().

To deal with this flake, we added `Sleep` between L#3 and L#4. The `Sleep` gave addEndpoint() some extra time, making thread1 win the race each time.

The race described above was fixed some time ago in this PR:
[Deviceplugin refactoring: merge func list and listwatch in endpoint into one](https://github.com/kubernetes/kubernetes/pull/52149)
With the above PR, the callback function is invoked from e.run(), which makes sure that the test case waits on the channel until the endpoint is added and the devices are updated.
The race described above no longer exists, so this PR removes the redundant sleeps from the test case.
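
The ordering that makes the sleep unnecessary can be sketched roughly as below. The `manager` type here is a simplified, hypothetical stand-in, not the real device plugin manager: the point is only that the callback fires after the endpoint has been stored under the mutex, so a test that waits on the channel cannot observe a missing endpoint.

```go
package main

import (
	"fmt"
	"sync"
)

// manager is a hypothetical, stripped-down stand-in for the device manager.
type manager struct {
	mu        sync.Mutex
	endpoints map[string][]string
	callback  func(resource string)
}

// addEndpoint stores the devices under the lock and only then fires the
// callback, so whoever has seen the callback also sees the endpoint.
func (m *manager) addEndpoint(resource string, devs []string) {
	m.mu.Lock()
	m.endpoints[resource] = devs
	m.mu.Unlock()
	if m.callback != nil {
		m.callback(resource)
	}
}

// Devices returns a copy of the registered devices per resource.
func (m *manager) Devices() map[string][]string {
	m.mu.Lock()
	defer m.mu.Unlock()
	out := make(map[string][]string, len(m.endpoints))
	for k, v := range m.endpoints {
		out[k] = append([]string(nil), v...)
	}
	return out
}

func main() {
	callbackChan := make(chan string)
	m := &manager{endpoints: map[string][]string{}}
	m.callback = func(r string) { callbackChan <- r }

	// Registration runs asynchronously, like thread1 in the description.
	go m.addEndpoint("fake-domain/resource", []string{"dev0", "dev1"})

	// Waiting on the channel replaces the Sleep: once the callback has
	// fired, the endpoint is already visible to Devices().
	<-callbackChan
	fmt.Println(len(m.Devices()["fake-domain/resource"])) // 2
}
```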

Tested:
go test -race -count 500 k8s.io/kubernetes/pkg/kubelet/cm/deviceplugin -run TestDevicePluginReRegistration  -timeout 5h

Related #52616 #56026 

**Special notes for your reviewer**:

**Release note**:

```release-note
None
```
/cc @vishh @derekwaynecarr @jiayingz @RenaudWasTaken @lichuqiang @ScorpioCPH @tengqm @mindprince @ConnorDoyle @jeremyeder
2017-12-27 14:53:21 -08:00
vikaschoudhary16
5d10dcd983 Remove redundant sleep from ReRegistration unit test case 2017-12-27 03:02:21 -05:00
Kubernetes Submit Queue
e67294105a
Merge pull request #57274 from vikaschoudhary16/reviewr
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Add vikaschoudhary16 as reviewer in pkg/kubelet/cm/deviceplugin

**What this PR does / why we need it**:
Add github user vikaschoudhary16 (me) to the reviewers list for pkg/kubelet/cm/deviceplugin

**Special notes for your reviewer**:
I would like to help with the review load in this package.

```release-note
None
```
/sig node
/cc @vishh @jiayingz @derekwaynecarr @mindprince @RenaudWasTaken @ConnorDoyle
2017-12-25 08:43:10 -08:00
Kubernetes Submit Queue
7dd82519da
Merge pull request #57369 from vikaschoudhary16/revert-to-limits
Automatic merge from submit-queue (batch tested with PRs 57591, 57369). If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Revert back #57278

**What this PR does / why we need it**:
This PR reverts to the behavior of scanning Limits.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Related #
#57276
#57170
**Special notes for your reviewer**:

**Release note**:

```release-note
None
```
/sig node

/cc @vishh @ConnorDoyle @jiayingz
2017-12-24 23:37:37 -08:00
Kubernetes Submit Queue
92e1028ac7
Merge pull request #57591 from vikaschoudhary16/fix-race
Automatic merge from submit-queue (batch tested with PRs 57591, 57369). If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Fix a race in the endpoint.go

**What this PR does / why we need it**:
This PR fixes a race in the endpoint.go

Fixes #56026


-->
```release-note
None
```

/sig node
/cc @RenaudWasTaken @ConnorDoyle @jiayingz @mindprince @ScorpioCPH @resouer @tengqm @vishh
2017-12-24 23:37:34 -08:00
Jeff Grafton
efee0704c6 Autogenerate BUILD files 2017-12-23 13:12:11 -08:00
vikaschoudhary16
cc4d2cbe9d Fix a race in the endpoint.go 2017-12-23 03:02:33 -05:00
vikaschoudhary16
8749c5c989 Revert back #57278 2017-12-22 18:55:53 -05:00
Kubernetes Submit Queue
eddb00e7c6
Merge pull request #57247 from dixudx/cm_return_err
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

cpumanager: Propagate error up instead panic

**What this PR does / why we need it**:

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #57239

**Special notes for your reviewer**:
/assign @sjenning 
**Release note**:

```release-note
None
```
2017-12-18 10:18:09 -08:00
Kubernetes Submit Queue
0a55f4105c
Merge pull request #57278 from vikaschoudhary16/limit
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Fix device manager to scan resources.Requests

**What this PR does / why we need it**:
This PR makes the device manager scan resources.Requests from the container spec. Currently
it scans resources.Limits. For extended resources, resources.Limits is not required to be present in the container spec, and if Limits are present, validation logic ensures that Limits always equal Requests.

Fixes #57276 

**Special notes for your reviewer**:

**Release note**:

```release-note
None
```
/sig node

/cc @ConnorDoyle @vishh @jiayingz @RenaudWasTaken @tengqm @resouer @mindprince
2017-12-17 23:43:59 -08:00
Di Xu
d474b86e05 Propagate error up instead panic 2017-12-18 14:05:06 +08:00
vikaschoudhary16
bf1fb46347 Look for requested resources in the Requests 2017-12-17 22:56:45 -05:00
tianshapjq
7a43f736c4 correct the annotations in container_manager.go 2017-12-18 09:01:36 +08:00
vikaschoudhary16
8c51d235d6 Refactor TestPodContainerDeviceAllocation to make it readable and extensible 2017-12-16 20:32:08 -05:00
Kubernetes Submit Queue
a902959544
Merge pull request #56911 from WanLinghao/projected_test_fix
Automatic merge from submit-queue (batch tested with PRs 56650, 55813, 56911, 56921, 56871). If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

fix deviceplugin test file create leak file problem

When `make test` is executed, this test file creates a file named "kubelet_internal_checkpoint" in the k8s directory and does not delete it.

This patch fixes this error.




**What this PR does / why we need it**:

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #56365

**Special notes for your reviewer**:

**Release note**:

```release-note

```
2017-12-16 12:10:48 -08:00
Kubernetes Submit Queue
8415e0c608
Merge pull request #56661 from xiangpengzhao/move-kubelet-constants
Automatic merge from submit-queue (batch tested with PRs 56410, 56707, 56661, 54998, 56722). If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Move some kubelet constants to a common place

**What this PR does / why we need it**:
More context, see: https://github.com/kubernetes/kubernetes/issues/56516
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #56516
[thanks @ixdy for verifying this!]

**Special notes for your reviewer**:
@ixdy how can I verify #56516 against this locally?

/cc @ixdy @mtaufen 

**Release note**:

```release-note
NONE
```
2017-12-16 05:46:35 -08:00
vikaschoudhary16
a71d1680d4 Add vikaschoudhary16 as reviewer in pkg/kubelet/cm/deviceplugin 2017-12-16 08:19:36 -05:00
Kubernetes Submit Queue
fa0a1a3d7a
Merge pull request #56337 from mindprince/container-manager-cleanup
Automatic merge from submit-queue (batch tested with PRs 56337, 56546, 56550, 56633, 56635). If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Remove redundant code in container manager.

- Reuse stub implementations from unsupported implementations.
- Delete test file that didn't contain any tests.

**Release note**:
```release-note
NONE
```

/kind cleanup
/sig node
2017-12-16 01:53:42 -08:00
Kubernetes Submit Queue
578f3db8d5
Merge pull request #55382 from vikaschoudhary16/checkpoint
Automatic merge from submit-queue (batch tested with PRs 57172, 55382, 56147, 56146, 56158). If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Use file store utility for device plugin checkpointing

Partially address issue #54088
cc @sjenning @jeremyeder @jiayingz @vishh 

/sig node
2017-12-14 12:38:13 -08:00
Kubernetes Submit Queue
7908e96539
Merge pull request #56191 from ConnorDoyle/cpu-manager-panic-state-init-error
Automatic merge from submit-queue (batch tested with PRs 54410, 56184, 56199, 56191, 56231). If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

CPU Manager panics on state initialization error.

**What this PR does / why we need it**:

- CPU Manager panics on state initialization error.
- Update unit tests accordingly.
- Minor related cleanup in `state_file.go`.

**Special notes for your reviewer**:

**Release note**:
```release-note
NONE
```

/kind bug
/sig node
/priority important-soon
Blocks #52031
/assign @balajismaniam 
cc @flyingcougar
2017-12-14 05:33:17 -08:00
Kubernetes Submit Queue
e9a9da8aa3
Merge pull request #54410 from intelsdi-x/cpu-reconcile-state
Automatic merge from submit-queue (batch tested with PRs 54410, 56184, 56199, 56191, 56231). If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Cpu manager reconcile loop - restore state

**What this PR does / why we need it**:
Cpu manager reconcile loop can add orphaned containers to `State` by calling `policy.AddContainer()`
Previous PR: #54409 
e2e tests PR: #53378

Blocked by #56191
2017-12-14 05:33:08 -08:00
Kubernetes Submit Queue
65a7ecf147
Merge pull request #57045 from ConnorDoyle/add-connor-containermanager-owners
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here (https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Add ConnorDoyle as approver in /pkg/kubelet/cm.

**What this PR does / why we need it**:
- Add github user `ConnorDoyle` (me) to the approvers list for `pkg/kubelet/cm`.

**Special notes for your reviewer**:
I would like to help with the review load in this package. I believe I have demonstrated good stewardship of sub-packages in this part of the code base.

```release-note
NONE
```

/sig node
/kind cleanup
/assign @derekwaynecarr
2017-12-13 19:32:53 -08:00
WanLinghao
3e7e4ab397 old test file will create a leak file in current directory.
This patch fixes this.
	modified:   pkg/kubelet/cm/deviceplugin/manager_test.go
2017-12-07 11:57:17 +08:00
tianshapjq
3945a66f7a new testcase helpers_linux.go 2017-12-07 10:26:37 +08:00
Connor Doyle
4207b4fd2c Add ConnorDoyle as approver in /pkg/kubelet/cm. 2017-12-06 09:05:59 -06:00
Jiaying Zhang
d4244f3ded Re-uses device plugin resources allocated to init containers.
Implements option 2 mentioned in
https://github.com/kubernetes/kubernetes/issues/56022#issuecomment-348286184
2017-12-04 22:01:28 -08:00
xiangpengzhao
8048823d0e Auto generated BUILD files. 2017-12-01 11:24:41 +08:00
xiangpengzhao
1f2262e6b0 Move some kubelet constants to a common place. 2017-12-01 11:24:04 +08:00
tianshapjq
0cc6a4d937 new testcase to cgroup_manager_linux.go 2017-11-30 14:14:59 +08:00
Szymon Scharmach
552e4d3a9d Cpu manager reconcile loop can restore state 2017-11-27 11:22:21 +01:00
vikaschoudhary16
de358fb21f Use file store utility for device plugin check-pointing 2017-11-24 08:41:11 -05:00
Rohit Agarwal
4b216f7cd9 Remove redundant code in container manager.
- Reuse stub implementations from unsupported implementations.
- Delete test file that didn't contain any tests.
2017-11-24 03:15:55 -08:00
Connor Doyle
4f185e6b7f CPU Manager panics on state initialization error.
- Update unit tests accordingly.
- Minor related cleanup in state_file.go
2017-11-22 10:25:38 -08:00
Jing Xu
a66ee2eb3f Add pod-level metric for CPU and memory stats
This PR adds the pod-level metrics for CPU and memory stats. cAdvisor
can get all pod cgroup information so we can add this pod-level CPU and
memory stats information from the corresponding pod cgroup
2017-11-22 09:25:23 -08:00
Jiaying Zhang
048bafdd0b Adds device plugin registration count metric and allocation latency metric. 2017-11-21 13:44:10 -08:00
Jiaying Zhang
1eb4e79453 Extends deviceplugin to gracefully handle full device plugin lifecycle.
- Instead of using cm.capacity field to communicate device plugin resource
capacity, this PR changes to use an explicit cm.GetDevicePluginResourceCapacity()
function that returns device plugin resource capacity as well as any inactive
device plugin resource. Kubelet syncNodeStatus calls this function during its
periodic run to update node status capacity and allocatable. After this call,
device plugin can remove the inactive device plugin resource from its allDevices
field as the update is already pushed to API server.
- Extends device plugin checkpoint data to record registered resources
so that we can finish resource removing even upon kubelet restarts.
- Passes sourcesReady from kubelet to device plugin to avoid removing
inactive pods during grace period of kubelet restart.
2017-11-20 23:40:14 -08:00
Niklas Q. Nielsen
b16bfc768d Merging handler into manager API 2017-11-20 21:37:46 +00:00
Kubernetes Submit Queue
0b1d023aa7
Merge pull request #55884 from mpolednik/dpi-race-fix
Automatic merge from submit-queue (batch tested with PRs 55839, 54495, 55884, 55983, 56069).

deviceplugin: fix race when multiple plugins are registered

**What this PR does / why we need it**:
When registering multiple device plugins to Kubelet concurrently, there exists a race that crashes the Kubelet.

Consider two plugins: D1 and D2. The call order is roughly:

D1 -> manager.go:register -> endpoint.go:listAndWatch -> device_plugin_handler.go:(*D1).callback
D2 -> manager.go:register -> endpoint.go:listAndWatch -> device_plugin_handler.go:(*D2).callback

The callback function accesses HandlerImpl's allDevices map that maps (resourceName -> DeviceID). If both plugins reach these accesses at the same time, Kubelet crashes with "fatal error: concurrent map read and map write".

This can be solved by making sure the handler is locked while allDevices is being updated. The fix is needed to avoid Kubelet crashes when multiple device plugins try to register with Kubelet at the same moment. This occurs frequently when a single binary tries to register itself as multiple plugins.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:

**Special notes for your reviewer**:

**Release note**:
```release-note
NONE
```
2017-11-20 13:08:09 -08:00
Kubernetes Submit Queue
869b5ab191
Merge pull request #55841 from ConnorDoyle/cpuman-file-state-for-none-policy
Automatic merge from submit-queue (batch tested with PRs 55841, 55948, 55945).

CPU Manager: file state for all policies

**What this PR does / why we need it**:

Before this change, the new file-backed state was only enabled for the static CPU manager policy. This patch enables persistent state for all policies.

This PR fixes #55736 and the potential CPU resource leak described in that issue.

**Release note**:

```release-note
NONE
```

/kind bug
/sig node
/assign @balajismaniam
2017-11-18 14:10:12 -08:00
Kubernetes Submit Queue
c60b35bcd3
Merge pull request #52977 from yanxuean/improvecgroup
Automatic merge from submit-queue (batch tested with PRs 54837, 55970, 55912, 55898, 52977).

Improve kubelet cgroup

**What this PR does / why we need it**:
1. Use the `cgroupRoot` argument, not `nodeConfig.CgroupRoot`.
    Using both the `cgroupRoot` argument and `nodeConfig.CgroupRoot` in function NewQOSContainerManager is confusing.
2. Improve cgroupManager in qosContainerManager.
3. Improve the type of the `cgroupRoot` argument in NewQOSContainerManager.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #

**Special notes for your reviewer**:

**Release note**:

```release-note
```
2017-11-18 13:13:28 -08:00
Michael Taufen
1085b6f730 Lift embedded structure out of eviction-related KubeletConfiguration fields
- Changes the following KubeletConfiguration fields from `string` to
`map[string]string`:
  - `EvictionHard`
  - `EvictionSoft`
  - `EvictionSoftGracePeriod`
  - `EvictionMinimumReclaim`
- Adds flag parsing shims to maintain Kubelet's public flags API, while
enabling structured input in the file API.
- Also removes `kubeletconfig.ConfigurationMap`, which was an ad-hoc flag
parsing shim living in the kubeletconfig API group, and replaces it
with the `MapStringString` shim introduced in this PR. Flag parsing
shims belong in a common place, not in the kubeletconfig API.
I manually audited these to ensure that this wouldn't cause errors
parsing the command line for syntax that would have previously been
error free (`kubeletconfig.ConfigurationMap` was unique in that it
allowed keys to be provided on the CLI without values. I believe this was
done in `flags.ConfigurationMap` to facilitate the `--node-labels` flag,
which rightfully accepts value-free keys, and that this shim was then
just copied to `kubeletconfig`). Fortunately, the affected fields
(`ExperimentalQOSReserved`, `SystemReserved`, and `KubeReserved`) expect
non-empty strings in the values of the map, and as a result passing the
empty string is already an error. Thus requiring keys shouldn't break
anyone's scripts.
- Updates code and tests accordingly.

Regarding eviction operators, directionality is already implicit in the
signal type (for a given signal, the decision to evict will be made when
crossing the threshold from either above or below, never both). There is
no need to expose an operator, such as `<`, in the API. By changing
`EvictionHard` and `EvictionSoft` to `map[string]string`, this PR
simplifies the experience of working with these fields via the
`KubeletConfiguration` type. Again, flags stay the same.
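
As a rough illustration of the string-to-map change described above, the sketch below converts a flag-style threshold list into a `map[string]string`. The function name `parseEvictionFlag` and the exact input syntax (`signal<quantity`, comma separated) are assumptions made for this example; it is not the shim introduced by the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// parseEvictionFlag is a hypothetical helper that turns a flag value such as
// "memory.available<100Mi,nodefs.available<10%" into a map keyed by signal.
// The operator is dropped because the direction is implied by the signal.
func parseEvictionFlag(value string) (map[string]string, error) {
	out := make(map[string]string)
	if strings.TrimSpace(value) == "" {
		return out, nil
	}
	for _, pair := range strings.Split(value, ",") {
		parts := strings.SplitN(pair, "<", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("malformed threshold %q, expected signal<quantity", pair)
		}
		signal := strings.TrimSpace(parts[0])
		quantity := strings.TrimSpace(parts[1])
		if signal == "" || quantity == "" {
			return nil, fmt.Errorf("malformed threshold %q, expected signal<quantity", pair)
		}
		out[signal] = quantity
	}
	return out, nil
}
```

With a structure like this, a config file can carry the same thresholds directly as key/value pairs while the command-line syntax stays unchanged.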

Other things:
- There is another flag parsing shim, `flags.ConfigurationMap`, from the
shared flag utility. The `NodeLabels` field still uses
`flags.ConfigurationMap`. This PR moves the allocation of the
`map[string]string` for the `NodeLabels` field from
`AddKubeletConfigFlags` to the defaulter for the external
`KubeletConfiguration` type. Flags are layered on top of an internal
object that has undergone conversion from a defaulted external object,
which means that previously the mere registration of flags would have
overwritten any previously-defined defaults for `NodeLabels` (fortunately
there were none).
2017-11-16 18:35:13 -08:00
Martin Polednik
6e3f8f3890 deviceplugin: fix race when multiple plugins are registered
Signed-off-by: Martin Polednik <mpolednik@redhat.com>
2017-11-16 15:20:00 +01:00
Connor Doyle
c95ee34234 Use file-backed state for all cpumanager policies
- Add unit test to verify policy name mismatch behavior.
2017-11-15 22:38:11 -08:00
Kubernetes Submit Queue
e99544d018
Merge pull request #54409 from intelsdi-x/cpu-enable-state-file
Automatic merge from submit-queue (batch tested with PRs 55764, 55683, 55468, 54409, 55546).

Enable file-backed state in static policy

**What this PR does / why we need it**:
Enables file-backed `State` in the `static policy` and the CPU manager, plus tests.
Upon policy start, the state read from the file is validated against the policy's assumptions. In case of any error, the state is cleared.

Previous PR: #54408
Next PR: #54409
2017-11-15 22:16:05 -08:00
Kubernetes Submit Queue
6f35d49079
Merge pull request #52149 from lichuqiang/combineListwatch
Automatic merge from submit-queue.

Deviceplugin refactoring: merge func list and listwatch in endpoint into one

**What this PR does / why we need it**:
Merge the list and listwatch functions in endpoint into one, since the list function is never called on its own.

**Which issue this PR fixes**
fixes #51993
Part2

**Special notes for your reviewer**:
/cc @jiayingz @RenaudWasTaken @vishh

**Release note**:

```release-note
NONE
```
2017-11-15 16:56:51 -08:00
Jiaying Zhang
93916242f7 Adds jiayingz@ and vish@ as approvers for pkg/kubelet/cm/deviceplugin/. 2017-11-14 15:27:02 -08:00
Michał Stachowski
809ac834a0 Cpu manager file state tests 2017-11-14 18:26:41 +01:00
Szymon Scharmach
7e7301ffaf Enable file state in static policy 2017-11-14 18:25:58 +01:00
lichuqiang
4fa0fa5ad1 pass devices of previous endpoint into re-registered one to avoid potential orphaned devices upon re-registration 2017-11-14 16:43:19 +08:00
Kubernetes Submit Queue
e2c02f425a
Merge pull request #53970 from ScorpioCPH/add-more-comments
Automatic merge from submit-queue (batch tested with PRs 55283, 55461, 55288, 53970, 55487).

Add more comments for DevicePluginHandlerImpl struct

**What this PR does / why we need it**:

Add more comments

**Special notes for your reviewer**:

@jiayingz PTAL.

**Release note**:

```
NONE
```
2017-11-13 12:32:27 -08:00
Dr. Stefan Schimanski
bec617f3cc Update generated files 2017-11-09 12:14:08 +01:00
Dr. Stefan Schimanski
012b085ac8 pkg/apis/core: mechanical import fixes in dependencies 2017-11-09 12:14:08 +01:00
Clayton Coleman
66590d6f83
Container manager has a bad fake interface 2017-11-03 22:21:29 -04:00
Penghao Cen
1d4e1942d8 Add more comments for HandlerImpl struct 2017-11-03 18:24:32 +08:00
Kubernetes Submit Queue
2084f7f4f3
Merge pull request #54488 from lichuqiang/plugin_base
Automatic merge from submit-queue.

Add admission handler for device resources allocation

**What this PR does / why we need it**:
Add admission handler for device resources allocation to fail fast during pod creation

**Which issue this PR fixes** 
fixes #51592

**Special notes for your reviewer**:
@jiayingz Sorry, there was something wrong with my branch in #51895, and I think the existing comments in that PR might be too long for others to view. So I closed it and opened this new one, as we have basically reached an agreement on the implementation :)
I have covered the functionality and unit tests here, and will start on the e2e part ASAP.

/cc @jiayingz @vishh @RenaudWasTaken 

**Release note**:

```release-note
NONE
```
2017-11-02 17:24:06 -07:00
lichuqiang
0630896383 update unit test for plugin resources allocation reinforcement 2017-11-02 09:18:24 +08:00
lichuqiang
ebd445eb8c add admission handler for device resources allocation 2017-11-02 09:17:48 +08:00
Shawn Hsiao
f7a15cb751 set leveled logging (v=4) for 'updating container' message 2017-11-01 16:54:23 -04:00
Kubernetes Submit Queue
94e77bd4ca Merge pull request #54408 from intelsdi-x/cpu-state-file
Automatic merge from submit-queue (batch tested with PRs 54656, 54552, 54389, 53634, 54408).

Add file backed state to cpu manager

**What this PR does / why we need it**:
Adds a file-backed `State` implementation to the CPU manager, with tests.
Reads from `State` are done from memory, while each write triggers state save to a file.

Any failure in reading the state file results in empty state

Next PR: #54409
2017-10-26 21:08:38 -07:00
Rohit Agarwal
092429be1c Better error messages and logging while registering device plugins. 2017-10-26 15:17:38 -07:00
Michał Stachowski
97e3f7bf86 State file test fixes 2017-10-26 20:03:35 +02:00
Szymon Scharmach
4ee0adc77a Added Cpu Manager file state 2017-10-26 20:03:17 +02:00
lichuqiang
6a39ac3874 merge func list and listwatch into one 2017-10-26 16:36:16 +08:00
Jiaying Zhang
e501f01d85 Move podDevices code into a separate file. 2017-10-24 17:48:59 -07:00
Jiaying Zhang
ff4e8d429e Device plugin code refactoring to cope with file move.
While moving device_plugin_handler_test.go from pkg/kubelet/cm/ to
pkg/kubelet/cm/deviceplugin/, we can no longer use cm in its tests
because that would cause a cyclic dependency. To solve this problem,
I moved the main cm GetResources functionality as well as part of the
current device plugin handler Allocate functionality into a new device
plugin handler function, GetDeviceRunContainerOptions(). This
refactoring is also needed by another PR 51895 that moves device
allocation into admission phase. Now device plugin handler Allocate()
first checks whether there is cached device runtime state and only
issues Allocate grpc call if there is no cached state available.
The new GetDeviceRunContainerOptions() function simply returns device
runtime config from the cached state. To support this change, extended the
podDevices struct and checkpoint data structure with device runtime state.
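
The cache-first allocation flow described in this message can be sketched as below. All names here (`allocator`, `runOptions`, the string key) are hypothetical simplifications, and the `allocate` function field merely stands in for the gRPC Allocate call.

```go
package main

// runOptions is a hypothetical stand-in for per-container device runtime config.
type runOptions struct {
	Envs map[string]string
}

// allocator caches runtime config per container key (for example "podUID/container").
type allocator struct {
	cache    map[string]runOptions
	allocate func(key string) (runOptions, error) // stands in for the gRPC Allocate call
}

func newAllocator(allocate func(key string) (runOptions, error)) *allocator {
	return &allocator{cache: make(map[string]runOptions), allocate: allocate}
}

// Allocate only issues the remote call when no cached state exists for the key.
func (a *allocator) Allocate(key string) error {
	if _, ok := a.cache[key]; ok {
		return nil
	}
	opts, err := a.allocate(key)
	if err != nil {
		return err
	}
	a.cache[key] = opts
	return nil
}

// GetDeviceRunContainerOptions simply returns the cached config, if any.
func (a *allocator) GetDeviceRunContainerOptions(key string) (runOptions, bool) {
	opts, ok := a.cache[key]
	return opts, ok
}
```

Splitting the two steps this way lets allocation happen early (for example at admission) while container start only reads the cached result.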
2017-10-24 14:38:15 -07:00
Jiaying Zhang
796f488789 Move device plugin related files under pkg/kubelet/cm/deviceplugin/. 2017-10-24 14:17:20 -07:00
lichuqiang
fd8b04649e unnecessary functions cleanup for deviceplugin 2017-10-20 09:37:59 +08:00
Vishnu kannan
16b0363b95 Disabling k8s.io/kubernetes/pkg/kubelet/cm TestPodContainerDeviceAllocation due to #54100
Signed-off-by: Vishnu kannan <vishnuk@google.com>
2017-10-19 10:35:24 -07:00
Vishnu kannan
e0032af916 bump device plugin version to v1alpha2 to reflect the change to AllocateResponse API
Signed-off-by: Vishnu kannan <vishnuk@google.com>
2017-10-19 10:35:24 -07:00
Vishnu kannan
18eee1eaa0 Make AllocateResponse artifacts global across all devices per container in device plugin API
There is no use case known for passing artifacts per device as it currently exists. The current API is also
complex to use for simple clients. Hence this PR creates a flat namespace where artifacts like environment variables
and mount points apply globally to all devices returned as part of AllocateResponse proto.

Signed-off-by: Vishnu kannan <vishnuk@google.com>
2017-10-19 10:34:00 -07:00
Kubernetes Submit Queue
1d8f1e268f Merge pull request #47699 from supereagle/fix-typos
Automatic merge from submit-queue.

fix typos: remove duplicated word in comments

**What this PR does / why we need it**: Remove the duplicated word `the` in comments

**Which issue this PR fixes** : fixes #

**Special notes for your reviewer**:

```release-note
NONE
```
2017-10-17 02:35:52 -07:00
Jeff Grafton
aee5f457db update BUILD files 2017-10-15 18:18:13 -07:00
Kubernetes Submit Queue
3deab69d3b Merge pull request #53790 from yanxuean/cgroupredundancy
Automatic merge from submit-queue (batch tested with PRs 52959, 53790).

remove redundancy code in setCPUCgroupConfig

fix #53925

Signed-off-by: yanxuean <yan.xuean@zte.com.cn>



**What this PR does / why we need it**:

The check on burstableCPUShares is redundant. MilliCPUToShares already enforces the minimum, and that is its responsibility.
```
func (m *qosContainerManagerImpl) setCPUCgroupConfig(configs map[v1.PodQOSClass]*CgroupConfig) error {
        ........
	// set burstable shares based on current observe state
	burstableCPUShares := MilliCPUToShares(burstablePodCPURequest)
	if burstableCPUShares < uint64(MinShares) {
		burstableCPUShares = uint64(MinShares)
	}
```
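
To see why the caller-side check adds nothing, here is a sketch of a conversion helper that already clamps its result. The lowercase names and the constant values (minimum of 2 shares, 1024 shares per CPU) are assumptions for illustration, not a quote of the kubelet source.

```go
package main

const (
	minShares     = 2    // assumed lower bound for cpu.shares
	sharesPerCPU  = 1024 // assumed shares granted per full CPU
	milliCPUToCPU = 1000 // milli-CPUs per CPU
)

// milliCPUToShares converts a milli-CPU request into cgroup CPU shares and
// never returns less than minShares, so callers do not need to re-check.
func milliCPUToShares(milliCPU int64) uint64 {
	if milliCPU == 0 {
		return minShares
	}
	shares := (milliCPU * sharesPerCPU) / milliCPUToCPU
	if shares < minShares {
		return minShares
	}
	return uint64(shares)
}
```

Under these assumptions, a request of 1 milli-CPU computes to 1 share and is raised to the minimum inside the helper, which is exactly the case the removed check guarded against.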
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
Improving code.

**Special notes for your reviewer**:

**Release note**:

```release-note
```
2017-10-13 19:19:32 -07:00
yanxuean
5d5fee8cab capitalize the first letter
capitalize the first letter for the field comment of containerManagerImpl

Signed-off-by: yanxuean <yan.xuean@zte.com.cn>
2017-10-13 14:54:06 +08:00
Kubernetes Submit Queue
03adf92aa9 Merge pull request #53753 from derekwaynecarr/log-spam
Automatic merge from submit-queue (batch tested with PRs 53119, 53753, 53795, 52981).

Reduce log spam in qos container manager

**What this PR does / why we need it**:
Excessive log statements make it hard to debug actual problems.

**Release note**:
```release-note
NONE
```
2017-10-12 08:28:36 -07:00
yanxuean
8adb2181eb remove redundancy code in setCPUCgroupConfig
Signed-off-by: yanxuean <yan.xuean@zte.com.cn>
2017-10-12 18:42:18 +08:00
Derek Carr
328a12d160 Reduce log spam in qos container manager 2017-10-11 19:47:40 -04:00
Euan Kemp
7aa88b5103 kubelet/cm: remove unneeded fork of 'cat'
Reading a file in Go is perfectly possible without invoking cat.

I also removed an outdated comment.
2017-10-10 21:53:35 -07:00
Kubernetes Submit Queue
ec116fdc73 Merge pull request #53328 from intelsdi-x/lscpu_fix
Automatic merge from submit-queue (batch tested with PRs 53297, 53328).

Cpu Manager - make CoreID's platform unique

**What this PR does / why we need it**:
The CPU manager uses the topology from cAdvisor (`/proc/cpuinfo`), where core IDs are unique per socket rather than per platform. This causes problems on multi-socket platforms.

All code assumes core IDs that are unique across the platform, so the `Discovery` function has been changed to assign CoreID as the lowest cpuID of all CPUs belonging to the same core. This can be expressed as:
`CoreID=min(cpuID's on the same core)`

Since cpuIDs are platform unique, this guarantees that CoreIDs are also platform unique.
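
The rule above can be sketched as a small function. The `cpuInfo` record and the function name are hypothetical; the real `Discovery` code works on cAdvisor's topology types instead.

```go
package main

// cpuInfo is a hypothetical, simplified record for one logical CPU.
type cpuInfo struct {
	CPUID    int // platform-unique logical CPU ID
	SocketID int
	CoreID   int // core ID as reported, unique only within its socket
}

// platformUniqueCoreIDs maps each logical CPU ID to a platform-unique core ID:
// the lowest CPU ID among the CPUs that share the same (socket, core) pair.
func platformUniqueCoreIDs(cpus []cpuInfo) map[int]int {
	type coreKey struct{ socket, core int }
	lowest := make(map[coreKey]int)
	for _, c := range cpus {
		k := coreKey{c.SocketID, c.CoreID}
		if cur, ok := lowest[k]; !ok || c.CPUID < cur {
			lowest[k] = c.CPUID
		}
	}
	result := make(map[int]int, len(cpus))
	for _, c := range cpus {
		result[c.CPUID] = lowest[coreKey{c.SocketID, c.CoreID}]
	}
	return result
}
```

On a two-socket machine where both sockets report a core 0, the two cores end up with different IDs because their lowest CPU IDs differ.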



**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #53323
2017-10-10 11:20:37 -07:00
Kubernetes Submit Queue
aaf14d4619 Merge pull request #53525 from sttts/sttts-scheme-copier-romoval
Automatic merge from submit-queue (batch tested with PRs 53525, 53652).

apimachinery: remove ObjectCopier interface(s)

The big commit is a mechanical, transitive removal of the copier interfaces in all structs and function calls.
2017-10-10 08:31:41 -07:00
Szymon Scharmach
b86dc9c054 Make CoreID's platform unique 2017-10-10 10:45:44 +02:00
Kubernetes Submit Queue
c12dab37e7 Merge pull request #53547 from jiayingz/deviceplugin-fix
Automatic merge from submit-queue (batch tested with PRs 52662, 53547, 53588, 53573, 53599).

In DevicePluginHandlerImpl.Allocate(), skips untracked extended resou…

…rces.

Otherwise, we would fail a Pod allocation request that has an extended
resource not managed by any device plugin.



**What this PR does / why we need it**:

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
https://github.com/kubernetes/kubernetes/issues/53548

**Special notes for your reviewer**:

**Release note**:

```release-note
Ignore extended resources that are not registered with kubelet
```
2017-10-09 12:51:17 -07:00
Kubernetes Submit Queue
85b252d47e Merge pull request #51771 from dixudx/refactor_nsenter
Automatic merge from submit-queue.

Refactor nsenter

**What this PR does / why we need it**:

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #51273

**Special notes for your reviewer**:
/assign @jsafrane 

**Release note**:

```release-note
None
```
2017-10-08 23:27:32 -07:00
Dr. Stefan Schimanski
ecb65a6a71 Update generated files 2017-10-07 11:28:47 +02:00
Jiaying Zhang
ee1ffa619b In DevicePluginHandlerImpl.Allocate(), skips untracked extended resources.
Otherwise, we would fail a Pod allocation request that has an extended
resource not managed by any device plugin.
2017-10-06 13:57:53 -07:00
Dr. Stefan Schimanski
ed586da147 apimachinery: remove Scheme.DeepCopy 2017-10-06 14:59:17 +02:00
Kubernetes Submit Queue
5e2ce3aaf2 Merge pull request #53122 from resouer/fix-cpu
Automatic merge from submit-queue.

Eliminate extra CRI call during processing cpu set

**What this PR does / why we need it**:

Encountered this during `kubernetes/frakti` node e2e test.

When cpuset is not set, `runtime.UpdateContainerResources` is still called many times, which seems unnecessary.
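
The guard being proposed can be sketched as follows. The interface and its two-string signature are hypothetical simplifications for illustration; the real CRI call takes different arguments.

```go
package main

// resourceUpdater is a hypothetical, simplified view of the runtime call.
type resourceUpdater interface {
	UpdateContainerResources(containerID, cpuset string) error
}

// applyCPUSet skips the runtime call entirely when there is no cpuset to apply.
func applyCPUSet(rt resourceUpdater, containerID, cpuset string) error {
	if cpuset == "" {
		return nil
	}
	return rt.UpdateContainerResources(containerID, cpuset)
}

// countingRuntime is a fake runtime that records how often it is called.
type countingRuntime struct{ calls int }

func (c *countingRuntime) UpdateContainerResources(containerID, cpuset string) error {
	c.calls++
	return nil
}
```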

cc @ConnorDoyle Make sense? Fixes: #53304

**Special notes for your reviewer**:

**Release note**:

```release-note
Only do UpdateContainerResources when cpuset is set 
```
2017-10-01 15:30:56 -07:00
Harry Zhang
282973d87d Eliminate extra CRI call 2017-09-30 16:51:32 +08:00
Kubernetes Submit Queue
6fcf841d69 Merge pull request #52692 from wackxu/fbc
Automatic merge from submit-queue (batch tested with PRs 44596, 52708, 53163, 53167, 52692).

Fix the bad code comment and make the format unify

**What this PR does / why we need it**:

Fix the bad code comment and unify the format.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #


**Release note**:

```release-note
NONE
```
2017-09-28 21:15:43 -07:00
Di Xu
57ead4898b use GetFileType per mount.Interface to check hostpath type 2017-09-26 09:57:06 +08:00
NickrenREN
7f9696201e Fix --kube-reserved storage key name and add test cases for node allocatable reservation 2017-09-26 09:32:21 +08:00
yanxuean
f011c044d4 improve cgroupmanager in qosContainerManager
improve arg "cgroupRoot" type in NewQOSContainerManager

Signed-off-by: yanxuean <yan.xuean@zte.com.cn>
2017-09-25 16:59:15 +08:00
yanxuean
45146cff4e Use arg cgroupRoot,not nodeConfig.CgroupRoot
Using both the cgroupRoot argument and nodeConfig.CgroupRoot in function NewQOSContainerManager is confusing

Signed-off-by: yanxuean <yan.xuean@zte.com.cn>
2017-09-25 15:19:20 +08:00
wackxu
d8aa0ca82a fix the bad code comment and make the format unify 2017-09-19 11:15:10 +08:00
supereagle
87c29a08e1 fix typos: remove duplicated word in comments 2017-09-16 14:38:10 +08:00
Balaji Subramaniam
e2cb80db4a Added large topology tests for static policy in CPU Manager.
- Added comments for tests cases.
2017-09-06 13:15:22 -07:00
Kubernetes Submit Queue
dcc1aa0628 Merge pull request #51928 from mindprince/pr-45724-fix-build
Automatic merge from submit-queue

Make *fakeMountInterface in container_manager_unsupported_test.go implement mount.Interface again.

This was broken in #45724

**Release note**:
```release-note
NONE
```
/sig storage
/sig node

/cc @jsafrane, @vishh
2017-09-05 19:44:54 -07:00
Kubernetes Submit Queue
99aa992ce8 Merge pull request #51751 from dashpole/update_cadvisor_godep
Automatic merge from submit-queue (batch tested with PRs 51186, 50350, 51751, 51645, 51837)

Update Cadvisor Dependency

Fixes: https://github.com/kubernetes/kubernetes/issues/51832
This is the worst dependency update ever... 
The root of the problem is the [name change of Sirupsen -> sirupsen](https://github.com/sirupsen/logrus/issues/570#issuecomment-313933276). This means that in order to update cadvisor, which vendors the lowercase name, we need to update all dependencies to use the lower-cased version. With that being said, this PR updates the following packages:

`github.com/docker/docker`
- `github.com/docker/distribution`
  - `github.com/opencontainers/go-digest`
  - `github.com/opencontainers/image-spec`
  - `github.com/opencontainers/runtime-spec`
  - `github.com/opencontainers/selinux`
  - `github.com/opencontainers/runc`
    - `github.com/mrunalp/fileutils`
  - `golang.org/x/crypto`
    - `golang.org/x/sys`
- `github.com/docker/go-connections`
- `github.com/docker/go-units`
- `github.com/docker/libnetwork`
- `github.com/docker/libtrust`
- `github.com/sirupsen/logrus`
- `github.com/vishvananda/netlink`

`github.com/google/cadvisor`
- `github.com/euank/go-kmsg-parser`

`github.com/json-iterator/go`

Fixed https://github.com/kubernetes/kubernetes/issues/51832

```release-note
Fix journalctl leak on kubelet restart
Fix container memory rss
Add hugepages monitoring support
Fix incorrect CPU usage metrics with 4.7 kernel
Add tmpfs monitoring support
```
2017-09-05 17:30:06 -07:00
Kubernetes Submit Queue
8b9e8cf80a Merge pull request #51744 from jiayingz/deviceplugin-checkpoint
Automatic merge from submit-queue (batch tested with PRs 50072, 51744)

Deviceplugin checkpoint

**What this PR does / why we need it**:
Extends on top of PR 51209 to checkpoint device to pod allocation information on Kubelet to recover from Kubelet restarts.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #

**Special notes for your reviewer**:

**Release note**:

```release-note
```
2017-09-05 13:33:01 -07:00
David Ashpole
e5a6a79fd7 update cadvisor, docker, and runc godeps 2017-09-05 12:38:57 -07:00
Jiaying Zhang
3b2bc58c11 Extends device_plugin_handler to checkpoint device to container allocation information. 2017-09-05 09:52:14 -07:00
Derek Carr
38d5dee677 Node validation restricts pre-allocated hugepages to single page size 2017-09-05 10:34:30 -04:00
Derek Carr
1ec2a69d9a Kubelet changes to support hugepages 2017-09-05 09:46:08 -04:00
Rohit Agarwal
08ea02b9a5 Make *fakeMountInterface in container_manager_unsupported_test.go implement mount.Interface again.
This was broken in #45724
2017-09-04 21:48:55 -07:00
Balaji Subramaniam
5b5958ecec Add tests for the static cpumanager policy. 2017-09-04 07:24:59 -07:00
Connor Doyle
d0bcbbb437 Added static cpumanager policy. 2017-09-04 07:24:59 -07:00
Connor Doyle
e03a6435bb Added cpu assignment helpers. 2017-09-04 07:24:59 -07:00
Szymon Scharmach
242439c9d7 Add topology helper and tests to cpumanager. 2017-09-04 07:24:59 -07:00
Connor Doyle
e4d5565228 Fix Start signature in container_manager_windows. 2017-09-04 07:24:59 -07:00
Connor Doyle
81ccd396d7 Fixed nil InternalContainerLifecycle in cm stubs. 2017-09-04 07:24:59 -07:00
Connor Doyle
ec706216e6 Un-revert "CPU manager wiring and none policy"
This reverts commit 8d2832021a.
2017-09-04 07:24:59 -07:00
Jiaying Zhang
29d178fbc3 Fixes a cross-build failure introduced in PR 51209. FYI, issue 51863. 2017-09-02 21:56:39 -07:00
Kubernetes Submit Queue
917f9f02ef Merge pull request #45724 from jsafrane/mount-propagation2
Automatic merge from submit-queue

Make /var/lib/kubelet as shared during startup

This is part of ~~https://github.com/kubernetes/community/pull/589~~ https://github.com/kubernetes/community/pull/659

We'd like kubelet to be able to consume mounts from containers in the future, therefore kubelet should make sure that `/var/lib/kubelet` has shared mount propagation to be able to see these mounts. 

On most distros, root directory is already mounted with shared mount propagation and this code will not do anything. On older distros such as Debian Wheezy, this code detects that `/var/lib/kubelet` is a directory on `/` which has private mount propagation and kubelet bind-mounts `/var/lib/kubelet` as rshared.

Both "regular" linux mounter and `NsenterMounter` are updated here.

@kubernetes/sig-storage-pr-reviews @kubernetes/sig-node-pr-reviews 
@vishh 

Release note:
```release-note
Kubelet re-binds /var/lib/kubelet directory with rshared mount propagation during startup if it is not shared yet.
```
2017-09-02 12:00:30 -07:00
Jiaying Zhang
02001af752 Kubelet side extension to support device allocation 2017-09-01 11:56:35 -07:00
Renaud Gaubert
c4a1c97329 Device Plugin Kubelet integration 2017-09-01 11:47:09 -07:00
Shyam JVS
8d2832021a Revert "CPU manager wiring and none policy" 2017-09-01 18:17:36 +02:00
Connor Doyle
50674ec614 Added cpu-manager-reconcile-period config.
- Defaults to sync-frequency.
2017-08-30 23:42:32 -07:00
Connor Doyle
7c6e31617d CPU Manager initialization and lifecycle calls. 2017-08-30 08:50:41 -07:00
Connor Doyle
5dee682796 CPU manager config and feature gate. 2017-08-30 08:27:23 -07:00
Balaji Subramaniam
7567f1765f Added CPU manager unit tests (none policy) 2017-08-30 08:26:22 -07:00
Seth Jennings
ff471913f9 Added none policy for CPU manager. 2017-08-30 08:26:21 -07:00