kubernetes/pkg/kubelet/cm
Kubernetes Submit Queue a3f40dd8df
Merge pull request #60856 from jiayingz/race-fix
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Fixes the races around devicemanager Allocate() and endpoint deletion.

There is a race in predicateAdmitHandler Admit(): getNodeAnyWayFunc()
can return a Node with non-zero device plugin resource allocatable for an
endpoint that no longer exists. The race can happen when a device plugin
fails, but it is more likely when kubelet restarts because, with the
current registration model, there is a time gap between kubelet restart
and device plugin re-registration. During this window, even though
devicemanager may have initially removed the resource during the
GetCapacity() call, kubelet may overwrite the device plugin resource
capacity/allocatable with the old value when a node update from the API
server arrives later. This could cause a pod to be started without the
proper device runtime config set.

To solve this problem, introduce endpointStopGracePeriod. When a device
plugin fails, don't remove its endpoint immediately; instead, set stopTime
on the endpoint. During kubelet restart, create endpoints with stopTime set
for every checkpointed registered resource. An endpoint is considered to be
in stopGracePeriod if its stopTime is set. This allows us to track which
resources should be handled by devicemanager during the time gap.
When an endpoint's stopGracePeriod expires, we remove the endpoint and
its resource. This allows the resource to be exported through other channels
(e.g., by directly updating node status through the API server) if there is
such a use case. Currently endpointStopGracePeriod is set to 5 minutes.

Given that an endpoint is no longer removed immediately upon disconnection,
mark all its devices unhealthy so that the change in resource allocatable
is signaled to the scheduler, which then avoids scheduling more pods to the
node. While a device plugin endpoint is in stopGracePeriod, pods requesting
the corresponding resource will be rejected by the admission handler.

Tested:
Ran the GPUDevicePlugin e2e_node test 100 times; all runs now pass.



**What this PR does / why we need it**:

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes https://github.com/kubernetes/kubernetes/issues/60176

**Special notes for your reviewer**:

**Release note**:

```release-note
Fixes the races around devicemanager Allocate() and endpoint deletion.
```
2018-03-12 02:50:13 -07:00
| Name | Last commit | Date |
| --- | --- | --- |
| cpumanager | Autogenerated: hack/update-bazel.sh | 2018-02-16 13:43:01 -08:00 |
| cpuset | Autogenerated: hack/update-bazel.sh | 2018-02-16 13:43:01 -08:00 |
| devicemanager | Fixes the races around devicemanager Allocate() and endpoint deletion. | 2018-03-09 17:00:57 -08:00 |
| util | Autogenerate BUILD files | 2017-12-23 13:12:11 -08:00 |
| BUILD | Promote LocalStorageCapacityIsolation feature to beta | 2018-03-02 15:10:08 -08:00 |
| cgroup_manager_linux_test.go | Test cases fix after path expansion | 2018-02-20 14:23:09 -05:00 |
| cgroup_manager_linux.go | kubelet ignores hugepages if hugetlb is not enabled | 2018-02-05 13:07:59 -05:00 |
| cgroup_manager_test.go | Lift embedded structure out of eviction-related KubeletConfiguration fields | 2017-11-16 18:35:13 -08:00 |
| cgroup_manager_unsupported.go | Add pod-level metric for CPU and memory stats | 2017-11-22 09:25:23 -08:00 |
| container_manager_linux_test.go | use GetFileType per mount.Interface to check hostpath type | 2017-09-26 09:57:06 +08:00 |
| container_manager_linux.go | Invoke PreStart RPC call before container start, if desired by plugin | 2018-02-21 01:25:24 -05:00 |
| container_manager_stub.go | Promote LocalStorageCapacityIsolation feature to beta | 2018-03-02 15:10:08 -08:00 |
| container_manager_unsupported.go | Remove redundant code in container manager. | 2017-11-24 03:15:55 -08:00 |
| container_manager_windows.go | Extends deviceplugin to gracefully handle full device plugin lifecycle. | 2017-11-20 23:40:14 -08:00 |
| container_manager.go | collect metrics on the /kubepods cgroup on-demand | 2018-02-17 12:32:40 -08:00 |
| fake_internal_container_lifecycle.go | Un-revert "CPU manager wiring and none policy" | 2017-09-04 07:24:59 -07:00 |
| helpers_linux_test.go | update cadvisor, docker, and runc godeps | 2017-09-05 12:38:57 -07:00 |
| helpers_linux.go | Add pod-level metric for CPU and memory stats | 2017-11-22 09:25:23 -08:00 |
| helpers_unsupported.go | Add pod-level metric for CPU and memory stats | 2017-11-22 09:25:23 -08:00 |
| internal_container_lifecycle.go | Fixed nil InternalContainerLifecycle in cm stubs. | 2017-09-04 07:24:59 -07:00 |
| node_container_manager_test.go | codeClean-merge-logfAndFailnow-to-fatalf | 2018-01-31 11:39:31 +08:00 |
| node_container_manager.go | Move some kubelet constants to a common place. | 2017-12-01 11:24:04 +08:00 |
| OWNERS | Add ConnorDoyle as approver in /pkg/kubelet/cm. | 2017-12-06 09:05:59 -06:00 |
| pod_container_manager_linux.go | Set pids limit at pod level | 2018-01-11 21:22:38 -05:00 |
| pod_container_manager_stub.go | run hack/update-all | 2017-06-22 11:31:03 -07:00 |
| pod_container_manager_unsupported.go | Remove redundant code in container manager. | 2017-11-24 03:15:55 -08:00 |
| qos_container_manager_linux.go | Merge pull request #52977 from yanxuean/improvecgroup | 2017-11-18 13:13:28 -08:00 |
| types.go | Set pids limit at pod level | 2018-01-11 21:22:38 -05:00 |