There is a race in predicateAdmitHandler.Admit() where getNodeAnyWayFunc()
can return a Node whose allocatable still lists a non-zero amount of a
device plugin resource even though the corresponding endpoint no longer
exists. That race can happen when a device plugin fails, but it is more
likely when the kubelet restarts: with the current registration model there
is a time gap between the kubelet restart and device plugin re-registration.
During this window, even though the devicemanager may have removed the
resource during its initial GetCapacity() call, the kubelet may overwrite
the device plugin resource capacity/allocatable with the old value when a
node update later arrives from the API server. This can cause a pod to be
started without the proper device runtime config set.
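For context, here is a simplified sketch of the capacity-reporting step mentioned above. getCapacity and its arguments are made up for illustration and are not the actual kubelet code; the point is only that resources without a live endpoint are reported for removal rather than carried forward.

package main

import "fmt"

// getCapacity is an illustrative stand-in for the devicemanager's
// capacity-reporting step: resources that still have an endpoint keep
// their counts, resources without one are reported as deleted so node
// status can drop them instead of inheriting a stale value.
func getCapacity(counts map[string]int64, hasEndpoint map[string]bool) (capacity map[string]int64, deleted []string) {
	capacity = map[string]int64{}
	for resource, n := range counts {
		if hasEndpoint[resource] {
			capacity[resource] = n
		} else {
			deleted = append(deleted, resource)
		}
	}
	return capacity, deleted
}

func main() {
	// After a kubelet restart, "fake-domain/resource" has not re-registered yet.
	capacity, deleted := getCapacity(
		map[string]int64{"vendor.com/gpu": 2, "fake-domain/resource": 1},
		map[string]bool{"vendor.com/gpu": true},
	)
	fmt.Println(capacity, deleted)
}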
To solve this problem, introduce endpointStopGracePeriod. When a device
plugin fails, don't immediately remove the endpoint; instead record a
stopTime on it. During kubelet restart, create endpoints with stopTime set
for every checkpointed registered resource. An endpoint is considered to be
in stopGracePeriod if its stopTime is set. This lets us track which
resources should still be handled by the devicemanager during the time gap.
When an endpoint's stopGracePeriod expires, we remove the endpoint and
its resource. This allows the resource to be exported through other channels
(e.g., by directly updating node status through the API server) if there is
such a use case. Currently endpointStopGracePeriod is set to 5 minutes.
Given that an endpoint is no longer removed immediately upon disconnection,
mark all of its devices unhealthy so that the allocatable change is signaled
to the scheduler and no more pods requesting the resource are scheduled onto
the node.
When a device plugin endpoint is in stopGracePeriod, pods requesting the
corresponding resource fail the admission handler.
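As a rough illustration of the mechanism, here is a hedged Go sketch. The type, field, and method names (endpoint.stopTime, stop(), stopGracePeriodExpired()) are illustrative and not necessarily the actual devicemanager code.

package main

import (
	"fmt"
	"time"
)

// Illustrative grace period; the change described above sets it to 5 minutes.
const endpointStopGracePeriod = 5 * time.Minute

// endpoint is an illustrative stand-in for a device plugin endpoint.
type endpoint struct {
	resourceName string
	stopTime     time.Time       // zero value means the endpoint is running
	devices      map[string]bool // device ID -> healthy
}

// stop records the stop time and marks every device unhealthy so the
// allocatable change propagates to the scheduler.
func (e *endpoint) stop() {
	e.stopTime = time.Now()
	for id := range e.devices {
		e.devices[id] = false
	}
}

// stopGracePeriodExpired reports whether the endpoint (and its resource)
// should now be removed entirely.
func (e *endpoint) stopGracePeriodExpired() bool {
	return !e.stopTime.IsZero() && time.Since(e.stopTime) > endpointStopGracePeriod
}

func main() {
	ep := &endpoint{resourceName: "vendor.com/gpu", devices: map[string]bool{"dev0": true}}
	ep.stop()
	// While in the grace period, pods requesting vendor.com/gpu fail admission.
	fmt.Println("in grace period:", !ep.stopGracePeriodExpired())
}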
Incompatible changes:
- Add a GetDevicePluginOptions rpc call. This is needed when we switch
from the Registration service to the probe-based plugin watcher.
- Change AllocateRequest and AllocateResponse to allow device requests
from multiple containers in a pod. For now, only a mechanical change
was made to the devicemanager and test code to cope with the new API;
an Allocate call is still issued per container (see the sketch after
this list). We can modify the devicemanager in 1.11 to issue a single
Allocate call per pod. The change will also make it easier to
incrementally extend the Allocate rpc to communicate pod-level
information if such a need arises.
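To make the "one Allocate call per container" point concrete, here is a hedged Go sketch. The struct and field names only roughly mirror the generated API types (ContainerRequests, DevicesIDs may differ from the actual api.proto), and deviceAllocator, allocateForContainer, and fakePlugin are made up for illustration.

package main

import (
	"context"
	"fmt"
)

// Rough, hand-written mirrors of the reshaped request/response messages.
type ContainerAllocateRequest struct {
	DevicesIDs []string
}

type AllocateRequest struct {
	ContainerRequests []*ContainerAllocateRequest
}

type ContainerAllocateResponse struct {
	Envs map[string]string
}

type AllocateResponse struct {
	ContainerResponses []*ContainerAllocateResponse
}

// deviceAllocator stands in for the generated gRPC client interface.
type deviceAllocator interface {
	Allocate(ctx context.Context, req *AllocateRequest) (*AllocateResponse, error)
}

// allocateForContainer shows the "mechanical" use of the new shape: the
// devicemanager still sends one Allocate per container, but wraps that
// container's devices in a pod-level request, leaving room to batch all
// containers of a pod into a single call later.
func allocateForContainer(client deviceAllocator, deviceIDs []string) (*AllocateResponse, error) {
	return client.Allocate(context.TODO(), &AllocateRequest{
		ContainerRequests: []*ContainerAllocateRequest{{DevicesIDs: deviceIDs}},
	})
}

// fakePlugin is a trivial in-process stand-in for a device plugin.
type fakePlugin struct{}

func (fakePlugin) Allocate(_ context.Context, req *AllocateRequest) (*AllocateResponse, error) {
	resp := &AllocateResponse{}
	for range req.ContainerRequests {
		resp.ContainerResponses = append(resp.ContainerResponses,
			&ContainerAllocateResponse{Envs: map[string]string{"EXAMPLE_VISIBLE_DEVICES": "dev0"}})
	}
	return resp, nil
}

func main() {
	resp, _ := allocateForContainer(fakePlugin{}, []string{"dev0"})
	fmt.Println(len(resp.ContainerResponses)) // prints 1: one per-container response
}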
Automatic merge from submit-queue (batch tested with PRs 59489, 59716). If you want to cherry-pick this change to another branch, please follow the instructions at https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.
devicemanager testing: dynamically choose tmp dir
This avoids the test issue #59488 that I was running into.
I believe I have a reasonable explanation for the race condition in that issue (TL;DR: it's probably inherent to the gRPC API, and k8s can only work around it until a proper solution gets worked out together with gRPC). I therefore suggest merging this PR now, both because it avoids the issue and because using fixed tmp directories is something that should be avoided anyway.
/assign @jiayingz
Hard-coding the tests to use /tmp/device_plugin for sockets is
problematic because it prevents running tests in parallel on the same
machine (for example when multiple developers share it, or when testing
runs independently on different code checkouts).
/tmp/device_plugin was also not removed after testing, although that is
probably less relevant. More importantly, this change also fixes
https://github.com/kubernetes/kubernetes/issues/59488.
"make test" failed in TestDevicePluginReRegistration because something
removed /tmp/device_plugin/device-plugin.sock while something else
tried to connect to it:
2018/02/07 14:34:39 Starting to serve on /tmp/device_plugin/device-plugin.sock
[pid 29568] connect(14, {sa_family=AF_UNIX, sun_path="/tmp/device_plugin/server.sock"}, 33) = 0
[pid 29568] unlinkat(AT_FDCWD, "/tmp/device_plugin/server.sock", 0) = 0
[pid 29568] unlinkat(AT_FDCWD, "/tmp/device_plugin/device-plugin.sock", 0) = 0
[pid 29568] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=29568, si_uid=1000} ---
[pid 29568] connect(6, {sa_family=AF_UNIX, sun_path="/tmp/device_plugin/device-plugin.sock"}, 40) = -1 ENOENT (No such file or directory)
E0207 14:34:39.961321 29568 endpoint.go:117] listAndWatch ended unexpectedly for device plugin mock with error rpc error: code = Unavailable desc = transport is closing
strace: Process 29623 attached
[pid 29574] connect(3, {sa_family=AF_UNIX, sun_path="/tmp/device_plugin/device-plugin.sock"}, 40) = -1 ENOENT (No such file or directory)
[pid 29623] connect(3, {sa_family=AF_UNIX, sun_path="/tmp/device_plugin/device-plugin.sock"}, 40) = -1 ENOENT (No such file or directory)
[pid 29574] connect(3, {sa_family=AF_UNIX, sun_path="/tmp/device_plugin/device-plugin.sock"}, 40) = -1 ENOENT (No such file or directory)
E0207 14:34:49.961324 29568 endpoint.go:60] Can't create new endpoint with path /tmp/device_plugin/device-plugin.sock err failed to dial device plugin: context deadline exceeded
E0207 14:34:49.961390 29568 manager.go:340] Failed to dial device plugin with request &RegisterRequest{Version:v1alpha2,Endpoint:device-plugin.sock,ResourceName:fake-domain/resource,}: failed to dial device plugin: context deadline exceeded
panic: test timed out after 2m0s
It's not entirely certain which code was to blame for these unlinkat()
calls (perhaps some cleanup code from a previous test running in a
goroutine?), but the failure no longer occurred after switching to
per-test socket directories.
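For reference, a minimal sketch of the per-test socket directory approach. tmpSocketDir and the example test are illustrative (ioutil.TempDir was the idiomatic choice at the time; newer Go could use testing.T.TempDir); the socket file names match the ones in the log above.

package devicemanager_test

import (
	"io/ioutil"
	"os"
	"path/filepath"
	"testing"
)

// tmpSocketDir gives each test its own socket directory, so parallel runs
// (or leftover cleanup goroutines from earlier tests) cannot unlink another
// test's sockets. The cleanup function removes the directory, so nothing is
// left behind under /tmp after the test.
func tmpSocketDir(t *testing.T) (socketDir, kubeletSocket, pluginSocket string, cleanup func()) {
	socketDir, err := ioutil.TempDir("", "device_plugin")
	if err != nil {
		t.Fatalf("failed to create tmp socket dir: %v", err)
	}
	kubeletSocket = filepath.Join(socketDir, "server.sock")
	pluginSocket = filepath.Join(socketDir, "device-plugin.sock")
	cleanup = func() { os.RemoveAll(socketDir) }
	return socketDir, kubeletSocket, pluginSocket, cleanup
}

func TestExample(t *testing.T) {
	socketDir, kubeletSocket, pluginSocket, cleanup := tmpSocketDir(t)
	defer cleanup()
	// ... start the manager on kubeletSocket and the mock plugin on pluginSocket ...
	_, _, _ = socketDir, kubeletSocket, pluginSocket
}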