kubernetes

Author	SHA1	Message	Date
Patrick Ohly	fcbb64b93d	avoid race condition in device manager and plugin startup/shutdown A flaky test exposed a race condition where shutting down one server instance broke the startup of the next instance when using the same socket path. Commit `1325c2f8be` removed the reuse of the same socket path and thus avoided the issue. But the real fix is to ensure that the listening socket is really closed once Stop returns. Two solutions were proposed in https://github.com/grpc/grpc-go/issues/1861: - waiting for the goroutine to complete - closing the socket The former is done here because it's cleaner to not keep lingering goroutines. While at it, the Stop methods are made idempotent (similar to e.g. Close on a socket) and no longer crash when called without prior Start. Fixes https://github.com/kubernetes/kubernetes/issues/59488	2018-04-12 17:59:10 +02:00
Kubernetes Submit Queue	0022bec3a2	Merge pull request #61525 from tianshapjq/place-consts-together Automatic merge from submit-queue. move the const to the place it should be	2018-03-25 09:51:42 -07:00
hzxuzhonghu	70e45eccf2	Replace "golang.org/x/net/context" with "context"	2018-03-22 20:57:14 +08:00
tianshapjq	55921d0827	move the const to the place it should be	2018-03-22 14:20:15 +08:00
Jiaying Zhang	5514a1f4dd	Fixes the races around devicemanager Allocate() and endpoint deletion. There is a race in predicateAdmitHandler Admit() that getNodeAnyWayFunc() could get Node with non-zero deviceplugin resource allocatable for a non-existing endpoint. That race can happen when a device plugin fails, but is more likely when kubelet restarts as with the current registration model, there is a time gap between kubelet restart and device plugin re-registration. During this time window, even though devicemanager could have removed the resource initially during GetCapacity() call, Kubelet may overwrite the device plugin resource capacity/allocatable with the old value when node update from the API server comes in later. This could cause a pod to be started without proper device runtime config set. To solve this problem, introduce endpointStopGracePeriod. When a device plugin fails, don't immediately remove the endpoint but set stopTime in its endpoint. During kubelet restart, create endpoints with stopTime set for any checkpointed registered resource. The endpoint is considered to be in stopGracePeriod if its stoptime is set. This allows us to track what resources should be handled by devicemanager during the time gap. When an endpoint's stopGracePeriod expires, we remove the endpoint and its resource. This allows the resource to be exported through other channels (e.g., by directly updating node status through API server) if there is such use case. Currently endpointStopGracePeriod is set as 5 minutes. Given that an endpoint is no longer immediately removed upon disconnection, mark all its devices unhealthy so that we can signal the resource allocatable change to the scheduler to avoid scheduling more pods to the node. When a device plugin endpoint is in stopGracePeriod, pods requesting the corresponding resource will fail admission handler.	2018-03-09 17:00:57 -08:00
Jiaying Zhang	07beac6004	Made a couple API changes to deviceplugin/v1beta1 to avoid future incompatible changes: - Add GetDevicePluginOptions rpc call. This is needed when we switch from Registration service to probe-based plugin watcher. - Change AllocateRequest and AllocateResponse to allow device requests from multiple containers in a pod. Currently only made mechanical change on the devicemanager and test code to cope with the API but still issues an Allocate call per container. We can modify the devicemanager in 1.11 to issue a single Allocate call per pod. The change will also facilitate incremental API change to communicate pod level information through Allocate rpc if there is such future need.	2018-02-23 16:15:09 -08:00
vikaschoudhary16	e64517cd74	Migrate deviceplugin api from v1alpha to v1beta1	2018-02-21 01:26:20 -05:00
vikaschoudhary16	defcab81d5	Invoke PreStart RPC call before container start, if desired by plugin Signed-off-by: vikaschoudhary16 <vichoudh@redhat.com>	2018-02-21 01:25:24 -05:00
tianshapjq	21702e3c39	TODO has already been implemented	2018-02-01 14:38:29 +08:00
tianshapjq	e0f15bf5bf	Len() is already int	2018-01-29 09:01:23 +08:00
Connor Doyle	e5667cf426	Rename package deviceplugin => devicemanager.	2018-01-24 22:32:43 -08:00


61 Commits