Kubernetes Submit Queue 6de28fab7d Merge pull request #42942 from vishh/gpu-cont-fix (2017-03-14 10:19:17 -07:00)
Automatic merge from submit-queue (batch tested with PRs 42942, 42935)

[Bug] Handle container restarts and avoid using runtime pod cache while allocating GPUs

Fixes #42412

**Background**
Support for multiple GPUs is an experimental feature in v1.6. 
Container restarts were handled incorrectly, which resulted in GPUs being stranded.
The kubelet was also incorrectly using the runtime cache to track running pods, which can lead to race conditions (as it has in other parts of the kubelet). This can result in the same GPU being assigned to multiple pods.

**What does this PR do**
This PR tracks the assignment of GPUs to containers and, on container restart, returns the previously allocated GPUs instead of (incorrectly) allocating new ones.
The GPU manager is updated to consume a list of active pods derived from the apiserver cache instead of the runtime cache.
The node e2e suite has been extended to cover this failure scenario.
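
For illustration, a minimal sketch of the per-container bookkeeping described above. The `podGPUs` type, field names, and `allocate` signature here are illustrative assumptions, not the actual kubelet GPU manager API:

```go
// Sketch: remember which devices were already handed to a container so a
// restart reuses them instead of stranding them and allocating new ones.
package gpu

import (
	"fmt"
	"sync"
)

// podGPUs tracks device assignments keyed by pod UID and container name.
type podGPUs struct {
	mu          sync.Mutex
	assignments map[string]map[string][]string // podUID -> container -> device paths
}

func newPodGPUs() *podGPUs {
	return &podGPUs{assignments: map[string]map[string][]string{}}
}

// allocate returns the devices previously assigned to the container if any
// exist (e.g. the container is restarting); otherwise it takes the requested
// number of devices from the free list and records the assignment.
func (p *podGPUs) allocate(podUID, container string, free []string, n int) ([]string, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if prev, ok := p.assignments[podUID][container]; ok {
		// Container restart: reuse the existing assignment.
		return prev, nil
	}
	if len(free) < n {
		return nil, fmt.Errorf("requested %d GPUs, only %d free", n, len(free))
	}
	devs := append([]string(nil), free[:n]...)
	if p.assignments[podUID] == nil {
		p.assignments[podUID] = map[string][]string{}
	}
	p.assignments[podUID][container] = devs
	return devs, nil
}
```

In this sketch the free list would be computed from the set of active pods supplied by the apiserver cache, so the GPU manager never has to consult the runtime cache.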

**Risk**
Minimal/none, since GPU support is an experimental feature that is turned off by default. The change is also isolated to the GPU manager in the kubelet.

**Workarounds**
In the absence of this PR, users can mitigate the original issue by setting `RestartPolicyNever` in their pods (see the sketch below).
There is, however, no workaround for the race condition caused by using the runtime cache.
Hence it is worth including this fix in v1.6.0.
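
A sketch of that workaround, assuming current `k8s.io/api` package paths (at the time of this PR the types lived under client-go); the pod name and image are illustrative:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// gpuPod builds a pod whose containers are never restarted in place, so a
// crash cannot trigger a second (stranding) GPU allocation.
func gpuPod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "cuda-job"},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:  "cuda",
				Image: "nvidia/cuda",
			}},
		},
	}
}

func main() {
	fmt.Println(gpuPod().Spec.RestartPolicy) // prints "Never"
}
```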

cc @jianzhangbjz @seelam @kubernetes/sig-node-pr-reviews 

Replaces #42560