Initial design doc for AWS GPU support
This commit is contained in:
308
docs/proposals/gpu-support.md
Normal file
308
docs/proposals/gpu-support.md
Normal file
@@ -0,0 +1,308 @@
|
|||||||
|
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||||
|
|
||||||
|
<!-- BEGIN STRIP_FOR_RELEASE -->
|
||||||
|
|
||||||
|
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
|
||||||
|
width="25" height="25">
|
||||||
|
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
|
||||||
|
width="25" height="25">
|
||||||
|
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
|
||||||
|
width="25" height="25">
|
||||||
|
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
|
||||||
|
width="25" height="25">
|
||||||
|
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
|
||||||
|
width="25" height="25">
|
||||||
|
|
||||||
|
<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>
|
||||||
|
|
||||||
|
If you are using a released version of Kubernetes, you should
|
||||||
|
refer to the docs that go with that version.
|
||||||
|
|
||||||
|
Documentation for other releases can be found at
|
||||||
|
[releases.k8s.io](http://releases.k8s.io).
|
||||||
|
</strong>
|
||||||
|
--
|
||||||
|
|
||||||
|
<!-- END STRIP_FOR_RELEASE -->
|
||||||
|
|
||||||
|
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||||
|
|
||||||
|
<!-- BEGIN MUNGE: GENERATED_TOC -->
|
||||||
|
|
||||||
|
- [GPU support](#gpu-support)
|
||||||
|
- [Objective](#objective)
|
||||||
|
- [Background](#background)
|
||||||
|
- [Detailed discussion](#detailed-discussion)
|
||||||
|
- [Inventory](#inventory)
|
||||||
|
- [Scheduling](#scheduling)
|
||||||
|
- [The runtime](#the-runtime)
|
||||||
|
- [NVIDIA support](#nvidia-support)
|
||||||
|
- [Event flow](#event-flow)
|
||||||
|
- [Too complex for now: nvidia-docker](#too-complex-for-now-nvidia-docker)
|
||||||
|
- [Implementation plan](#implementation-plan)
|
||||||
|
- [V0](#v0)
|
||||||
|
- [Scheduling](#scheduling)
|
||||||
|
- [Runtime](#runtime)
|
||||||
|
- [Other](#other)
|
||||||
|
- [Future work](#future-work)
|
||||||
|
- [V1](#v1)
|
||||||
|
- [V2](#v2)
|
||||||
|
- [V3](#v3)
|
||||||
|
- [Undetermined](#undetermined)
|
||||||
|
- [Security considerations](#security-considerations)
|
||||||
|
|
||||||
|
<!-- END MUNGE: GENERATED_TOC -->
|
||||||
|
|
||||||
|
# GPU support
|
||||||
|
|
||||||
|
Author: @therc
|
||||||
|
|
||||||
|
Date: Apr 2016
|
||||||
|
|
||||||
|
Status: Design in progress, early implementation of requirements
|
||||||
|
|
||||||
|
## Objective
|
||||||
|
|
||||||
|
Users should be able to request GPU resources for their workloads, as easily as
|
||||||
|
for CPU or memory. Kubernetes should keep an inventory of machines with GPU
|
||||||
|
hardware, schedule containers on appropriate nodes and set up the container
|
||||||
|
environment with all that's necessary to access the GPU. All of this should
|
||||||
|
eventually be supported for clusters on either bare metal or cloud providers.
|
||||||
|
|
||||||
|
## Background
|
||||||
|
|
||||||
|
An increasing number of workloads, such as machine learning and seismic survey
|
||||||
|
processing, benefits from offloading computations to graphic hardware. While not
|
||||||
|
as tuned as traditional, dedicated high performance computing systems such as
|
||||||
|
MPI, a Kubernetes cluster can still be a great environment for organizations
|
||||||
|
that need a variety of additional, "classic" workloads, such as database, web
|
||||||
|
serving, etc.
|
||||||
|
|
||||||
|
GPU support is hard to provide extensively and will thus take time to tame
|
||||||
|
completely, because
|
||||||
|
|
||||||
|
- different vendors expose the hardware to users in different ways
|
||||||
|
- some vendors require fairly tight coupling between the kernel driver
|
||||||
|
controlling the GPU and the libraries/applications that access the hardware
|
||||||
|
- it adds more resource types (whole GPUs, GPU cores, GPU memory)
|
||||||
|
- it can introduce new security pitfalls
|
||||||
|
- for systems with multiple GPUs, affinity matters, similarly to NUMA
|
||||||
|
considerations for CPUs
|
||||||
|
- running GPU code in containers is still a relatively novel idea
|
||||||
|
|
||||||
|
## Detailed discussion
|
||||||
|
|
||||||
|
Currently, this document is mostly focused on the basic use case: run GPU code
|
||||||
|
on AWS `g2.2xlarge` EC2 machine instances using Docker. It constitutes a narrow
|
||||||
|
enough scenario that it does not require large amounts of generic code yet. GCE
|
||||||
|
doesn't support GPUs at all; bare metal systems throw a lot of extra variables
|
||||||
|
into the mix.
|
||||||
|
|
||||||
|
Later sections will outline future work to support a broader set of hardware,
|
||||||
|
environments and container runtimes.
|
||||||
|
|
||||||
|
### Inventory
|
||||||
|
|
||||||
|
Before any scheduling can occur, we need to know what's available out there. In
|
||||||
|
v0, we'll hardcode capacity detected by the kubelet based on a flag,
|
||||||
|
`--experimental-nvidia-gpu`. This will result in the user-defined resource
|
||||||
|
`alpha.kubernetes.io/nvidia-gpu` to be reported for `NodeCapacity` and
|
||||||
|
`NodeAllocatable`, as well as as a node label.
|
||||||
|
|
||||||
|
### Scheduling
|
||||||
|
|
||||||
|
GPUs will be visible as first-class resources. In v0, we'll only assign whole
|
||||||
|
devices; sharing among multiple pods is left to future implementations. It's
|
||||||
|
probable that GPUs will exacerbate the need for [a rescheduler](rescheduler.md)
|
||||||
|
or pod priorities, especially if the nodes in a cluster are not homogeneous.
|
||||||
|
Consider these two cases:
|
||||||
|
|
||||||
|
> Only half of the machines have a GPU and they're all busy with other
|
||||||
|
workloads. The other half of the cluster is doing very little work. A GPU
|
||||||
|
workload comes, but it can't schedule, because the devices are sitting idle on
|
||||||
|
nodes that are running something else and the nodes with little load lack the
|
||||||
|
hardware.
|
||||||
|
|
||||||
|
> Some or all the machines have two graphic cards each. A number of jobs get
|
||||||
|
scheduled, requesting one device per pod. The scheduler puts them all on
|
||||||
|
different machines, spreading the load, perhaps by design. Then a new job comes
|
||||||
|
in, requiring two devices per pod, but it can't schedule anywhere, because all
|
||||||
|
we can find, at most, is one unused device per node.
|
||||||
|
|
||||||
|
### The runtime
|
||||||
|
|
||||||
|
Once we know where to run the container, it's time to set up its environment. At
|
||||||
|
a minimum, we'll need to map the host device(s) into the container. Because each
|
||||||
|
manufacturer exposes different device nodes (`/dev/ati/card0`, `/dev/nvidia0`,
|
||||||
|
but also the required `/dev/nvidiactl` and `/dev/nvidia-uvm`), some of the logic
|
||||||
|
needs to be hardware-specific, mapping from a logical device to a list of device
|
||||||
|
nodes necessary for software to talk to it.
|
||||||
|
|
||||||
|
Support binaries and libraries are often versioned along with the kernel module,
|
||||||
|
so there should be further hooks to project those under `/bin` and some kind of
|
||||||
|
`/lib` before the application is started. This can be done for Docker with the
|
||||||
|
use of a versioned [Docker
|
||||||
|
volume](https://docs.docker.com/engine/userguide/containers/dockervolumes/) or
|
||||||
|
with upcoming Kubernetes-specific hooks such as init containers and volume
|
||||||
|
containers. In v0, images are expected to bundle everything they need.
|
||||||
|
|
||||||
|
#### NVIDIA support
|
||||||
|
|
||||||
|
The first implementation and testing ground will be for NVIDIA devices, by far
|
||||||
|
the most common setup.
|
||||||
|
|
||||||
|
In v0, the `--experimental-nvidia-gpu` flag will also result in the host devices
|
||||||
|
(limited to those required to drive the first card, `nvidia0`) to be mapped into
|
||||||
|
the container by the dockertools library.
|
||||||
|
|
||||||
|
### Event flow
|
||||||
|
|
||||||
|
This is what happens before and after an user schedules a GPU pod.
|
||||||
|
|
||||||
|
1. Administrator installs a number of Kubernetes nodes with GPUs. The correct
|
||||||
|
kernel modules and device nodes under `/dev/` are present.
|
||||||
|
|
||||||
|
1. Administrator makes sure the latest CUDA/driver versions are installed.
|
||||||
|
|
||||||
|
1. Administrator enables `--experimental-nvidia-gpu` on kubelets
|
||||||
|
|
||||||
|
1. Kubelets update node status with information about the GPU device, in addition
|
||||||
|
to cAdvisor's usual data about CPU/memory/disk
|
||||||
|
|
||||||
|
1. User creates a Docker image compiling their application for CUDA, bundling
|
||||||
|
the necessary libraries. We ignore any versioning requirements in the image
|
||||||
|
using labels based on [NVIDIA's
|
||||||
|
conventions](https://github.com/NVIDIA/nvidia-docker/blob/64510511e3fd0d00168eb076623854b0fcf1507d/tools/src/nvidia-docker/utils.go#L13).
|
||||||
|
|
||||||
|
1. User creates a pod using the image, requiring
|
||||||
|
`alpha.kubernetes.io/nvidia-gpu: 1`
|
||||||
|
|
||||||
|
1. Scheduler picks a node for the pod
|
||||||
|
|
||||||
|
1. The kubelet notices the GPU requirement and maps the three devices. In
|
||||||
|
Docker's engine-api, this means it'll add them to the Resources.Devices list.
|
||||||
|
|
||||||
|
1. Docker runs the container to completion
|
||||||
|
|
||||||
|
1. The scheduler notices that the device is available again
|
||||||
|
|
||||||
|
### Too complex for now: nvidia-docker
|
||||||
|
|
||||||
|
For v0, we discussed at length, but decided to leave aside initially the
|
||||||
|
[nvidia-docker plugin](https://github.com/NVIDIA/nvidia-docker). The plugin is
|
||||||
|
an officially supported solution, thus avoiding a lot of new low level code, as
|
||||||
|
it takes care of functionality such as:
|
||||||
|
|
||||||
|
- creating a Docker volume with binaries such as `nvidia-smi` and shared
|
||||||
|
libraries
|
||||||
|
- providing HTTP endpoints that monitoring tools can use to collect GPU metrics
|
||||||
|
- abstracting details such as `/dev` entry names for each device, as well as
|
||||||
|
control ones like `nvidiactl`
|
||||||
|
|
||||||
|
The `nvidia-docker` wrapper also verifies that the CUDA version required by a
|
||||||
|
given image is supported by the host drivers, through inspection of well-known
|
||||||
|
image labels, if present. We should try to provide equivalent checks, either
|
||||||
|
for CUDA or OpenCL.
|
||||||
|
|
||||||
|
This is current sample output from `nvidia-docker-plugin`, wrapped for
|
||||||
|
readability:
|
||||||
|
|
||||||
|
$ curl -s localhost:3476/docker/cli
|
||||||
|
--device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0
|
||||||
|
--volume-driver=nvidia-docker
|
||||||
|
--volume=nvidia_driver_352.68:/usr/local/nvidia:ro
|
||||||
|
|
||||||
|
It runs as a daemon listening for HTTP requests on port 3476. The endpoint above
|
||||||
|
returns flags that need to be added to the Docker command line in order to
|
||||||
|
expose GPUs to the containers. There are optional URL arguments to request
|
||||||
|
specific devices if more than one are present on the system, as well as specific
|
||||||
|
versions of the support software. An obvious improvement is an additional
|
||||||
|
endpoint for JSON output.
|
||||||
|
|
||||||
|
The unresolved question is whether `nvidia-docker-plugin` would run standalone
|
||||||
|
as it does today (called over HTTP, perhaps with endpoints for a new Kubernetes
|
||||||
|
resource API) or whether the relevant code from its `nvidia` package should be
|
||||||
|
linked directly into kubelet. A partial list of tradeoffs:
|
||||||
|
|
||||||
|
| | External binary | Linked in |
|
||||||
|
|---------------------|---------------------------------------------------------------------------------------------------|--------------------------------------------------------------|
|
||||||
|
| Use of cgo | Confined to binary | Linked into kubelet, but with lazy binding |
|
||||||
|
| Expandibility | Limited if we run the plugin, increased if library is used to build a Kubernetes-tailored daemon. | Can reuse the `nvidia` library as we prefer |
|
||||||
|
| Bloat | None | Larger kubelet, even for systems without GPUs |
|
||||||
|
| Reliability | Need to handle the binary disappearing at any time | Fewer headeaches |
|
||||||
|
| (Un)Marshalling | Need to talk over JSON | None |
|
||||||
|
| Administration cost | One more daemon to install, configure and monitor | No extra work required, other than perhaps configuring flags |
|
||||||
|
| Releases | Potentially on its own schedule | Tied to Kubernetes' |
|
||||||
|
|
||||||
|
## Implementation plan
|
||||||
|
|
||||||
|
### V0
|
||||||
|
|
||||||
|
The first two tracks can progress in parallel.
|
||||||
|
|
||||||
|
#### Scheduling
|
||||||
|
|
||||||
|
1. Define new resource `alpha.kubernetes.io:nvidia-gpu` in `pkg/api/types.go`
|
||||||
|
and co.
|
||||||
|
1. Plug resource into feasability checks used by kubelet, scheduler and
|
||||||
|
schedulercache. Maybe gated behind a flag?
|
||||||
|
1. Plug resource into resource_helpers.go
|
||||||
|
1. Plug resource into the limitranger
|
||||||
|
|
||||||
|
#### Runtime
|
||||||
|
|
||||||
|
1. Add kubelet config parameter to enable the resource
|
||||||
|
1. Make kubelet's `setNodeStatusMachineInfo` report the resource
|
||||||
|
1. Add a Devices list to container.RunContainerOptions
|
||||||
|
1. Use it from DockerManager's runContainer
|
||||||
|
1. Do the same for rkt (stretch goal)
|
||||||
|
1. When a pod requests a GPU, add the devices to the container options
|
||||||
|
|
||||||
|
#### Other
|
||||||
|
|
||||||
|
1. Add new resource to `kubectl describe` output. Optional for non-GPU users?
|
||||||
|
1. Administrator documentation, with sample scripts
|
||||||
|
1. User documentation
|
||||||
|
|
||||||
|
## Future work
|
||||||
|
|
||||||
|
Above all, we need to collect feedback from real users and use that to set
|
||||||
|
priorities for any of the items below.
|
||||||
|
|
||||||
|
### V1
|
||||||
|
|
||||||
|
- Perform real detection of the installed hardware
|
||||||
|
- Figure a standard way to avoid bundling of shared libraries in images
|
||||||
|
- Support fractional resources so multiple pods can share the same GPU
|
||||||
|
- Support bare metal setups
|
||||||
|
- Report resource usage
|
||||||
|
|
||||||
|
### V2
|
||||||
|
|
||||||
|
- Support multiple GPUs with resource hierarchies and affinities
|
||||||
|
- Support versioning of resources (e.g. "CUDA v7.5+")
|
||||||
|
- Build resource plugins into the kubelet?
|
||||||
|
- Support other device vendors
|
||||||
|
- Support Azure?
|
||||||
|
- Support rkt?
|
||||||
|
|
||||||
|
### V3
|
||||||
|
|
||||||
|
- Support OpenCL (so images can be device-agnostic)
|
||||||
|
|
||||||
|
### Undetermined
|
||||||
|
|
||||||
|
It makes sense to turn the output of this project (external resource plugins,
|
||||||
|
etc.) into a more generic abstraction at some point.
|
||||||
|
|
||||||
|
|
||||||
|
## Security considerations
|
||||||
|
|
||||||
|
There should be knobs for the cluster administrator to only allow certain users
|
||||||
|
or roles to schedule GPU workloads. Overcommitting or sharing the same device
|
||||||
|
across different pods is not considered safe. It should be possible to segregate
|
||||||
|
such GPU-sharing pods by user, namespace or a combination thereof.
|
||||||
|
|
||||||
|
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||||
|
[]()
|
||||||
|
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
Reference in New Issue
Block a user