Commit Graph

10 Commits

David Porter
e1a951afe5 Fix COS GPU driver installation
* Rely on the built-in GPU driver installer in COS, as recommended in the
  public docs: https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus
* Run `nvidia-smi` after installation to verify that the driver works (see the sketch below)
2021-10-28 17:49:50 -07:00
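A minimal sketch of the flow this commit relies on, following the public COS docs linked above (the `cos-extensions` command and the `/var/lib/nvidia` paths come from those docs; treat the details as assumptions, not the commit's exact code):

```sh
# Install the GPU driver using the installer built into COS.
sudo cos-extensions install gpu

# The driver lands in /var/lib/nvidia, which is mounted noexec by
# default; remount it so the bundled tools can run.
sudo mount --bind /var/lib/nvidia /var/lib/nvidia
sudo mount -o remount,exec /var/lib/nvidia

# Verify the installation, as the commit does.
/var/lib/nvidia/bin/nvidia-smi
```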
Danielle Lancashire
0cc8af82a1 e2e_node: use upstream gpu installer
The current GPU installer was built in 2017, from source that no longer
exists in Kubernetes ([adding commit][1]). The image was built on 2017-06-13.

Unfortunately, this installer no longer appears to work. When debugging
on the same node type as used by test-infra, it failed to build the
driver as the kernel sha was no longer available.

This led to needing to find a new way to install GPUs. The smallest
logical change was switching to [cos-gpu-installer][2]. There is a newer
version of it available on [googlesource][3] that I have not yet tested,
since the state of that project is unclear and I couldn't find docs
outside of the source itself.

We install things to the same location as before to avoid needing extra
downstream changes. There are a couple of oddities here, however, such as
needing to run the container twice to correctly update the LD cache (see
the sketch after this entry).

[1]: 1e77594958/cluster/gce/gci/nvidia-gpus/Dockerfile
[2]: https://github.com/GoogleCloudPlatform/cos-gpu-installer
[3]: https://cos.googlesource.com/cos/tools/+/refs/heads/master/src/cmd/cos_gpu_installer/
2021-08-26 14:09:45 +02:00
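For illustration, the invocation described in this commit looks roughly like the following. The image tag, the mount points, and the install directory are assumptions for the sketch, not details taken from the commit:

```sh
# Install to the same host directory the old installer used.
INSTALL_DIR=/home/kubernetes/bin/nvidia

# The installer container needs privileged access to the node.
docker run --rm --privileged --net=host \
  -v /dev:/dev \
  -v "${INSTALL_DIR}":/usr/local/nvidia \
  gcr.io/cos-cloud/cos-gpu-installer:v20210629  # tag is an assumption

# Oddity from the commit message: run it a second time so the
# LD cache is updated correctly.
docker run --rm --privileged --net=host \
  -v /dev:/dev \
  -v "${INSTALL_DIR}":/usr/local/nvidia \
  gcr.io/cos-cloud/cos-gpu-installer:v20210629
```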
Tim Hockin
3586986416 Switch to k8s.gcr.io vanity domain
This is the 2nd attempt.  The previous was reverted while we figured out
the regional mirrors (oops).

New plan: k8s.gcr.io is a read-only facade that auto-detects your source
region (us, eu, or asia for now) and pulls from the closest.  To publish
an image, push to k8s-staging.gcr.io and it will be synced to the regionals
automatically (similar to today).  For now the staging is an alias to
gcr.io/google_containers (the legacy URL).

When we move off of google-owned projects (working on it), then we just
do a one-time sync, and change the google-internal config, and nobody
outside should notice.

We can, in parallel, change the auto-sync into a manual sync - send a PR
to "promote" something from staging, and a bot activates it.  Nice and
visible, easy to keep track of.
2018-02-07 21:14:19 -08:00
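As a concrete example of what the facade means for consumers (the image name and tag are chosen for illustration):

```sh
# Before: pull from the legacy project URL.
docker pull gcr.io/google_containers/pause:3.0

# After: the vanity domain auto-detects your region (us, eu, or asia)
# and serves the image from the closest regional mirror.
docker pull k8s.gcr.io/pause:3.0
```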
Tim Hockin
e9dd8a68f6 Revert k8s.gcr.io vanity domain
This reverts commit eba5b6092a.

Fixes https://github.com/kubernetes/kubernetes/issues/57526
2017-12-22 14:36:16 -08:00
Tim Hockin
eba5b6092a Use k8s.gcr.io vanity domain for container images 2017-12-18 09:18:34 -08:00
zouyee
68c5ce19b8 [test/e2e_node] Redirect dl.k8s.io to the kubernetes-release GCS bucket 2017-11-02 12:18:50 +08:00
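The mapping behind that change, for illustration (dl.k8s.io redirects to the kubernetes-release GCS bucket; the release version and artifact are chosen arbitrarily):

```sh
# Instead of fetching through the redirect at
#   https://dl.k8s.io/release/v1.8.0/kubernetes-node-linux-amd64.tar.gz
# the tests fetch the same object from the bucket directly:
curl -fsSLO https://storage.googleapis.com/kubernetes-release/release/v1.8.0/kubernetes-node-linux-amd64.tar.gz
```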
Rohit Agarwal
9c0bf19f80 Use cos-stable-59-9460-60-0 and newer installer for GPU node e2e tests. 2017-06-13 15:36:20 -07:00
Rohit Agarwal
4a5badfafa Move the nvidia installer to the beginning.
When the installer runs for the first time, it disables loadpin and restarts
the node, so it is better to run it at the beginning and avoid redoing the
later steps. One of those later steps includes downloading a tar file and
untarring it; doing that only once saves around 1m30s of test runtime for
the GCI image.
2017-06-08 09:55:14 -07:00
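A runnable toy sketch of the ordering argument; the function bodies are made-up stand-ins, and only the ordering reflects the commit:

```sh
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-ins for the real setup steps.
install_nvidia_drivers() { echo "install drivers (disables loadpin, reboots node)"; }
fetch_and_untar_release() { echo "download + untar release tar (~1m30s)"; }

# Run the installer first: its first run reboots the node, and anything
# done before that reboot (like the download/untar) would be redone.
install_nvidia_drivers
fetch_and_untar_release
```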
Vishnu kannan
d45286c575 update cos kernel sha for node e2e GPU installer
Signed-off-by: Vishnu kannan <vishnuk@google.com>
2017-06-01 17:09:18 -07:00
Vishnu kannan
1e77594958 Adding an installer script that installs Nvidia drivers in Container Optimized OS
Packaged the script as a docker container stored in gcr.io/google-containers.
A daemonset deployment is included to make it easy to consume the installer.
A cluster e2e has been added to test the installation daemonset and to verify
the installation by running a sample CUDA application.
The node e2e for GPUs has been updated to avoid running on nodes without GPU
devices.

Signed-off-by: Vishnu kannan <vishnuk@google.com>
2017-05-20 21:17:19 -07:00
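A hedged usage sketch for the daemonset mentioned above; the manifest path and the pod label are guesses for illustration, not taken from the commit:

```sh
# Deploy the installer daemonset so each GPU node runs the installer
# container once (path and label are illustrative).
kubectl create -f cluster/gce/gci/nvidia-gpus/cos-installer-daemonset.yaml

# Watch the installer pods roll out across the nodes.
kubectl get pods -l name=nvidia-driver-installer --watch
```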