In older versions of Kubernetes (at least pre-0.19, it's the earliest
this test will run unmodified on), Pods that depended on devices could be
restarted after the device plugin had been removed. Currently however,
this isn't possible, as during ContainerManager.GetResources(), we
attempt to DeviceManager.GetDeviceRunContainerOptions() which fails as
there's no cached endpoint information for the plugin type.
This commit therefore breaks apart the existing test into two:
- One active test that validates that assignments are maintained across
restarts
- One skipped test that validates the behaviour after GPUs have been
removed, in case we decide that this is a bug that should be fixed in
the future.
15m is enough for Cluster Autoscaler to remove empty nodes, so we need
to break them sooner than that. Instead, wait 15m after breaking them to
ensure Cluster Autoscaler will consider them as unready instead of still
starting.
The e2e test "should have Endpoints and EndpointSlices pointing to
the API Server Service" was veryfing the current endpoints
reconciler implementation on the apiservers, however, users may
disable the endpoint reconciler and create their own.
This e2e test is also a conformance test, so we should test the
behaviour and not the implementation details. The test verifies
that a kubernetes.default service exist, an endpoint and endpoint
slices object referencing that service exist and are equivalent.
The Container Images for Windows Server 2022 have been published, and we can
start adding jobs for them.
The ltsc2022-based images have been built and promoted with these image versions.
The PR https://github.com/kubernetes/kubernetes/pull/104575 introduces
some intermediate types which makes the 32GiB memory machine kill the
typecheck process. To resolve that issue and make the test more robust,
we now reduce the amount of parallel typechecks to run to `2`.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Prior to this change, the pod was not getting scheduled on the node as
we don't have a running scheduler in e2e_node. PodClient solves this
problem by manually assigning the pod to the node.
The current GPU installer was built in 2017, from source that no longer
exists in Kubernetes ([adding commit][1]. The image was built on 2017-06-13.
Unfortunately, this installer no longer appears to work. When debugging
on the same node type as used by test-infra, it failed to build the
driver as the kernel sha was no longer available.
This lead to needing to find a new way to install GPUs. The smallest
logical change was switching to [cos-gpu-installer][2]
. There is a newer version of this available on [googlesource][3] that
I have not yet tested as it's not clear what the state of the project
is, as I couldn't find docs outside of the source itself.
We install things to the same location as previously to avoid needing
extra downstream changes. There are a couple of weird issues here
however, like needing to run the container twice to correctly update the
LD Cache.
[1]: 1e77594958/cluster/gce/gci/nvidia-gpus/Dockerfile
[2]: https://github.com/GoogleCloudPlatform/cos-gpu-installer
[3]: https://cos.googlesource.com/cos/tools/+/refs/heads/master/src/cmd/cos_gpu_installer/
Different CSI drivers have different error messages, making it difficult
to check them accurately. We remove the check for the error message and
only check the failure type instead, since that is all we need.
If device plugin returns device without topology, keep it internaly
as NUMA node -1, it helps at podresources level to not export NUMA
topology, otherwise topology is exported with NUMA node id 0,
which is not accurate.
It's imposible to unveile this bug just by tracing json.Marshal(resp)
in podresource client, because NUMANodes field ID has json property
omitempty, in this case when ID=0 shown as emtpy NUMANode.
To reproduce it, better to iterate on devices and just
trace dev.Topology.Nodes[0].ID.
Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>