Document why teardownSRIOVPod has to wait for all the containers
to be gone before returning, and why this is important.
Additionally, change the code to wait for all the containers to be gone,
not just the first. This is both a little cleaner and a little safer,
even though the current code seems to have caused no issues so far.
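The "wait for all, not just the first" behaviour can be sketched as below.
This is a minimal illustration, not the code in teardownSRIOVPod:
waitForAllGone and the isGone probe are hypothetical names, and the
container state is faked with a counter.

```go
package main

import (
	"fmt"
	"time"
)

// waitForAllGone polls until isGone reports true for every container name,
// or returns an error once the timeout expires.
// isGone is a hypothetical probe supplied by the caller.
func waitForAllGone(names []string, isGone func(name string) bool, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	pending := names
	for {
		var remaining []string
		for _, name := range pending {
			if !isGone(name) {
				remaining = append(remaining, name)
			}
		}
		if len(remaining) == 0 {
			return nil // every container is gone, not just the first one
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("containers still present after %v: %v", timeout, remaining)
		}
		pending = remaining
		time.Sleep(interval)
	}
}

func main() {
	// Fake probe: each container disappears after being polled three times.
	polls := map[string]int{}
	isGone := func(name string) bool {
		polls[name]++
		return polls[name] >= 3
	}
	err := waitForAllGone([]string{"cnt-0", "cnt-1"}, isGone, 10*time.Millisecond, time.Second)
	fmt.Println("error:", err)
}
```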
Signed-off-by: Francesco Romani <fromani@redhat.com>
In the CPU manager and topology manager e2e tests, a test step can fail
before the test gets to clean up the CPU manager state file.
Move the deletion of the state file to `AfterEach` to guarantee that
the state file is always removed from the node.
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
The topology manager e2e tests want to verify resource alignment using
devices, and the easiest devices to use at the moment are SRIOV
devices.
The resource alignment test cases run in a loop, once for each
supported policy.
The tests manage the SRIOV device plugin; up until now, the plugin
was set up and torn down on each iteration of the loop.
There is no real need for that. Each iteration must reconfigure (thus
restart) the kubelet, but the device plugin can be set up and torn down
just once for all the policies.
The kubelet can reconnect just fine to a running device plugin.
This way, we greatly reduce the interactions and the complexity of the
test environment, making it easier to understand and more robust, and
we trim down some minutes from execution time.
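The resulting flow can be sketched as below. This is only an
illustration of the ordering; devicePlugin, runPolicyLoop and the two
callbacks are hypothetical stand-ins, not the helpers used by the tests.

```go
package main

import "fmt"

// devicePlugin is a stand-in that only counts how often it is set up
// and torn down.
type devicePlugin struct {
	setups, teardowns int
}

func (dp *devicePlugin) setup()    { dp.setups++ }
func (dp *devicePlugin) teardown() { dp.teardowns++ }

// runPolicyLoop sets up the device plugin once, then reconfigures the
// kubelet and runs the test cases for each policy.
func runPolicyLoop(policies []string, dp *devicePlugin, reconfigureKubelet, runCases func(policy string)) {
	dp.setup()
	defer dp.teardown() // runs once, after the last policy

	for _, policy := range policies {
		reconfigureKubelet(policy) // implies a kubelet restart
		runCases(policy)
	}
}

func main() {
	dp := &devicePlugin{}
	restarts := 0
	policies := []string{"best-effort", "restricted", "single-numa-node"}
	runPolicyLoop(policies, dp,
		func(policy string) { restarts++ },
		func(policy string) { fmt.Println("running alignment cases for", policy) },
	)
	fmt.Printf("kubelet restarts=%d, plugin setups=%d, teardowns=%d\n", restarts, dp.setups, dp.teardowns)
}
```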
However, this patch also hides (not solves) a test flake we observed
in some environments. The issue is hard to reproduce and not well
understood, but seems to be caused by doing the sriov dp setup/teardown
in each policy testing loop.
Investigation so far suggests that the kubelet sometimes has stale
state after the sriovdp teardown/setup cycle, leading to flakes and
false negatives.
We tried to address this in https://github.com/kubernetes/kubernetes/pull/95611
with no conclusive results yet.
This patch was posted because overall we believe its gains
exceed the drawbacks (hiding the aforementioned flake) and
because understanding the potential interaction issues between the
sriovdp and the kubelet deserves a separate test.
Signed-off-by: Francesco Romani <fromani@redhat.com>
A suite of e2e tests was created for Topology Manager
to test the pod scope alignment feature.
Co-authored-by: Pawel Rapacz <p.rapacz@partner.samsung.com>
Co-authored-by: Krzysztof Wiatrzyk <k.wiatrzyk@samsung.com>
Signed-off-by: Cezary Zukowski <c.zukowski@samsung.com>
Due to a rebase glitch the fmt.Sprintf() was lost.
This patch restores it, improving the readability of the logs.
Signed-off-by: Francesco Romani <fromani@redhat.com>
We need to make sure we tear down the sriov device plugin pod
should the tests fail, to avoid leaking pods in the test environment.
Signed-off-by: Francesco Romani <fromani@redhat.com>
This drops testfiles.ReadOrDie and updates testfiles.Exists to return an
error, forcing the caller to decide whether to call framework.Fail or do
something else.
It makes for a slightly less friendly API, but also means the package is
decoupled from framework again, as per the comments at the top of the
file.
Make sure the SR-IOV device plugin is ready, and that
there are enough SR-IOV devices allocatable before
spinning up test pods.
Signed-off-by: vpickard <vpickard@redhat.com>
Due to an oversight, the e2e topology manager tests
were leaking a configmap and a serviceaccount.
This patch ensures a proper cleanup.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Up until now, the test validated the alignment of resources
only in the first container in a pod. That was just an oversight.
With this patch, we validate all the containers in a given pod.
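A minimal sketch of the per-container check follows. It assumes that
"aligned" means every CPU and device of a container sits on the same
NUMA node; containerResources and checkPodAlignment are hypothetical
names and do not mirror the types used by the test.

```go
package main

import "fmt"

// containerResources is a stand-in for what a test container reports.
type containerResources struct {
	Name      string
	NUMANodes []int // NUMA node of every CPU and device assigned to the container
}

// checkPodAlignment validates every container in the pod, not only the
// first one, and reports the first misaligned container it finds.
func checkPodAlignment(containers []containerResources) error {
	for _, c := range containers {
		for _, node := range c.NUMANodes {
			if node != c.NUMANodes[0] {
				return fmt.Errorf("container %q: resources span NUMA nodes %d and %d",
					c.Name, c.NUMANodes[0], node)
			}
		}
	}
	return nil
}

func main() {
	pod := []containerResources{
		{Name: "cnt-0", NUMANodes: []int{0, 0, 0}},
		{Name: "cnt-1", NUMANodes: []int{1, 0}}, // only caught when all containers are checked
	}
	fmt.Println(checkPodAlignment(pod))
}
```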
Signed-off-by: Francesco Romani <fromani@redhat.com>
Add autodetection code to figure out which NUMA node the devices
are attached to.
This autodetection works under the assumption that all the VFs in
the system must be used for the tests.
Should this not be the case, or in general to handle non-trivial
configurations, we keep the annotations mechanism added to the
SRIOV device plugin config map.
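One way such autodetection could work is sketched below. It is not the
code added by this patch: it assumes the Linux sysfs layout, where a VF
has a "physfn" entry pointing at its Physical Function and a "numa_node"
file, and it reads from an fs.FS so it can run on a made-up in-memory
tree. vfsPerNUMANode and demoSysfs are hypothetical names, and the PCI
addresses are invented.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"strconv"
	"strings"
	"testing/fstest"
)

// pciDevicesDir is where sysfs lists PCI devices, relative to /sys.
const pciDevicesDir = "bus/pci/devices"

// vfsPerNUMANode counts Virtual Functions per NUMA node.
// A device is treated as a VF when it has a "physfn" entry.
func vfsPerNUMANode(sysfs fs.FS) (map[int]int, error) {
	entries, err := fs.ReadDir(sysfs, pciDevicesDir)
	if err != nil {
		return nil, err
	}
	counts := make(map[int]int)
	for _, entry := range entries {
		devDir := path.Join(pciDevicesDir, entry.Name())
		if _, err := fs.Stat(sysfs, path.Join(devDir, "physfn")); err != nil {
			continue // no physfn entry: not a VF
		}
		raw, err := fs.ReadFile(sysfs, path.Join(devDir, "numa_node"))
		if err != nil {
			return nil, err
		}
		node, err := strconv.Atoi(strings.TrimSpace(string(raw)))
		if err != nil {
			return nil, fmt.Errorf("device %s: %w", entry.Name(), err)
		}
		counts[node]++
	}
	return counts, nil
}

// demoSysfs builds an invented tree: one PF and two VFs on node 0,
// one VF on node 1.
func demoSysfs() fs.FS {
	return fstest.MapFS{
		"bus/pci/devices/0000:3b:00.0/numa_node": {Data: []byte("0\n")}, // PF
		"bus/pci/devices/0000:3b:02.0/numa_node": {Data: []byte("0\n")},
		"bus/pci/devices/0000:3b:02.0/physfn":    {},
		"bus/pci/devices/0000:3b:02.1/numa_node": {Data: []byte("0\n")},
		"bus/pci/devices/0000:3b:02.1/physfn":    {},
		"bus/pci/devices/0000:af:02.0/numa_node": {Data: []byte("1\n")},
		"bus/pci/devices/0000:af:02.0/physfn":    {},
	}
}

func main() {
	counts, err := vfsPerNUMANode(demoSysfs())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("VFs per NUMA node:", counts)
}
```

On a real node the same function could be pointed at os.DirFS("/sys").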
Signed-off-by: Francesco Romani <fromani@redhat.com>
The e2e_node topology_manager tests have an early, quick check
to rule out systems without SRIOV devices, thus skipping the tests.
The first version of the check detected PFs (Physical Functions),
under the assumption that VFs (Virtual Functions) had already been
created. This works because, obviously, you can't have VFs without PFs.
However, it's a little safer and easier to understand if we check
directly for VFs, bailing out on systems which don't provide them.
Nothing changes for properly configured test systems.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Reorganize the code with setup and teardown functions,
to make room for the future addition of more device plugin
support, and to make the code a bit tidier.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Add a helper function to check if a Pod failed
admission for Topology Affinity Error.
So far we only check the Status.Reason.
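Such a helper could look like the sketch below. podStatus is a local
stand-in for the pod status type, and the pattern matched against the
reason is an assumption about the string the kubelet reports, so treat
both as illustrative.

```go
package main

import (
	"fmt"
	"regexp"
)

// podStatus is a stand-in carrying only the field the helper inspects.
type podStatus struct {
	Reason string
}

// topologyAffinityRe is an assumed pattern for the rejection reason.
var topologyAffinityRe = regexp.MustCompile(`Topology.*Affinity.*Error`)

// isTopologyAffinityError reports whether the pod was rejected for a
// topology affinity error, judging only by Status.Reason.
func isTopologyAffinityError(status podStatus) bool {
	return topologyAffinityRe.MatchString(status.Reason)
}

func main() {
	fmt.Println(isTopologyAffinityError(podStatus{Reason: "TopologyAffinityError"}))
	fmt.Println(isTopologyAffinityError(podStatus{Reason: "SomeOtherReason"}))
}
```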
Signed-off-by: Francesco Romani <fromani@redhat.com>
Five minutes was initially used only to be overcautious.
From my experiments, the node is usually ready in less than a minute.
Double it to give some buffer space.
Signed-off-by: Francesco Romani <fromani@redhat.com>
To properly implement some e2e tests, we need to know
some basic topology facts about the system running the tests.
The bare minimum we need to know is how many PCI SRIOV devices
are attached to which NUMA node.
This way we know which core we can reserve for kube services,
and which NUMA socket we can take to test full socket reservation.
To let the tests know the PCI device topology, we use annotations
in the SRIOV device plugin ConfigMap we need anyway.
The format is
```yaml
metadata:
  annotations:
    pcidevice_node0: "2"
    pcidevice_node1: "0"
```
with one annotation per NUMA node in the system.
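Reading these annotations back could be done roughly as follows. This is
a sketch under the format described above; devicesPerNUMANode is a
hypothetical name, and ignoring unrelated annotations is a choice made
here, not something the patch states.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// annotationPrefix is the key prefix described above; the NUMA node
// number follows it.
const annotationPrefix = "pcidevice_node"

// devicesPerNUMANode turns pcidevice_nodeN annotations into a map from
// NUMA node to device count. Annotations with other keys are ignored.
func devicesPerNUMANode(annotations map[string]string) (map[int]int, error) {
	counts := make(map[int]int)
	for key, value := range annotations {
		if !strings.HasPrefix(key, annotationPrefix) {
			continue
		}
		node, err := strconv.Atoi(strings.TrimPrefix(key, annotationPrefix))
		if err != nil {
			return nil, fmt.Errorf("annotation %q: bad NUMA node: %w", key, err)
		}
		count, err := strconv.Atoi(value)
		if err != nil {
			return nil, fmt.Errorf("annotation %q: bad device count: %w", key, err)
		}
		counts[node] = count
	}
	return counts, nil
}

func main() {
	counts, err := devicesPerNUMANode(map[string]string{
		"pcidevice_node0": "2",
		"pcidevice_node1": "0",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("devices per NUMA node:", counts)
}
```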
Signed-off-by: Francesco Romani <fromani@redhat.com>
A negative test is one where we request a guaranteed (gu) Pod that we
know the system cannot fulfill, hence we expect rejection from the
topology manager.
Unfortunately, besides the trivial case of excessive cores (requesting
more cores than a NUMA node provides) we cannot easily test the
devices, because crafting a proper pod would require detailed knowledge
of the hw topology.
Let's consider a hypothetical two-node NUMA system with two PCIe buses,
one per NUMA node, with an SRIOV device on each bus.
A proper negative test would require two SRIOV devices, which the system
can provide, but not on the same single NUMA node.
Requiring for example three devices (one more than the system provides)
would lead to a different, legitimate admission error.
For these reasons we bootstrap the testing infra for the negative tests,
but we add just the simplest one.
Signed-off-by: Francesco Romani <fromani@redhat.com>
We cannot anticipate all the possible configurations
needed by the SRIOV device plugin: there is too much variety.
Hence, we need to allow the test environment to supply
a host-specific ConfigMap to properly configure the device
plugin and avoid false negatives.
We still provide the default config map as a fallback and reference.
Signed-off-by: Francesco Romani <fromani@redhat.com>
The SRIOV device plugin can create different resources depending
on both the hardware present on the system and the configuration.
As long as we have at least one SRIOV device, the tests don't actually
care about which specific device it is.
Previously, the test hardcoded the most common Intel SRIOV device
identifier. This patch lifts the restriction and lets the test
autodetect and use what's available.
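The autodetection could be sketched as below. The allocatable map stands
in for the node's allocatable resources, findSRIOVResource is a
hypothetical name, and matching on the substring "sriov" is an
assumption of this sketch rather than the rule the test applies. The
resource names are invented.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// findSRIOVResource returns the first resource whose name mentions
// "sriov" and which has at least one allocatable device.
func findSRIOVResource(allocatable map[string]int64) (string, int64, bool) {
	names := make([]string, 0, len(allocatable))
	for name := range allocatable {
		names = append(names, name)
	}
	sort.Strings(names) // map iteration order is random; sort for a stable pick

	for _, name := range names {
		if strings.Contains(strings.ToLower(name), "sriov") && allocatable[name] > 0 {
			return name, allocatable[name], true
		}
	}
	return "", 0, false
}

func main() {
	allocatable := map[string]int64{
		"cpu":                     16,
		"example.com/sriov_empty": 0,
		"example.com/sriov_net":   4,
	}
	name, count, ok := findSRIOVResource(allocatable)
	fmt.Println(name, count, ok)
}
```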
Signed-off-by: Francesco Romani <fromani@redhat.com>
This patch extends and completes the previously-added
empty topology manager test for single-NUMA node policy
by adding reporting in the test pod and checking
the resource alignment.
Signed-off-by: Francesco Romani <fromani@redhat.com>
This patch adds all the testing infra and utilities needed
to run e2e topology manager tests. This includes setting up
a guaranteed pod which needs some devices.
The simplest real devices available for the purpose
are the SRIOV devices, hence we use them.
This patch pulls the SRIOV device plugin from
the official, yet external, repository.
We do it as closely as possible to how it is done for the NVIDIA GPU plugin.
This patch also performs minor refactoring for some
test framework utilities, needed to support the new
e2e tests.
Finally, we add an empty e2e topology manager test,
to be completed by the next patch.
Signed-off-by: Francesco Romani <fromani@redhat.com>
This is the initial commit for E2E testing for Topology
Manager.
For now, run a subset of the CPU Manager tests.
Additional tests will be forthcoming.
Signed-off-by: vpickard <vpickard@redhat.com>