This is used across multiple tests, so let's move into the util file.
Also, refactor it a bit to provide a better error message in case of a
failure.
Signed-off-by: David Porter <david@porter.me>
All code must use the context from Ginkgo when doing API calls or polling for a
change, otherwise the code would not return immediately when the test gets
aborted.
ginkgo.DeferCleanup has multiple advantages:
- The cleanup operation can get registered if and only if needed.
- No need to return a cleanup function that the caller must invoke.
- Automatically determines whether a context is needed, which will
simplify the introduction of context parameters.
- Ginkgo's timeline shows when it executes the cleanup operation.
Every ginkgo callback should return immediately when a timeout occurs or the
test run manually gets aborted with CTRL-C. To do that, they must take a ctx
parameter and pass it through to all code which might block.
This is a first automated step towards that: the additional parameter got added
with
sed -i 's/\(framework.ConformanceIt\|ginkgo.It\)\(.*\)func() {$/\1\2func(ctx context.Context) {/' \
$(git grep -l -e framework.ConformanceIt -e ginkgo.It )
$GOPATH/bin/goimports -w $(git status | grep modified: | sed -e 's/.* //')
log_test.go was left unchanged.
- update all the import statements
- run hack/pin-dependency.sh to change pinned dependency versions
- run hack/update-vendor.sh to update go.mod files and the vendor directory
- update the method signatures for custom reporters
Signed-off-by: Dave Chen <dave.chen@arm.com>
A cpu/topology manager e2e test wants to require one exclusive CPU
and a share of CPU time; let's round up the allocatable CPU requirements
(from 1 to 2) to reduce the chances of false negatives.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Even though CI machines _usually_ have at least two cpus,
let's rather not assume this holds true, and let's actually
check the allocatable CPUs, skipping even the simplest
tests if the assumption is broken, to avoid false negatives.
Signed-off-by: Francesco Romani <fromani@redhat.com>
The existing cpu/topology manager tests correctly check for the
node resources and skip if the detected resources are not enough
to run the tests, to avoid false negatives.
Unfortunately they do the check against the node capacity, while
the correct approach is to check the allocatable resources.
The existing check is correct only on a narrow set of cases;
otherwise can still lead to false negatives.
This PR fixes that.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Make sure to log out the cpu capacity and allocatable for
the node running the tests, to make the troubleshooting
of test failures easier.
Signed-off-by: Francesco Romani <fromani@redhat.com>
This patch makes the CRI `v1` API the new project-wide default version.
To allow backwards compatibility, a fallback to `v1alpha2` has been added
as well. This fallback can either used by automatically determined by
the kubelet.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Add a e2e test to exercise the checkpoint recovery flow.
This means we need to actually create a old (V1, pre-1.20) checkpoint,
but if we do it only in the e2e test, it's still fine.
Signed-off-by: Francesco Romani <fromani@redhat.com>
The Topology Manager e2e tests wants to run on real multi-NUMA system
and want to consume real devices supported by device plugins; SRIOV
devices happen to be the most commonly available of such devices.
CI machines aren't multi NUMA nor expose SRIOV devices, so the biggest portion
of the tests will just skip, and we need to keep it like this until we
figure out how to enable these features.
However, some organizations can and want to run the testsuite on bare metal;
in this case, the current test will skip (not fail) with misconfigured
boxes, and this reports a misleading result. It will be much better to
fail if the test preconditions aren't met.
To satisfy both needs, we add an option, controlled by an environment
variable, to fail (not skip) if the machine on which the test run
doesn't meet the expectations (multi-NUMA, 4+ cores per NUMA cell,
expose SRIOV VFs).
We keep the old behaviour as default to keep being CI friendly.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Add e2e tests to cover the basic flows for the `full-pcpus-only` option:
negative flow to ensure rejection with proper error message, and
positive flow to verify the actual cpu allocation.
Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
Previously the code used to delete pods serially.
In this patch we factor out code to do that in parallel,
using goroutines.
This shaves some time in the e2e tm test run with no intended
changes in behaviour.
Signed-off-by: Francesco Romani <fromani@redhat.com>
The Topology Manager e2e tests wants to run on real multi-NUMA system
and want to consume real devices supported by device plugins; SRIOV
devices happen to be the most commonly available of such devices.
The tests need to wait for resource availability before to actually
run the tests, or they will fail with a false negative, also relatively
hard to debug.
An optimization was added in commit 56106439cf to minimize the restarts,
speed up the execution and make a nasty, yet not fully understood, flake
with SRIOV device plugin much less likely.
Unfortunately the pod-scope tests were mistakenly left over.
This Patch fixes that.
CI lanes did NOT fail (and will not fail) because the CI machines aren't
multi NUMA nor expose SRIOV devices, so the relevant portion of the test
will just skip, avoiding the issue.
However, this resurfaces when running the testsuite on bare metal; this
is how we noticed.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Document why teardownSRIOVPod has to wait for all the containers
to be gone before to end, and why is important.
Additionally, change the code to wait for all the containers to be gone,
not just the first. This is both a little cleaner and a little safer,
even though it seems the current code caused no issues so far.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Under the CPU manager and topology manager e2e tests possible the situation
when one of steps under the test will fail and it will not clean the CPU manager
state file. Move the deletion of the state file to `AfterEach` to guarantee that
the state file will be always removed from the node.
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
The e2e topology manager want to test the resource alignment using
devices, and the easiest devices to use are the SRIOV devices at this
moment.
The resource alignment test cases are run for each supported policies,
in a loop.
The tests manage the SRIOV device plugin; up until now, the plugin
was set up and tore down at each loop.
There is no real need for that. Each loop must reconfigure (thus
restart) the kubelet, but the device plugin can set up and tore down
just once for all the policies, thus once.
The kubelet can reconnect just fine to a running device plugin.
This way, we greatly reduce the interactions and the complexity of the
test environment, making it easier to understand and more robust, and
we trim down some minutes from execution time.
However, this patch also hides (not solves) a test flake we observed
on some environment. The issue is hardly reproduceable and not well
understood, but seems caused by doing the sriov dp setup/teardown
in each policy testing loop.
Investigation so far suggests that the kubelet sometimes have a stale
state after the sriovdp teardown/setup cycle, leading to flakes and
false negatives.
We tried to address this in https://github.com/kubernetes/kubernetes/pull/95611
with no conclusive results yet.
This patch was posted because overall we believe this patch gains
exceeds the drawbacks (hiding the aforementioned flake) and
because understanding the potential interaction issues between the
sriovdp and the kubelet deserve a separate test.
Signed-off-by: Francesco Romani <fromani@redhat.com>
A suite of e2e tests was created for Topology Manager
so as to test pod scope alignment feature.
Co-authored-by: Pawel Rapacz <p.rapacz@partner.samsung.com>
Co-authored-by: Krzysztof Wiatrzyk <k.wiatrzyk@samsung.com>
Signed-off-by: Cezary Zukowski <c.zukowski@samsung.com>
Due to a rebase glitch the fmt.Sprintf() was lost.
This patches restores it improving the logs readability.
Signed-off-by: Francesco Romani <fromani@redhat.com>
We need to make sure we tear down the sriov device plugin pod
should the tests fail, to avoid leaking pods in the test environment.
Signed-off-by: Francesco Romani <fromani@redhat.com>
This drops testfiles.ReadOrDie and updated testfiles.Exists to return an
error, forcing the caller to decide whether to call framework.Fail or do
something else.
It makes for a slightly less friendly API, but also means the package is
decoupled from framework again, as per the comments at the top of the
file
Make sure the SR-IOV device plugin is ready, and that
there are enough SR-IOV devices allocatable before
spinning up test pods.
Signed-off-by: vpickard <vpickard@redhat.com>
Due to an oversight, the e2e topology manager tests
were leaking a configmap and a serviceaccount.
This patch ensures a proper cleanup
Signed-off-by: Francesco Romani <fromani@redhat.com>