59 Commits

Author SHA1 Message Date
Sergiusz Urbaniak
1495c9f2cd
test/e2e/*: default existing tests to privileged pod security policy
This is to ensure that all existing tests don't break when defaulting
the pod security policy to restricted in the e2e test framework.
2022-04-05 08:41:12 +02:00
ahrtr
fe95aa614c io/ioutil has already been deprecated in golang 1.16, so replace all ioutil with io and os 2022-02-03 05:32:12 +08:00
Francesco Romani
7004a718d9 e2e: node: {cpu,topo}mgr: round up test requirement
A cpu/topology manager e2e test wants to require one exclusive CPU
and a share of CPU time; let's round up the allocatable CPU requirements
(from 1 to 2) to reduce the chances of false negatives.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2022-02-02 15:17:09 +01:00
Francesco Romani
c92d9f7974 e2e: node: {cpu,topo}mgr: don't assume cpu capacity >= 2
Even though CI machines _usually_ have at least two cpus,
let's rather not assume this holds true, and let's actually
check the allocatable CPUs, skipping even the simplest
tests if the assumption is broken, to avoid false negatives.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2022-02-02 15:17:05 +01:00
Francesco Romani
2d1503dae3 e2e: node: {cpu,topo}mgr: make logic on allocatable
The existing cpu/topology manager tests correctly check for the
node resources and skip if the detected resources are not enough
to run the tests, to avoid false negatives.

Unfortunately they do the check against the node capacity, while
the correct approach is to check the allocatable resources.
The existing check is correct only in a narrow set of cases;
otherwise it can still lead to false negatives.

This PR fixes that.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2022-02-02 14:10:46 +01:00
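The difference between checking capacity and checking allocatable, as described in the commit above, can be sketched as follows. This is only an illustration, not the code from the commit: the `nodeCPUs` type and the `shouldSkip` helper are hypothetical, and plain integers stand in for Kubernetes resource quantities.

```go
package main

import "fmt"

// nodeCPUs is a hypothetical stand-in for the CPU figures a node reports.
type nodeCPUs struct {
	Capacity    int64 // total CPUs on the machine
	Allocatable int64 // CPUs left for pods after system and kube reservations
}

// shouldSkip reports whether a test that needs `required` CPUs should be
// skipped. It deliberately looks at Allocatable, not Capacity.
func shouldSkip(n nodeCPUs, required int64) bool {
	return n.Allocatable < required
}

func main() {
	n := nodeCPUs{Capacity: 4, Allocatable: 3}
	// A capacity-based check would let a 4-CPU test run on this node,
	// while the allocatable-based check skips it.
	fmt.Println("capacity check passes:", n.Capacity >= 4)
	fmt.Println("skip based on allocatable:", shouldSkip(n, 4))
}
```

With reserved CPUs the two values differ, which is the case where a capacity-based check lets a test run that cannot succeed.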
Francesco Romani
60585da68f e2e: node: {cpu,topo}mgr: report node capacity/allocatable
Make sure to log the CPU capacity and allocatable for
the node running the tests, to make troubleshooting
test failures easier.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2022-02-02 14:10:44 +01:00
Sascha Grunert
de37b9d293
Make CRI v1 the default and allow a fallback to v1alpha2
This patch makes the CRI `v1` API the new project-wide default version.
To allow backwards compatibility, a fallback to `v1alpha2` has been added
as well. This fallback is automatically determined by
the kubelet.

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2021-11-17 11:05:05 -08:00
Artyom Lukianov
117141eee3 e2e_node: fix tests after Kubelet dynamic configuration removal
- CPU manager
- Memory Manager
- Topology Manager

Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
2021-11-08 09:42:24 +02:00
Artyom Lukianov
50fdcdfc59 e2e_node: refactor code to use a single method to update the kubelet config
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
2021-11-04 15:44:35 +02:00
Artyom Lukianov
b6211657bf e2e_node: drop usage of DynamicKubeletConfig
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
2021-11-04 15:26:19 +02:00
Francesco Romani
b382b6cd0a node: e2e: add test for the checkpoint recovery
Add an e2e test to exercise the checkpoint recovery flow.
This means we need to actually create an old (V1, pre-1.20) checkpoint,
but if we do it only in the e2e test, it's still fine.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-10-26 09:55:11 +02:00
Francesco Romani
54c7d8fbb1 e2e: TM: add option to fail instead of skip
The Topology Manager e2e tests want to run on a real multi-NUMA system
and to consume real devices supported by device plugins; SRIOV
devices happen to be the most commonly available of such devices.

CI machines aren't multi NUMA nor expose SRIOV devices, so the biggest portion
of the tests will just skip, and we need to keep it like this until we
figure out how to enable these features.

However, some organizations can and want to run the testsuite on bare metal;
in this case, the current test will skip (not fail) with misconfigured
boxes, and this reports a misleading result. It will be much better to
fail if the test preconditions aren't met.

To satisfy both needs, we add an option, controlled by an environment
variable, to fail (not skip) if the machine on which the test runs
doesn't meet the expectations (multi-NUMA, 4+ cores per NUMA cell,
exposes SRIOV VFs).
We keep the old behaviour as default to keep being CI friendly.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-09-13 13:23:36 +02:00
Francesco Romani
a2fb8b0039 smtalign: e2e: add tests
Add e2e tests to cover the basic flows for the `full-pcpus-only` option:
negative flow to ensure rejection with proper error message, and
positive flow to verify the actual cpu allocation.

Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-07-08 23:15:37 +02:00
Kubernetes Prow Robot
15a60d1a19
Merge pull request #100180 from fromanirh/tm-e2e-fix-wait
e2e: TM: wait for SRIOV devices in pod scope tests
2021-06-23 11:42:10 -07:00
Francesco Romani
fc0955c26a e2e: topomgr: use deletePodSync for faster delete
Previously the code used to delete pods serially.
In this patch we factor out code to do that in parallel,
using goroutines.

This shaves some time in the e2e tm test run with no intended
changes in behaviour.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-03-18 13:28:16 +01:00
Francesco Romani
04f091790e e2e: TM: wait for SRIOV devices in pod scope tests
The Topology Manager e2e tests want to run on a real multi-NUMA system
and to consume real devices supported by device plugins; SRIOV
devices happen to be the most commonly available of such devices.

The tests need to wait for resource availability before actually
running, or they will fail with a false negative that is also relatively
hard to debug.

An optimization was added in commit 56106439cf to minimize the restarts,
speed up the execution and make a nasty, yet not fully understood, flake
with SRIOV device plugin much less likely.

Unfortunately the pod-scope tests were mistakenly left out.
This patch fixes that.
CI lanes did NOT fail (and will not fail) because the CI machines aren't
multi NUMA nor expose SRIOV devices, so the relevant portion of the test
will just skip, avoiding the issue.

However, this resurfaces when running the testsuite on bare metal; this
is how we noticed.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-03-12 11:01:56 +01:00
Francesco Romani
16d5ac3689 node: e2e: docs and fix for teardownSRIOVConfig
Document why teardownSRIOVPod has to wait for all the containers
to be gone before returning, and why this is important.

Additionally, change the code to wait for all the containers to be gone,
not just the first. This is both a little cleaner and a little safer,
even though it seems the current code caused no issues so far.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-03-09 13:13:36 +01:00
Francesco Romani
4e7434028c e2e: node: bootstrap podresources tests
Start e2e tests for the existing List() API.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-03-09 13:13:35 +01:00
Kubernetes Prow Robot
bd902db13d
Merge pull request #98342 from cynepco3hahue/e2e_move_delete_state_file_to_after_each
e2e: move deleteState file to the AfterEach
2021-02-24 11:10:50 -08:00
pacoxu
a10bdfed09 fix all keps links 404 for kep folder migration
Signed-off-by: pacoxu <paco.xu@daocloud.io>
2021-02-01 19:41:59 +08:00
Artyom Lukianov
97ac255513 e2e: move deleteState file to the AfterEach
In the CPU manager and topology manager e2e tests, a step of the test
can fail and leave the CPU manager state file behind.
Move the deletion of the state file to `AfterEach` to guarantee that
the state file is always removed from the node.

Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
2021-01-26 20:34:17 +02:00
Francesco Romani
56106439cf node: e2e: bring up/down SRIOV DP just once
The e2e topology manager tests want to test the resource alignment using
devices, and the easiest devices to use at the moment are SRIOV
devices.

The resource alignment test cases are run for each supported policy,
in a loop.

The tests manage the SRIOV device plugin; up until now, the plugin
was set up and torn down at each loop.
There is no real need for that. Each loop must reconfigure (thus
restart) the kubelet, but the device plugin can be set up and torn down
just once for all the policies.
The kubelet can reconnect just fine to a running device plugin.

This way, we greatly reduce the interactions and the complexity of the
test environment, making it easier to understand and more robust, and
we trim down some minutes from execution time.

However, this patch also hides (not solves) a test flake we observed
on some environments. The issue is hard to reproduce and not well
understood, but seems to be caused by doing the sriov dp setup/teardown
in each policy testing loop.
Investigation so far suggests that the kubelet sometimes has stale
state after the sriovdp teardown/setup cycle, leading to flakes and
false negatives.
We tried to address this in https://github.com/kubernetes/kubernetes/pull/95611
with no conclusive results yet.

This patch was posted because overall we believe its gains
exceed the drawbacks (hiding the aforementioned flake) and
because understanding the potential interaction issues between the
sriovdp and the kubelet deserves a separate test.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2020-11-13 10:04:31 +01:00
Pawel Rapacz
16c7bf4db4 Implement e2e tests for pod scope alignment
A suite of e2e tests was created for Topology Manager
to test the pod scope alignment feature.

Co-authored-by: Pawel Rapacz <p.rapacz@partner.samsung.com>
Co-authored-by: Krzysztof Wiatrzyk <k.wiatrzyk@samsung.com>
Signed-off-by: Cezary Zukowski <c.zukowski@samsung.com>
2020-11-12 12:25:55 +01:00
Francesco Romani
82a730f116 e2e: topomgr: fix ginkgo log
Due to a rebase glitch the fmt.Sprintf() was lost.
This patch restores it, improving the readability of the logs.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2020-10-19 19:28:01 +02:00
Francesco Romani
009b5356cb e2e: node: topomgr: avoid plugin leak on test fail
We need to make sure we tear down the sriov device plugin pod
should the tests fail, to avoid leaking pods in the test environment.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2020-10-14 23:01:58 +02:00
Aaron Crickenberger
28768166f5 decouple testfiles from framework
This drops testfiles.ReadOrDie and updates testfiles.Exists to return an
error, forcing the caller to decide whether to call framework.Fail or do
something else.

It makes for a slightly less friendly API, but also means the package is
decoupled from framework again, as per the comments at the top of the
file
2020-06-29 14:54:09 -07:00
David Porter
f5b8c3d746 Mark Topology Manager Test as non-alpha and NodeFeature 2020-05-26 12:10:18 -07:00
drfish
dfab6b637f Update .import-aliases for e2e test framework 2020-03-25 11:40:02 +08:00
Kubernetes Prow Robot
5708511499
Merge pull request #88708 from mikedanese/deleteopts
Migrate clientset metav1.DeleteOpts to pass-by-value
2020-03-05 23:09:23 -08:00
Kubernetes Prow Robot
50dd75f9c5
Merge pull request #88773 from vpickard/e2e-topology-manager-sriovdpReady
e2e-topology-manager: Wait for SR-IOV device plugin
2020-03-05 20:04:38 -08:00
Mike Danese
c58e69ec79 automated refactor 2020-03-05 14:59:46 -08:00
Kubernetes Prow Robot
1f2e1967d1
Merge pull request #88566 from Deepthidharwar/topology-mgr-numa-tests
Enable running cpu-mgr-multiNUMA e2e tests with Topology manager
2020-03-05 05:38:37 -08:00
vpickard
61565b3f6c e2e-topology-manager: Wait for SR-IOV device plugin
Make sure the SR-IOV device plugin is ready, and that
there are enough SR-IOV devices allocatable before
spinning up test pods.

Signed-off-by: vpickard <vpickard@redhat.com>
2020-03-04 10:07:35 -05:00
Deepthi Dharwar
1ede096465 Enable topology-manager-e2e tests to run on MultiNUMA nodes.
Signed-off-by: Deepthi Dharwar <ddharwar@redhat.com>
2020-03-02 22:36:43 +05:30
Deepthi Dharwar
a4b59a5d7c Currently the SRIOV detection logic reports an error if it fails to detect an SRIOV device
on the system. This patch aims to fix that.

Signed-off-by: Deepthi Dharwar <ddharwar@redhat.com>
2020-03-02 19:31:37 +05:30
Francesco Romani
64904d0ab8 e2e: topomgr: extend tests to all the policies
Per https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/0035-20190130-topology-manager.md#multi-numa-systems-tests
we validate only the results for the single-numa-node policy,
because there is no simple and reliable way to validate
the allocation performed by the other policies.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2020-02-20 18:22:34 +01:00
Francesco Romani
a249b93687 e2e: topomgr: address reviewer comments
Signed-off-by: Francesco Romani <fromani@redhat.com>
2020-02-20 10:31:09 +01:00
Francesco Romani
833519f80b e2e: topomgr: properly clean up after completion
Due to an oversight, the e2e topology manager tests
were leaking a configmap and a serviceaccount.
This patch ensures a proper cleanup

Signed-off-by: Francesco Romani <fromani@redhat.com>
2020-02-19 17:15:42 +01:00
Francesco Romani
7c12251c7a e2e: topomgr: add multi-container tests
Add tests to check alignment of pods which contains more than one
container.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2020-02-19 17:15:42 +01:00
Francesco Romani
8e9d76f1b9 e2e: topomgr: validate all containers in pod
Up until now, the test validated the alignment of resources
only in the first container in a pod. That was just an oversight.
With this patch, we validate all the containers in a given pod.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2020-02-19 17:15:42 +01:00
Francesco Romani
ddc18eae67 e2e: topomgr: autodetect NUMA position of VF devs
Add autodetection code to figure out which NUMA node
the devices are attached to.
This autodetection works under the assumption that all the VFs in
the system must be used for the tests.
Should this not be the case, or in general to handle non-trivial
configurations, we keep the annotations mechanism added to the
SRIOV device plugin config map.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2020-02-19 17:15:42 +01:00
Francesco Romani
bb6beb99e5 e2e: topomgr: early check to detect VFs, not PFs
The e2e_node topology_manager tests have an early, quick check
to rule out systems without SRIOV devices, thus skipping the tests.

The first version of the check detected PFs (Physical Functions),
under the assumption that VFs (Virtual Functions) had already been
created. This works because, obviously, you can't have VFs without PFs.

However, it's a little safer and easier to understand if we check
directly for VFs, bailing out on systems which don't provide them.

Nothing changes for properly configured test systems.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2020-02-19 17:15:41 +01:00
Francesco Romani
70cce5e3f1 e2e: topomgr: introduce sriov setup/teardown funcs
Reorganize the code with setup and teardown functions,
to make room for the future addition of more device plugin
support, and to make the code a bit tidier.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2020-02-10 22:47:54 +01:00
Francesco Romani
2f0a6d2c76 e2e: topomgr: use constants for test limits
Signed-off-by: Francesco Romani <fromani@redhat.com>
2020-02-10 22:47:54 +01:00
Francesco Romani
fee1dba054 e2e: topomgr: improve the test logs
Clarify which test is doing what, to make
the test output easier to understand.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2020-02-10 22:47:54 +01:00
Francesco Romani
83c344647f e2e: topomgr: better check for AffinityError
Add a helper function to check if a Pod failed
admission for Topology Affinity Error.
So far we only check the Status.Reason.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2020-02-10 22:47:54 +01:00
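A helper of the kind described in the commit above might look like the sketch below. The `podStatus` type is a hypothetical stand-in for the real pod status object, and the reason string `TopologyAffinityError` is an assumption about what the kubelet reports; the commit message itself only says that the check looks at `Status.Reason`.

```go
package main

import "fmt"

// podStatus is a hypothetical, trimmed-down stand-in for a pod's status.
type podStatus struct {
	Phase  string
	Reason string
}

// topologyAffinityErrorReason is assumed here; verify it against the kubelet
// sources before relying on it.
const topologyAffinityErrorReason = "TopologyAffinityError"

// isTopologyAffinityError reports whether the pod was rejected at admission
// because of a topology affinity error. Only Reason is inspected.
func isTopologyAffinityError(s podStatus) bool {
	return s.Reason == topologyAffinityErrorReason
}

func main() {
	rejected := podStatus{Phase: "Failed", Reason: "TopologyAffinityError"}
	other := podStatus{Phase: "Failed", Reason: "OutOfcpu"}
	fmt.Println(isTopologyAffinityError(rejected), isTopologyAffinityError(other))
}
```

Keeping the comparison in one helper means a later change to the check only touches one place.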
Francesco Romani
512a4e8a3e e2e: topomgr: reduce node readiness timeout
Five minutes was initially used only to be overcautious.
From my experiments, the node is usually ready in less than a minute.
Double it to give some buffer space.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2020-02-10 22:47:54 +01:00
Francesco Romani
3b4122bd03 e2e: topomgr: get and use topology hints from conf
To properly implement some e2e tests, we need to know
some basic topology facts about the system running the tests.
The bare minimum we need to know is how many PCI SRIOV devices
are attached to which NUMA node.

This way we know which core we can reserve for kube services,
and which NUMA socket we can take to test full socket reservation.

To let the tests know the PCI device topology, we use annotations
in the SRIOV device plugin ConfigMap we need anyway.
The format is

```yaml
  metadata:
    annotations:
      pcidevice_node0: "2"
      pcidevice_node1: "0"
```

with one annotation per NUMA node in the system.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2020-02-10 22:47:53 +01:00
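Reading the per-NUMA-node annotations described in the commit above could be done along these lines. The `devicesPerNUMANode` function is a hypothetical sketch, not the parser used by the tests; only the `pcidevice_node<N>` key format and the quoted integer values come from the commit message.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// annotationPrefix matches the key format shown in the commit message.
const annotationPrefix = "pcidevice_node"

// devicesPerNUMANode maps each NUMA node index to the number of PCI devices
// declared for it. Annotations with other keys are ignored.
func devicesPerNUMANode(annotations map[string]string) (map[int]int, error) {
	counts := make(map[int]int)
	for key, value := range annotations {
		if !strings.HasPrefix(key, annotationPrefix) {
			continue // unrelated annotation
		}
		node, err := strconv.Atoi(strings.TrimPrefix(key, annotationPrefix))
		if err != nil {
			return nil, fmt.Errorf("bad NUMA node index in %q: %w", key, err)
		}
		count, err := strconv.Atoi(value)
		if err != nil {
			return nil, fmt.Errorf("bad device count for %q: %w", key, err)
		}
		counts[node] = count
	}
	return counts, nil
}

func main() {
	counts, err := devicesPerNUMANode(map[string]string{
		"pcidevice_node0": "2",
		"pcidevice_node1": "0",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("node0:", counts[0], "node1:", counts[1])
}
```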
Francesco Romani
d9d652e867 e2e: topomgr: initial negative tests
Negative tests are those where we request a gu (Guaranteed QoS) Pod we know
the system cannot fulfill; hence we expect rejection from the topology manager.

Unfortunately, besides the trivial case of excessive cores (requesting
more cores than a NUMA node provides) we cannot easily test the
devices, because crafting a proper pod requires detailed knowledge
of the hw topology.

Let's consider a hypothetical two-node NUMA system with two PCIe buses,
one per NUMA node, with an SRIOV device on each bus.
A proper negative test would require two SRIOV devices, which the system
can provide but not on the same NUMA node.
Requiring, for example, three devices (one more than the system provides)
will lead to a different, legitimate admission error.

For these reasons we bootstrap the testing infra for the negative tests,
but we add just the simplest one.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2020-02-10 22:47:53 +01:00
Francesco Romani
ee92b4aae0 e2e: topomgr: add more positive tests
This patch builds on the topology manager e2e infrastructure to
add more positive e2e test cases.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2020-02-10 22:47:53 +01:00