The PR https://github.com/kubernetes/kubernetes/pull/100041 updated
node-problem-detector to v0.8.7, but unfortunately we didn't also
update the image used in the e2e_node tests.
As a result, the tests were failing like:
E2eNode Suite: [sig-node] NodeProblemDetector [NodeFeature:NodeProblemDetector] [Serial] SystemLogMonitor should generate node condition and events for corresponding errors
_output/local/go/src/k8s.io/kubernetes/test/e2e_node/node_problem_detector_linux.go:301
Timed out after 60.000s.
Expected success, but got an error:
<*errors.errorString | 0xc0011f2600>: {
s: "expected total number of events was 4, actual events counted was 7\nEvents
This in turn was one of the contributing factors making the
pull-kubernetes-node-kubelet-serial lane fail constantly.
This patch updates the image used in the tests, fixing the failure.
Signed-off-by: Francesco Romani <fromani@redhat.com>
The CPUManager graduated to beta a while ago (k8s 1.10?)
so let's get rid of the obsolete Alpha tag on its e2e tests.
Signed-off-by: Francesco Romani <fromani@redhat.com>
- verify memory manager data returned by `GetAllocatableResources`
- verify pod container memory manager data
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
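For context, this is roughly how a test can query the kubelet
podresources API for the data being verified; a minimal sketch assuming
the default kubelet socket path and the GetV1Client helper from
pkg/kubelet/apis/podresources, not the exact test code:

    package main

    import (
        "context"
        "fmt"
        "time"

        kubeletpodresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
        "k8s.io/kubernetes/pkg/kubelet/apis/podresources"
    )

    func main() {
        // Assumed default podresources socket; adjust for the node under test.
        endpoint := "unix:///var/lib/kubelet/pod-resources/kubelet.sock"
        client, conn, err := podresources.GetV1Client(endpoint, 10*time.Second, 16*1024*1024)
        if err != nil {
            panic(err)
        }
        defer conn.Close()
        // The test compares this response against the expected
        // memory manager allocatable data.
        resp, err := client.GetAllocatableResources(context.TODO(),
            &kubeletpodresourcesv1.AllocatableResourcesRequest{})
        if err != nil {
            panic(err)
        }
        fmt.Printf("allocatable memory blocks: %v\n", resp.GetMemory())
    }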
The apiserver and the test suite in node e2e run under the sshd daemon,
which can limit the number of files they can open. Set a higher limit
to address the issue.
Signed-off-by: Odin Ugedal <odin@uged.al>
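For illustration only, a sketch of the idea, assuming we can simply
raise RLIMIT_NOFILE for the current process before spawning the server
and test processes (the actual patch may set the limit elsewhere, e.g.
in the sshd or service configuration):

    package main

    import (
        "fmt"
        "syscall"
    )

    func main() {
        var rl syscall.Rlimit
        if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
            panic(err)
        }
        // Raise the soft limit up to the hard limit; child processes
        // spawned afterwards inherit it.
        rl.Cur = rl.Max
        if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
            panic(err)
        }
        fmt.Printf("RLIMIT_NOFILE: soft=%d hard=%d\n", rl.Cur, rl.Max)
    }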
Node e2e tests exceeding the global timeout are sent SIGINT, resulting
in no artifacts or console output. This change ignores the first SIGINT;
since all child processes are also stopped by that SIGINT, we can
clean up before exiting.
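A minimal sketch of the signal handling described above, assuming a Go
wrapper around the test run; names and details are illustrative, not
the actual patch:

    package main

    import (
        "fmt"
        "os"
        "os/signal"
        "syscall"
    )

    func main() {
        sigs := make(chan os.Signal, 2)
        signal.Notify(sigs, syscall.SIGINT)
        go func() {
            // First SIGINT: the children received it too and are shutting
            // down, so keep running to collect artifacts and clean up.
            <-sigs
            fmt.Fprintln(os.Stderr, "SIGINT received, cleaning up before exit")
            // Second SIGINT: exit immediately.
            <-sigs
            os.Exit(1)
        }()
        runTestsAndCollectArtifacts() // hypothetical helper for the real work
    }

    func runTestsAndCollectArtifacts() { /* ... */ }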
Make sure to use SIGKILL so that the service is killed in a dirty way.
If the container runtime uses "Restart=on-abnormal" in its systemd unit,
killing with SIGTERM will not restart the service, as the kill looks
intentional and clean. cri-o uses this option by default.
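As an illustration only (the service name here assumes cri-o), killing
the unit dirtily so that Restart=on-abnormal still restarts it might
look like:

    package main

    import "os/exec"

    // killRuntimeService sends SIGKILL to the runtime's systemd unit.
    // SIGKILL counts as abnormal termination, so a unit configured with
    // Restart=on-abnormal is restarted; SIGTERM would look like a clean,
    // intentional stop and the unit would stay down.
    func killRuntimeService() error {
        return exec.Command("systemctl", "kill", "--signal=SIGKILL", "crio").Run()
    }

    func main() {
        if err := killRuntimeService(); err != nil {
            panic(err)
        }
    }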
The current test assumes that the test pod is deleted when the test
namespace is deleted. However, namespace deletion is an asynchronous
operation. The pod may still be running and holding hugepages
resources when the next test case creates another pod that requests
the same hugepages resources. This can cause the kubelet to fail the
test pod with this kind of error:
OutOfhugepages-2Mi: Node didn't have enough resource: hugepages-2Mi
requested: 6291456, used: 6291456, capacity: 10485760
Explicitly deleting the test pod should fix this issue.
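A minimal sketch of the fix, assuming the e2e framework's PodClient
helpers (the helper name and timeout are illustrative):

    package e2enode

    import (
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/kubernetes/test/e2e/framework"
    )

    // releaseHugepagesPod deletes the pod and blocks until it is gone, so
    // its hugepages allocation is returned before the next test case runs.
    func releaseHugepagesPod(f *framework.Framework, podName string) {
        f.PodClient().DeleteSync(podName, metav1.DeleteOptions{}, 2*time.Minute)
    }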
Previously the code used to delete pods serially.
This patch factors out that code to delete the pods in parallel,
using goroutines.
This shaves some time off the e2e topology manager test run, with no
intended change in behaviour.
Signed-off-by: Francesco Romani <fromani@redhat.com>
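The parallel deletion boils down to one goroutine per pod plus a
WaitGroup; a sketch under assumed helper names, not the exact patch:

    package e2enode

    import (
        "sync"

        "k8s.io/kubernetes/test/e2e/framework"
    )

    // deletePodsAsync deletes the given pods in parallel and waits for all
    // deletions to complete before returning.
    func deletePodsAsync(f *framework.Framework, podNames []string) {
        var wg sync.WaitGroup
        for _, name := range podNames {
            wg.Add(1)
            go func(name string) {
                defer wg.Done()
                deletePodSyncByName(f, name) // assumed per-pod synchronous delete helper
            }(name)
        }
        wg.Wait()
    }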
The Topology Manager e2e tests want to run on a real multi-NUMA system
and consume real devices supported by device plugins; SRIOV devices
happen to be the most commonly available such devices.
The tests need to wait for resource availability before actually
running, or they will fail with a false negative that is also
relatively hard to debug.
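Waiting for resource availability can be as simple as polling the
node's allocatable resources until the SRIOV resource shows up; a
sketch with assumed resource name, function name and timeouts:

    package e2enode

    import (
        "context"
        "time"

        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        clientset "k8s.io/client-go/kubernetes"
    )

    // waitForSRIOVResources polls until the given extended resource (e.g.
    // the SRIOV device plugin's resource name) is allocatable on the node.
    func waitForSRIOVResources(ctx context.Context, cli clientset.Interface, nodeName, resName string) error {
        return wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
            node, err := cli.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
            if err != nil {
                return false, err
            }
            qty, ok := node.Status.Allocatable[v1.ResourceName(resName)]
            return ok && !qty.IsZero(), nil
        })
    }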
An optimization was added in commit 56106439cf to minimize the restarts,
speed up the execution, and make a nasty, not yet fully understood flake
with the SRIOV device plugin much less likely.
Unfortunately, the pod-scope tests were mistakenly left out of that
optimization. This patch fixes that.
CI lanes did NOT fail (and will not fail) because the CI machines are
neither multi-NUMA nor do they expose SRIOV devices, so the relevant
portion of the test is just skipped, avoiding the issue.
However, the issue resurfaces when running the test suite on bare
metal; this is how we noticed.
Signed-off-by: Francesco Romani <fromani@redhat.com>