Add some additional init container tests that work via monitoring
container lifetime based on logs written to a common file. This allows
more easily writing assertions about the container lifetimes with
respect to one another.
Bump the timeout as the previous timeout was sometimes too short,
resulting in the pod status update not sent. Also, fixed a typo in
previous refactor.
Signed-off-by: David Porter <david@porter.me>
Breakdown of the steps implemented as part of this e2e test is as follows:
1. Create a file `registration` at path `/var/lib/kubelet/device-plugins/sample/`
2. Create sample device plugin with an environment variable with
`REGISTER_CONTROL_FILE=/var/lib/kubelet/device-plugins/sample/registration` that
waits for a client to delete the control file.
3. Trigger plugin registeration by deleting the abovementioned directory.
4. Create a test pod requesting devices exposed by the device plugin.
5. Stop kubelet.
6. Remove pods using CRI to ensure new pods are created after kubelet restart.
7. Restart kubelet.
8. Wait for the sample device plugin pod to be running. In this case,
the registration is not triggered.
9. Ensure that resource capacity/allocatable exported by the device plugin is zero.
10. The test pod should fail with `UnexpectedAdmissionError`
11. Delete the test pod.
12. Delete the sample device plugin pod.
13. Remove `/var/lib/kubelet/device-plugins/sample/` and its content, the directory
created to control registration
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
This commit reuses e2e tests implmented as part of https://github.com/kubernetes/kubernetes/pull/110729.
The commit is borrowed from the aforementioned PR as is to preserve
authorship. Subsequent commit will update the end to end test to
simulate the problem this PR is trying to solve by reproducing
the issue: 109595.
Co-authored-by: Francesco Romani <fromani@redhat.com>
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
The existing path is incorrect (missing `sample-device-plugin`)
directory and thus causing test failures. The full path should be
`test/e2e/testing-manifests/sample-device-plugin/sample-device-plugin.yaml`.
Signed-off-by: David Porter <david@porter.me>
Add the following ginkgo flags for each node e2e similar to the
existing hack/ginkgo-e2e.sh script.
* --no-color, colors aren't rendered properly in prow and make examining
the log in text editors more difficult, so let's disable them.
`hack/ginkgo-e2e.sh` (used for kind e2e tests) also disables them
already.
* -v, enable verbose logs. This is needed so we get more detailed info
even when the tests pass. This is useful so we can compare successful
runs to failed runs.
Signed-off-by: David Porter <david@porter.me>
When running multiple node e2e with multiple machine images, the tests
are run separately for each node. The final build log has all of the
results for each of the hosts combined together which make debugging the
log difficult. To make it easier, emit a log for each host that was run.
This log will be written to the results directory and uploaded as an
artifact in prow jobs.
Signed-off-by: David Porter <david@porter.me>
This was never being used, the only config that used it was deleted in
https://github.com/kubernetes/test-infra/pull/26017 so we don't need
this anymore, so let's delete it.
Signed-off-by: David Porter <david@porter.me>
All of these issues were reported by https://github.com/nunnatsa/ginkgolinter.
Fixing these issues is useful (several expressions get simpler, using
framework.ExpectNoError is better because it has additional support for
failures) and a necessary step for enabling that linter in our golangci-lint
invocation.
Fix the waiting logic in the e2e test loop to wait
for resources to be reported again instead of making logic on the
timestamp. The idea is that waiting for resource availability
is the canonical way clients should observe the desired state,
and it should also be more robust than comparing timestamps,
especially on CI environments.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Start to consolidate the sample device plugin utility
and constants in a central place, because we need
to use it in different e2e tests.
Having a central dependency is better than a maze of
entangled e2e tests depending on each other helpers.
Signed-off-by: Francesco Romani <fromani@redhat.com>
The podresources e2e tests want to exercise the case on which a device
plugin doesn't report topology affinity. The only known device plugin
which had this requirement and didn't depend on specialized hardware
was the kubevirt device plugin, which was however deprecated after
we started using it.
So the e2e tests are now broken, and in any case they can't depend on
unmaintained and liable to be obsolete code.
To unblock the state and preserve some e2e signal, we switch to the
sample device plugin, which is a stub implementation and which is
managed in-tree, so we can maintain it and ensure it fits the e2e test
usecase.
This is however a regression, because e2e tests should try their hardest
to use real devices and avoid any mocking or faking.
The upside is that using a OS-neutral device plugin for the tests enables
us to run on all the supported platform (windows!) so this could allow
us to transition these tests to conformance.
Signed-off-by: Francesco Romani <fromani@redhat.com>
rename getPodResources for clarity. Allow to return error
(and not use ginkgo expectations), so it can actually be used
as intended inside `Eventually` blocks without blow up at the
first failure.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Since the pod names are reused across the test, searching the logs is
currently difficult.
Use a uuid for each pod name to make grepping the logs easier. Also,
always include the pod name and pod namespace in any logs or error
messages to make debugging easier.
Signed-off-by: David Porter <david@porter.me>
With the change of the CRI-O jobs to use butane, we now have a
verification for base64 data urls in place. This means that the
following URL is invalid:
```
data:text/plain;base64,GCE_SSH_PUBLIC_KEY_FILE_CONTENT
```
This means we have to pass valid base64 to the URL. To fix that, we now
allow to inject SSH key values with both, the
`GCE_SSH_PUBLIC_KEY_FILE_CONTENT` field and its base64 encoded variant.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
The recently introduced failure handling in ExpectNoError depends on error
wrapping: if an error prefix gets added with `fmt.Errorf("foo: %v", err)`, then
ExpectNoError cannot detect that the root cause is an assertion failure and
then will add another useless "unexpected error" prefix and will not dump the
additional failure information (currently the backtrace inside the E2E
framework).
Instead of manually deciding on a case-by-case basis where %w is needed, all
error wrapping was updated automatically with
sed -i "s/fmt.Errorf\(.*\): '*\(%s\|%v\)'*\",\(.* err)\)/fmt.Errorf\1: %w\",\3/" $(git grep -l 'fmt.Errorf' test/e2e*)
This may be unnecessary in some cases, but it's not wrong.
WaitForPodToDisappear was always called such that it listed all pods, which
made it less efficient than trying to get just the one pod it was checking for.
Being able to customize the poll interval in practice wasn't useful, therefore
it can be replaced with WaitForPodNotFoundInNamespace.
Unknown pods are pods which are unknown pods to the kubelet, but are still
running in the container runtime. If kubelet detects a pod which is not in
the config (i.e. not present in API-server or static pod), but running as
detected in container runtime, kubelet should aggressively terminate the pod.
This situation can be encountered if a pod is running, then kubelet is
stopped, and while stopped, the manifest is deleted (by force deleting the
API pod or deleting the static pod manifest), and then restarting the
kubelet. Upon restart, kubelet will see the pod as running via the container
runtime, but it will not be present in the config, thus making the pod a
"unknown pod". Kubelet should then proceed to terminate these unknown pods.
Add two tests that ensure that unknown pods will be terminated (1)
static pods and (2) API pods. The test will start a pod, stop the
kubelet, force delete the pod (by deleting the manifest or force
deleting the pod), and then restarting the kubelet. The container
runtime is then queried to ensure the containers are terminated by
kubelet.
Signed-off-by: David Porter <david@porter.me>
This is used across multiple tests, so let's move into the util file.
Also, refactor it a bit to provide a better error message in case of a
failure.
Signed-off-by: David Porter <david@porter.me>
The device plugin test expects that no other pods are running prior to
the test starting. However, it has been observed that in some cases
some resources may still be around from previous tests. This is because
the deletion of resources from other tests is handled by deleting that
test's framework's namespace which is done asynchronously without
waiting for the other test's namespace to be deleted.
As a result, when the node e2e device plugin starts, there may still be
other pods in process of termination. To work around this, add a retry
to the device plugin test to account for the time it takes to delete the
resources from the prior test.
Signed-off-by: David Porter <david@porter.me>
The recently introduced failure handling in ExpectNoError depends on error
wrapping: if an error prefix gets added with `fmt.Errorf("foo: %v", err)`, then
ExpectNoError cannot detect that the root cause is an assertion failure and
then will add another useless "unexpected error" prefix and will not dump the
additional failure information (currently the backtrace inside the E2E
framework).
Instead of manually deciding on a case-by-case basis where %w is needed, all
error wrapping was updated automatically with
sed -i "s/fmt.Errorf\(.*\): '*\(%s\|%v\)'*\",\(.* err)\)/fmt.Errorf\1: %w\",\3/" $(git grep -l 'fmt.Errorf' test/e2e*)
This may be unnecessary in some cases, but it's not wrong.
WaitForPodToDisappear was always called such that it listed all pods, which
made it less efficient than trying to get just the one pod it was checking for.
Being able to customize the poll interval in practice wasn't useful, therefore
it can be replaced with WaitForPodNotFoundInNamespace.
There are two runtime class tests which required the container runtime
config to include explicit configuration for `test-handler`. The current
logic skips these tests in non GCE environments. This skip is too strict
since the test is skipped in node e2e environments and in other
environments such as kind, which support running the test and also
configure `test-handler`.
Instead of skipping based on provider, add a new function
`NodeSupportsPreconfiguredRuntimeClassHandler` which examines the
underlying container runtime config and checks if the config includes
`test-handler`. The check is a bit brittle since it assumes container
runtime config paths, but it is a net improvement over skipping the test
entirely on non GCE environments.
This results in the test working in the common test environments, namely
GCE kube-up, node e2e, and kind.
Signed-off-by: David Porter <david@porter.me>