Add a node e2e to verify that if a static pod is terminated while the
container runtime or CRI returns an error, the pod is eventually
terminated successfully.
This test serves as a regression test for k8s.io/issue/113145 which
fixes an issue where force deleted pods may not be terminated if the
container runtime fails during a `syncTerminatingPod`.
To test this behavior, start a static pod, stop the container runtime,
and later start the container runtime. The static pod is expected to
eventually terminate successfully.
To start and stop the container runtime, we need to find the container
runtime systemd unit name. Introduce a util function
`findContainerRuntimeServiceName` which finds the unit name by getting
the pid of the container runtime from the existing
`ContainerRuntimeProcessName` flag passed into node e2e and using
systemd dbus `GetUnitNameByPID` function to convert the pid of the
container runtime to a unit name. Using the unit name, introduce helper
functions to start and stop the container runtime.
Signed-off-by: David Porter <david@porter.me>
The recently introduced failure handling in ExpectNoError depends on error
wrapping: if an error prefix gets added with `fmt.Errorf("foo: %v", err)`, then
ExpectNoError cannot detect that the root cause is an assertion failure and
then will add another useless "unexpected error" prefix and will not dump the
additional failure information (currently the backtrace inside the E2E
framework).
Instead of manually deciding on a case-by-case basis where %w is needed, all
error wrapping was updated automatically with
sed -i "s/fmt.Errorf\(.*\): '*\(%s\|%v\)'*\",\(.* err)\)/fmt.Errorf\1: %w\",\3/" $(git grep -l 'fmt.Errorf' test/e2e*)
This may be unnecessary in some cases, but it's not wrong.
This is used across multiple tests, so let's move into the util file.
Also, refactor it a bit to provide a better error message in case of a
failure.
Signed-off-by: David Porter <david@porter.me>
The recently introduced failure handling in ExpectNoError depends on error
wrapping: if an error prefix gets added with `fmt.Errorf("foo: %v", err)`, then
ExpectNoError cannot detect that the root cause is an assertion failure and
then will add another useless "unexpected error" prefix and will not dump the
additional failure information (currently the backtrace inside the E2E
framework).
Instead of manually deciding on a case-by-case basis where %w is needed, all
error wrapping was updated automatically with
sed -i "s/fmt.Errorf\(.*\): '*\(%s\|%v\)'*\",\(.* err)\)/fmt.Errorf\1: %w\",\3/" $(git grep -l 'fmt.Errorf' test/e2e*)
This may be unnecessary in some cases, but it's not wrong.
All code must use the context from Ginkgo when doing API calls or polling for a
change, otherwise the code would not return immediately when the test gets
aborted.
- update all the import statements
- run hack/pin-dependency.sh to change pinned dependency versions
- run hack/update-vendor.sh to update go.mod files and the vendor directory
- update the method signatures for custom reporters
Signed-off-by: Dave Chen <dave.chen@arm.com>
In the AfterEach check of the e2e node device plugin tests,
the tests want really bad to clean up after themselves:
- delete the sample device plugin
- restart again the kubelet
- ensure that after the restart, no stale sample devices
(provided by the sample device plugin) are reported anymore.
We observed that in the AfterEach block of these e2e tests
we have quite reliably a flip/flop of the kubelet readiness
state, possibly related to a race with/ a slow runtime/PLEG check.
What happens is that the kubelet readiness state is true,
but goes false for a quick interval and then goes true again
and it's pretty stable after that (observed adding more logs
to the check loop).
The key factor here is the function `getLocalNode` aborts the
test (as in `framework.ExpectNoError`) if the node state is
not ready. So any occurrence of this scenario, even if it
is transient, will cause a test failure. I believe this will
make the e2e test unnecessarily fragile without making it more
correct.
For the purpose of the test we can tolerate this kind of glitches,
with kubelet flip/flopping the ready state, granted that we meet
eventually the final desired condition on which the node reports
ready AND reports no sample devices present - which was the condition
the code was trying to check.
So, we add a variant of `getLocalNode`, which just fetches the
node object the e2e_node framework created, alongside to a flag
reporting the node readiness. The new helper does not make
implicitly the test abort if the node is not ready, just bubbles
up this information.
Signed-off-by: Francesco Romani <fromani@redhat.com>
On IPv6 clusters, one of the most frequent problems I encounter is
assumptions that one can build a URL with a host and port simply by
using Sprintf, like this:
```go
fmt.Sprintf("http://%s:%d/foo", host, port)
```
When `host` is an IPv6 address, this produces an invalid URL as it must
be bracketed, like this:
```
http://[2001:4860:4860::8888]:9443
```
This change fixes the occurences of joining a host and port with the
purpose built `net.JoinHostPort` function.
I encounter this problem often enough that I started to [write a linter
for it](https://github.com/stbenjam/go-sprintf-host-port). I don't
think the linter is quite ready for wide use yet, but I did run it
against the Kube codebase and found these. While the host portion in
some of these changes may always be an FQDN or IPv4 IP today, it's an
easy thing that can break later on.
The readonly port could be disabled.
Since we are only using the /healthz endpoint,
we can use the healthz port for this.
Change-Id: Ie0e05a5ab4ec6f51e4d3c63226aa23c1b3a69956
Add a e2e test to exercise the checkpoint recovery flow.
This means we need to actually create a old (V1, pre-1.20) checkpoint,
but if we do it only in the e2e test, it's still fine.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Each e2e test knows it wants to restart a running kubelet or a
non-running kubelet. The vast majority of times, we want to
restart a running kubelet (e.g. to change config or to check
some properties hold across kubelet crashes/restarts), but sometimes
we stop the kubelet, do some actions and only then restart.
To accomodate both use cases, we just expose the `running` boolean
flag to the e2e tests.
Having the `restartKubelet` explicitly restarting a running kubelet
helps us to trobuleshoot e2e failures on which the kubelet
was supposed to be running, while it was not; attempting a restart
in such cases only murkied the waters further, making the
troubleshooting and the eventual fix harder.
In the happy path, no expected change in behaviour.
Signed-off-by: Francesco Romani <fromani@redhat.com>
In the `restartKubelet` helper, we use `exec.Command`, whose
return value is the output as the command, but as `[]byte`.
The way we logged the output of the command was as value, making
the output, meant to be human readable, unnecessarily hard to read.
We fix this annoying behaviour converting the output to string before
to log it out, making pretty obvious to understand the outcome of
the command.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Even DynamicKubeletConfig is deprecated it still used in e2e_node test.
The bug is hidden by forcibly enabled option
TEST_ARGS='--feature-gates=DynamicKubeletConfig=true'
if this option is not enabled setKubeletConfiguration tries to set
kubelet config via apiserver interface and failed with timeout.
Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>
The server service monitors the kubelet service and restart it
once the service is down, to avoid kubelet double restarting
we will stop the kubelet service and wait until the kubelet will be
restarted and the node will be ready.
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
microk8s run kubelet service as `snap.microk8s.daemon-kubelet.service`, instead of `kubelet.service`.
so e2e should use `systemctl list-units *kubelet* --state=running` to find out kubelet service of microk8s.
Under e2e tests possible the situation when we restart the kubelet
number of times in the short time frame. When it happens the systemd
can fail the service restart with the `Failed with result 'start-limit-hit'.`
error.
To avoid this situation the code will reset the kubelet service start failures
on each call to the kubelet restart command.
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
Address review comments and move the helper function
in the `framework/kubelet` package to avoid circular deps
(see https://github.com/kubernetes/kubernetes/issues/81245)
Signed-off-by: Francesco Romani <fromani@redhat.com>
this patch moves the helper getCurrentKubeletConfig function,
used in both e2e and e2e_node tests and previously duplicated,
in the common framework.
There are no intended changes in behaviour.
Signed-off-by: Francesco Romani <fromani@redhat.com>
- SimpleGET: Moved to ingress sub package of e2e framework
- PollURL: Moved to ingress sub package of e2e framework
- ProxyMode: Moved to service e2e test package
- ListNamespaceEvents: Moved to e2e_node test package
- NewE2ETestNodePreparer: Removed since 59533f0cd1