To test https://github.com/kubernetes/kubernetes/issues/117745,
restart kubelet with a CSI volume mounted *and* the API server running as a
static pod.
The test heavily uses `kind` containers and the fact that it uses the API
server as a static pod.
condidering NewSerializer* funcs are deprecated with
NewSerializerWithOptions(), the test functions are adjusted to the same.
Signed-off-by: Humble Chirammal <humble.devassy@gmail.com>
This is required because an empty name is no longer supported: the
perf-dashboard is run with --allow-parsers-matching-all-tests=false with causes
perfdash to skip current configuration for BenchmarkPerfResults as it does not
have name
set (4674704f45/perfdash/metrics-downloader.go (L165-L167)).
The perf-dash config needs to be updated accordingly.
* Enable dockerized build with --use-dockerized-build=true
* Build and create test artifacts for ARM64 with --target-build-arch=arm64
* Prepull multi-arch ready container image
* Download ARM64 binaries/packages if running on ARM64 machine
`Framework` variable has been removed from test/*
unwanted `[]byte` conversion has been removed
import alias has been avoided
Signed-off-by: Humble Chirammal <humble.devassy@gmail.com>
test/e2e images have lost the parity compared the e2e suite image
versions and this commit make them in parity.
Signed-off-by: Humble Chirammal <humble.devassy@gmail.com>
The plugins get called by scheduler goroutines. At least the polling seems to
be done concurrently and thus needs locking.
Locking the PreBindPlugin state is less obvious. It might be that the scheduler
is really done with the test pod, but that ordering doesn't seem to be enough
for the race detector. It's simpler to add mutex locking.
v1.17.0 has been built at present where this commit make use
of available latest version/tag for the build purpose.
Signed-off-by: Humble Chirammal <humble.devassy@gmail.com>
a) add namespacing to metrics: fixes interference between `should scale up when one metric is missing (Pod and External metrics)` and `should not scale down when one metric is missing (Container Resource and External Metrics)` specs, cause of flakiness.
b) replaces deployments containing unused exporters (metrics ignored) with deployments without any exporters: potential fix for often hitting a rate-limit on creating metrics descriptors (429 errors), also adds clarity.
c) fixes metric types: some external metrics tests used non-average type while expecting the value to be constant regardless of the number of pods. However, queries resulting from metric specs don't filter by pods, so a sum of metrics for all the pods is the fetched metric value (https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/#autoscaling-on-metrics-not-related-to-kubernetes-objects). Adding averaging back by the number of pods fixes a couple of specs where the tests were passing for the wrong reason (wanted d ifferent test conditions).
The nightly containerd binary no longer works in the current kind base images:
May 15 16:32:31 kind-worker containerd[222]: /usr/local/bin/containerd:
/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by
/usr/local/bin/containerd)
kind now builds containerd directly with the base images. The official base
images still use containerd 1.6, so we have to use a special base image that
was prepared for this purpose.
Because the containerd config can be patched through kind, we don't need to
modify the generated node image anymore.
The goal is to only label workloads as "performance" which actually run long
enough to provide useful metrics. The throughput collector samples once per
second, so a workload should run at least 5, better 10 seconds to get at least
a minimal amount of samples for the percentile calculation.
For benchstat analysis of runs with sufficient repetitions to get statistically
meaningful results, each workload shouldn't run more than one minute, otherwise
before/after analysis becomes too slow.
The labels were chosen based on benchmark runs on a reasonably fast desktop. To
know how long each workload takes, a new "runtime_seconds" benchmark result
gets added.
This PR updates changes related references to the legacy
release bucket, excluding CHANGELOG updates.
Signed-off-by: Ricky Sadowski <richard.j.sadowski@gmail.com>
When certain status conditions are not expected, we need to see
the nested objects, but %#v doesn't handle pointers well. Output
as simple encoded JSON.
Add two new metrics to monitor the client-go logic that
generate http.Transports for the clients.
- rest_client_transport_cache_entries is a gauge metrics
with the number of existin entries in the internal cache
- rest_client_transport_create_calls_total is a counter
that increments each time a new transport is created, storing
the result of the operation needed to generate it: hit, miss
or uncacheable
Change-Id: I2d8bde25281153d8f8e8faa249385edde3c1cb39
* update serial number to a valid non-zero number in ca certificate
* fix the existing problem (0 SerialNumber in all certificate) as part of this PR in a separate commit
There are some e2e tets on networking services that depend on the
healthcheck nodeport to be available. However, the healtcheck nodeport
will be available asynchronously, so we should wait until it is
available on the tests and not fail hard if it is not.
Change-Id: I595402c070c263f0e7855ee8d5662ae975dbd1d3
The existing termination period of 600 seconds for pods on the
e2e test causes that those pods are kept running after the
test has finished. 100 seconds is a good compromise to avoid
leaving pods lingering and more than enought for the test to finish.
Change-Id: I993162a77125345df1829044dc2514e03b13a407
The default scheduler configuration must be based on the v1 API where the
plugin is enabled by default. Then if (and only if) the
DynamicResourceAllocation feature gate for a test is set, the corresponding
API group also gets enabled.
The normal dynamic resource claim controller is started if needed to create
ResourceClaims from ResourceClaimTemplates.
Without the upcoming optimizations in the scheduler, scheduling with dynamic
resources is fairly slow. The new test cases take around 15 minutes wall clock
time on my desktop.
add a new functionality to the agnhost image to run a server that
closes the connections received by sending a RST.
If a TCP servers closes the connection before all the socket is read,
it sends a RST. This implementations just reads only one byte from the
connection and closes it after that, that means that in order for this
to work the client has to send at least 2 bytes of data.
Using a simple curl is enough to trigger a RST:
curl http://127.0.0.1:8080
curl: (56) Recv failure: Connection reset by peer
Change-Id: I238fba0f790f2c92b37c732f51910a8b125f65db
image_list.go is one of the files included in the non-test variant Go build list, but its getSampleDevicePluginPod function references readDaemonSetV1OrDie function defined in device_plugin_test.go which is included in the test variant Go build list only. (The file name is *_test.go).
As a result, "go build" fails with the undefined reference error.
In practice, that may not be an issue since k8s project contributors aren't meant to run go build on this package. However, tools that depend on go build to operate - e.g., gopls or govulncheck ./... - will report this as an error.
Fix this error and make test/e2e package pass go build by moving this file to also test-only source code.
Sometimes, the kube-proxy endpoints take longer than 10 seconds to
be ready, so the initial connection check fails (via `gomega.Eventually`).
This patch uses a separate longer timeout for `gomega.Eventually`.
Signed-off-by: Ionut Balutoiu <ibalutoiu@cloudbasesolutions.com>
* Fix flaky HPA e2e tests by not failing on context cancelled
Consume requests are sent during test execution in a loop in a separate goroutine. Once the test completes, it is expected that a consumption request may be pending. Cancelling the request during cleanup should not cause test failures.
Tests started being flaky since #112923 introduced passing test context that gets cancelled during cleanup.
* Use PollUntilContextTimeout and restructure error ignoring logic
Additional test cases added:
Keeps device plugin assignments across pod and kubelet restarts (no device plugin re-registration)
Keeps device plugin assignments after the device plugin has re-registered (no kubelet or pod restart)
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Add a test suit to simulate node reboot (achieved by removing pods
using CRI API before kubelet is restarted).
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
This test captures that scenario where after kubelet restart,
application pod comes up and the device plugin pod hasn't re-registered
itself, the pod fails with admission error. It is worth noting that
once the device plugin pod has registered itself, another
application pod requesting devices ends up running
successfully.
For the test case where kubelet is restarted and device plugin
has re-registered without involving pod restart, since the
pod after kubelet restart ends up with admission error,
we cannot be certain the device that the second pod (pod2) would
get. As long as, it gets a device we consider the test to pass.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Breakdown of the steps implemented as part of this e2e test is as follows:
1. Create a file `registration` at path `/var/lib/kubelet/device-plugins/sample/`
2. Create sample device plugin with an environment variable with
`REGISTER_CONTROL_FILE=/var/lib/kubelet/device-plugins/sample/registration` that
waits for a client to delete the control file.
3. Trigger plugin registeration by deleting the abovementioned directory.
4. Create a test pod requesting devices exposed by the device plugin.
5. Stop kubelet.
6. Remove pods using CRI to ensure new pods are created after kubelet restart.
7. Restart kubelet.
8. Wait for the sample device plugin pod to be running. In this case,
the registration is not triggered.
9. Ensure that resource capacity/allocatable exported by the device plugin is zero.
10. The test pod should fail with `UnexpectedAdmissionError`
11. Delete the test pod.
12. Delete the sample device plugin pod.
13. Remove `/var/lib/kubelet/device-plugins/sample/` and its content, the directory
created to control registration
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
This commit reuses e2e tests implmented as part of https://github.com/kubernetes/kubernetes/pull/110729.
The commit is borrowed from the aforementioned PR as is to preserve
authorship. Subsequent commit will update the end to end test to
simulate the problem this PR is trying to solve by reproducing
the issue: 109595.
Co-authored-by: Francesco Romani <fromani@redhat.com>
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
This will change when adding dynamic resource allocation test cases. Instead of
changing mustSetupScheduler and StartScheduler for that, let's return the
informer factory and create informers as needed in the test.