We should evaluate the error, otherwise we risk to hang indefinately on
waiting for the `reschan` in:
64939b66c6/test/e2e_node/util.go (L419)
We also increase the timeout, because it can take a bit longer for
runtimes to determinate depending on the work they have to be done on
running containers.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
We use the label definitions in CRI-O, means we now make them public to
stop vendoring/copying this part of Kubernetes.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
We only added failed plulgins, but actually this will not work unless
we make the status with a fitError because we only copy the failured plugins
to podInfo if it is a fitError
Signed-off-by: kerthcet <kerthcet@gmail.com>
Normal binaries should never have to do this. It's not safe when there are
already some goroutines running which might do logging. Therefore the new
default is to return an error when a binary accidentally re-applies.
A few unit ensure that there are no goroutines and have to call the functions
more then once. The new ResetForTest API gets used by those to enable changing the
logging settings more than once in the same process.
Integration tests use the same code as the normal binaries. To make reuse of
that code safe, component-base/logs can be configured to silently ignore any
additional calls. This addresses data races that were found when enabling -race
for integration tests. To catch cases where the integration test does want
to modify the config, the old and new config get compared and an error is
raised when it's not the same.
To avoid having to modify all integration tests which start test servers,
reconfiguring component-base/logs is done by the test server packages.
When a pod is done, but not getting removed yet for while, then a claim that
got generated for that pod can be deleted already. This then also triggers
deallocation.
Invalid flags are detected by flag parsing, but optional arguments are just
passed through to the E2E suites. None of them support any, so rejecting them
with an error message is useful because it helps catch typos (like a missing
hyphen before a flag).
perfdash expects all data items to have the same set of labels. It then
renders drop-down buttons for each label with all values found for each
label. Previously, data items that didn't have a label didn't match any label
filter in perfdash and couldn't get selected because perfdash doesn't have
"unset" in it's drop-down menus.
To avoid that, scheduler-perf now collects all labels and then adds missing
labels with "not applicable" as value:
{
"data": {
"Average": 939.7071223010004,
"Perc50": 927.7987421383649,
"Perc90": 2166.153846153846,
"Perc95": 2363.076923076923,
"Perc99": 2520.6153846153848
},
"unit": "ms",
"labels": {
"Metric": "scheduler_pod_scheduling_duration_seconds",
"Name": "SchedulingBasic/5000Nodes/namespace-2",
"extension_point": "not applicable",
"result": "not applicable"
}
},
...
{
"data": {
"Average": 1.1172570650000004,
"Perc50": 1.1418367346938776,
"Perc90": 1.5500000000000003,
"Perc95": 1.6410256410256412,
"Perc99": 3.7333333333333334
},
"unit": "ms",
"labels": {
"Metric": "scheduler_framework_extension_point_duration_seconds",
"Name": "SchedulingBasic/5000Nodes/namespace-2",
"extension_point": "Score",
"result": "not applicable"
}
},
Because the JSON file gets written at the end of the top-level benchmark, all
data items had `BenchmarkPerfScheduling/` as prefix in the `Name` label. This
is redundant and makes it harder to see the actual name. Now that common prefix
gets removed.
CreatePod and MakePod only accepted an `isPrivileged` boolean, which made it
impossible to write tests using those helpers which work in a default
framework.Framework, because the default there is LevelRestricted.
The simple boolean gets replaced with admissionapi.Level. Passing
LevelRestricted does the same as calling e2epod.MixinRestrictedPodSecurity.
Instead of explicitly passing a constant to these modified helpers, most tests
get updated to pass f.NamespacePodSecurityLevel. This has the advantage
that if that level gets lowered in the future, tests only need to be updated in
one place.
In some cases, helpers taking client+namespace+timeouts parameters get replaced
with passing the Framework instance to get access to
f.NamespacePodSecurityEnforceLevel. These helpers don't need separate
parameters because in practice all they ever used where the values from the
Framework instance.
The post merge job was failed https://github.com/kubernetes/kubernetes/pull/117103
and this causes the e2e tests to fail. This PR retrigger the same.
Signed-off-by: Humble Chirammal <humble.devassy@gmail.com>
The namespace the crictical pod was referring to was wrong, because it
was using the generated one instead of `kube-system`. This and the
resulting test condition is now fixed.
The test seems to run only in `ci-crio-cgroupv1-node-e2e-flaky` for now.
Closes https://github.com/kubernetes/kubernetes/issues/109296
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Doing the initialization once was not good enough because it was not guaranteed
that RunCustomEtcd gets called early enough, before there are other goroutines
which use gRPC. The data race for
test/integration/apiserver.TestWatchCacheUpdatedByEtcd was:
WARNING: DATA RACE
Read at 0x00000cfffb90 by goroutine 140052:
k8s.io/kubernetes/vendor/google.golang.org/grpc/grpclog.V()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/google.golang.org/grpc/grpclog/grpclog.go:41 +0x30
k8s.io/kubernetes/vendor/google.golang.org/grpc/grpclog.(*componentData).V()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/google.golang.org/grpc/grpclog/component.go:103 +0x4e
k8s.io/kubernetes/vendor/google.golang.org/grpc/internal/transport.(*http2Client).Close()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/google.golang.org/grpc/internal/transport/http2_client.go:955 +0xca
k8s.io/kubernetes/vendor/google.golang.org/grpc/internal/transport.(*http2Client).reader()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/google.golang.org/grpc/internal/transport/http2_client.go:1619 +0xbfb
k8s.io/kubernetes/vendor/google.golang.org/grpc/internal/transport.newHTTP2Client.func11()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/google.golang.org/grpc/internal/transport/http2_client.go:394 +0x47
Previous write at 0x00000cfffb90 by goroutine 145643:
k8s.io/kubernetes/vendor/google.golang.org/grpc/grpclog.SetLoggerV2()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/google.golang.org/grpc/grpclog/loggerv2.go:75 +0x104
k8s.io/kubernetes/test/integration/framework.RunCustomEtcd.func2()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/integration/framework/etcd.go:157 +0x33
sync.(*Once).doSlow()
/usr/local/go/src/sync/once.go:74 +0x101
sync.(*Once).Do()
/usr/local/go/src/sync/once.go:65 +0x46
k8s.io/kubernetes/test/integration/framework.RunCustomEtcd()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/integration/framework/etcd.go:156 +0xb97
k8s.io/kubernetes/test/integration/apiserver.multiEtcdSetup()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/integration/apiserver/watchcache_test.go:41 +0xc4
k8s.io/kubernetes/test/integration/apiserver.TestWatchCacheUpdatedByEtcd()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/integration/apiserver/watchcache_test.go:92 +0xa9
testing.tRunner()
/usr/local/go/src/testing/testing.go:1576 +0x216
testing.(*T).Run.func1()
/usr/local/go/src/testing/testing.go:1629 +0x47
This commit removes the legacy networkpolicy tests since they now have
complete appropriate coverage in the new netpol suite.
Signed-off-by: Andrew Stoycos <astoycos@redhat.com>
We have a e2e test which want to get a rate limit error. To do so, we
sent an abnormally high amount of calls in a tight loop.
The relevant test per se is reportedly fine, but wwe need to play nicer
with *other* tests which may run just after and which need to query the API.
If the testsuite runs "too fast", it's possible an innocent test falls in the
same rate limit watch period which was saturated by the ratelimit test,
so the innocent test can still be throttled because the throttling period
is not exhausted yet, yielding false negatives, leading to flakes.
We can't reset the period for the rate limit, we just wait "long enough" to
make sure we absorb the burst and other legit queries are not rejected.
Signed-off-by: Francesco Romani <fromani@redhat.com>
This releases the underlying resource sooner and ensures that another consumer
can get scheduled without being influenced by a decision that was made for the
previous consumer.
An alternative would have been to have the apiserver trigger the deallocation
whenever it sees the `status.reservedFor` getting reduced to zero. But that
then also triggers deallocation when kube-scheduler removes the last
reservation after a failed scheduling cycle. In that case we want to keep the
claim allocated and let the kube-scheduler decide on a case-by-case basis which
claim should get deallocated.
add integration test to wait for json without value
refactor JSON condition value parsing and validating
adjusting test to reflect the error message refactoring
ginkgo.By should be used for steps in the test flow. Creating and deleting CDI
files happens in parallel to that. If reported via ginkgo.By, progress reports
look weird because they contain e.g. step "waiting for...." (from the main
test, which is still on-going) and end with "creating CDI file" (which is
already completed).
This avoids the surprise of identical authorization checks within a
policy evaluating to different decisions during the same admission
pass, and reduces the overhead of repeatedly referencing the same
authorization check.
This runs workloads that are labeled as "integration-test". The apiserver and
scheduler are only started once per unique configuration, followed by each
workload using that configuration. This makes execution faster. In contrast to
benchmarking, we care less about starting with a clean slate for each test.
Merely deleting the namespace is not enough:
- Workloads might rely on the garbage collector to get rid of obsolete objects,
so we should run it to be on the safe side.
- Pods must be force-deleted because kubelet is not running.
- Finally, the namespace controller is needed to get rid of
deleted namespaces.
* Skip terminal Pods with a deletion timestamp from the Daemonset sync
Change-Id: I64a347a87c02ee2bd48be10e6fff380c8c81f742
* Review comments and fix integration test
Change-Id: I3eb5ec62bce8b4b150726a1e9b2b517c4e993713
* Include deleted terminal pods in history
Change-Id: I8b921157e6be1c809dd59f8035ec259ea4d96301
* test comment should match the code in podgc
* Update test/integration/podgc/podgc_test.go
Co-authored-by: Michał Woźniak <mimowo@users.noreply.github.com>
* test comment should match the code in podgc
---------
Co-authored-by: Michał Woźniak <mimowo@users.noreply.github.com>
The exception comments were added due to a false positive in
staticcheck. This has since been rectified.
Signed-off-by: Madhav Jivrajani <madhav.jiv@gmail.com>
The failure message becomes nicer. Found with the new ginkgolinter, for
example:
test/e2e/apps/cronjob.go:113:3: ginkgo-linter: wrong length assertion; consider using `gomega.Expect(jobs.Items).To(gomega.BeEmpty())` instead (ginkgolinter)
gomega.Expect(jobs.Items).To(gomega.HaveLen(0))
^
The failure message becomes a bit nicer. Found by the new ginkgolinter, for
example:
test/e2e/windows/memory_limits.go:160:2: ginkgo-linter: wrong boolean assertion; consider using `gomega.Eventually(ctx, func() bool {
eventList, err := f.ClientSet.CoreV1().Events(f.Namespace.Name).List(ctx, metav1.ListOptions{})
...
}, 3*time.Minute, 10*time.Second).Should(gomega.BeTrue())` instead (ginkgolinter)
"gomega.Expect" is not the same as "assert" in C: it always has to be combined
with a statement of what is expected.
Found with the new ginkgolinter, for example:
test/e2e/node/pod_resize.go:242:3: ginkgo-linter: "Expect": missing assertion method. Expected "Should()", "To()", "ShouldNot()", "ToNot()" or "NotTo()" (ginkgolinter)
gomega.Expect(found == true)
NOTE: we are not installing the ecr-credential-provider binary
itself here we are, we need to do it out-of-band from the test
suite itself before it runs.
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
Previously, we were trying to patch the deployment's ready replicas by
changing it to 0, at the same time we have 2 available replicas.
Therefore, this test fails since the ready replicas is less than the
available replicas. As a fix, we make the number of ready replicas equals to the number
of available ones.
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
Each benchmark test case runs with a fresh etcd instance. Therefore it is not
necessary to delete objects after a run.
A future unit test might reuse etcd, therefore cleanup is optional.
Added NodeAlphaFeature:DynamicResourceAllocation to the Node DRA test
to fix failing containerd serial jobs. Those jobs skip tests labeled
with NodeAlphaFeature.
Removed NodeFeature:DynamicResourceAllocation label from the
tests to fix cos-cgroupv1/v2-containerd-node-e2e-serial CI jobs.
It turned out that labeling DRA Node tests as NodeFeature was
a mistake. Re-labeling with NodeAlphaFeature would not work either.
It would fail certain containerd jobs as DRA requires containerd >= 1.7
When running iscsi test, use dbus socket from the host. targetcli uses the
socket for synchronization.
Recent Fedoras can run dbus only via systemd, which is cumbersome here.
Once DeferCleanup for the worker goroutine is invoked, there's no need to
continue doing anything anymore in that goroutine and it can return
immediately, without reporting the "context canceled" error because there is no
other reason for that.
This test requires consistent CPU consumption for 3 minutes
to pass. Consumption on a single Pod is more consistent than
split across multiple Pods: no temporary usage drops in aggregate.
Ginkgo changed the noColor command line arg to be no-color and will
issue the following warning:
You're using deprecated Ginkgo functionality:
=============================================
--noColor is deprecated, use --no-color instead
Fix this by changing all occurrences accordingly.
Fixes issue 115945 by moving the cleanup code in AfterEach into DeferCleanup.
Cleanup stanzas are now paired with their setup stanzas within the body
of the BeforeEach and are now guarenteed to run in the correct order.
Prior to this there was no guarantee that the goroutine to recycle
unbound PVs had finished before the AfterEach began.
Since the feature is GA and locked to true, tests can no longer set it
to false. Cleaning up by removing all references to this feature gate
from tests.
Feature gate will be removed in v1.29.
T.Setenv ensures that the environment is returned to its prior state
when the test ends. It also panics when called from a parallel test to
prevent racy test interdependencies.
ReadWriteOncePod feature needs min requirement of 1.27 kubelet, add the
tag to skip test if kubelet version is smaller than 1.27
Change-Id: I27959156db90f2477cead6dfc16f42dbc54663bc
If kubelet plugin registration fails, it would be good to know more about the
communication with kubelet. Capturing the GRPC calls and then checking that
makes the failure messages more informative. Here's an example where a failure
was triggered by temporarily modifying the check so that it didn't find the
call:
[FAILED] Timed out after 30.000s.
Expected:
<[]app.GRPCCall | len:2, cap:2>: [
{
FullMethod: "/pluginregistration.Registration/GetInfo",
Request:
{},
Response:
endpoint: /var/lib/kubelet/plugins/test-driver/dra.sock
name: test-driver.cdi.k8s.io
supported_versions:
- 1.0.0
type: DRAPlugin,
Err: nil,
},
{
FullMethod: "/pluginregistration.Registration/NotifyRegistrationStatus",
Request:
plugin_registered: true,
Response:
{},
Err: nil,
},
]
to contain successful NotifyRegistrationStatus call
the e2e framwork use active loops to wait for certain async operations,
these loops need to retry on some operations and fail in others.
For the functions that depend on some operations to happen, the
apiserver may return 503 errors until that specific service is
available, so we should retry on those too.
Change-Id: Ib3d194184f6385b9d3d151c7055f27c97c21c3ff
Applying it to the entire spec included cleaning up, which makes predicting the
acceptable duration harder because it includes code not owned by the test
itself. It's better to specify a timeout only for the test code itself.
To test https://github.com/kubernetes/kubernetes/issues/117745,
restart kubelet with a CSI volume mounted *and* the API server running as a
static pod.
The test heavily uses `kind` containers and the fact that it uses the API
server as a static pod.
condidering NewSerializer* funcs are deprecated with
NewSerializerWithOptions(), the test functions are adjusted to the same.
Signed-off-by: Humble Chirammal <humble.devassy@gmail.com>
This is required because an empty name is no longer supported: the
perf-dashboard is run with --allow-parsers-matching-all-tests=false with causes
perfdash to skip current configuration for BenchmarkPerfResults as it does not
have name
set (4674704f45/perfdash/metrics-downloader.go (L165-L167)).
The perf-dash config needs to be updated accordingly.
* Enable dockerized build with --use-dockerized-build=true
* Build and create test artifacts for ARM64 with --target-build-arch=arm64
* Prepull multi-arch ready container image
* Download ARM64 binaries/packages if running on ARM64 machine
`Framework` variable has been removed from test/*
unwanted `[]byte` conversion has been removed
import alias has been avoided
Signed-off-by: Humble Chirammal <humble.devassy@gmail.com>
test/e2e images have lost the parity compared the e2e suite image
versions and this commit make them in parity.
Signed-off-by: Humble Chirammal <humble.devassy@gmail.com>
The plugins get called by scheduler goroutines. At least the polling seems to
be done concurrently and thus needs locking.
Locking the PreBindPlugin state is less obvious. It might be that the scheduler
is really done with the test pod, but that ordering doesn't seem to be enough
for the race detector. It's simpler to add mutex locking.
v1.17.0 has been built at present where this commit make use
of available latest version/tag for the build purpose.
Signed-off-by: Humble Chirammal <humble.devassy@gmail.com>
a) add namespacing to metrics: fixes interference between `should scale up when one metric is missing (Pod and External metrics)` and `should not scale down when one metric is missing (Container Resource and External Metrics)` specs, cause of flakiness.
b) replaces deployments containing unused exporters (metrics ignored) with deployments without any exporters: potential fix for often hitting a rate-limit on creating metrics descriptors (429 errors), also adds clarity.
c) fixes metric types: some external metrics tests used non-average type while expecting the value to be constant regardless of the number of pods. However, queries resulting from metric specs don't filter by pods, so a sum of metrics for all the pods is the fetched metric value (https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/#autoscaling-on-metrics-not-related-to-kubernetes-objects). Adding averaging back by the number of pods fixes a couple of specs where the tests were passing for the wrong reason (wanted d ifferent test conditions).
The nightly containerd binary no longer works in the current kind base images:
May 15 16:32:31 kind-worker containerd[222]: /usr/local/bin/containerd:
/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by
/usr/local/bin/containerd)
kind now builds containerd directly with the base images. The official base
images still use containerd 1.6, so we have to use a special base image that
was prepared for this purpose.
Because the containerd config can be patched through kind, we don't need to
modify the generated node image anymore.
The goal is to only label workloads as "performance" which actually run long
enough to provide useful metrics. The throughput collector samples once per
second, so a workload should run at least 5, better 10 seconds to get at least
a minimal amount of samples for the percentile calculation.
For benchstat analysis of runs with sufficient repetitions to get statistically
meaningful results, each workload shouldn't run more than one minute, otherwise
before/after analysis becomes too slow.
The labels were chosen based on benchmark runs on a reasonably fast desktop. To
know how long each workload takes, a new "runtime_seconds" benchmark result
gets added.
This PR updates changes related references to the legacy
release bucket, excluding CHANGELOG updates.
Signed-off-by: Ricky Sadowski <richard.j.sadowski@gmail.com>
When certain status conditions are not expected, we need to see
the nested objects, but %#v doesn't handle pointers well. Output
as simple encoded JSON.
Add two new metrics to monitor the client-go logic that
generate http.Transports for the clients.
- rest_client_transport_cache_entries is a gauge metrics
with the number of existin entries in the internal cache
- rest_client_transport_create_calls_total is a counter
that increments each time a new transport is created, storing
the result of the operation needed to generate it: hit, miss
or uncacheable
Change-Id: I2d8bde25281153d8f8e8faa249385edde3c1cb39
This touches cases where FromInt() is used on numeric constants, or
values which are already int32s, or int variables which are defined
close by and can be changed to int32s with little impact.
Signed-off-by: Stephen Kitt <skitt@redhat.com>
* update serial number to a valid non-zero number in ca certificate
* fix the existing problem (0 SerialNumber in all certificate) as part of this PR in a separate commit
There are some e2e tets on networking services that depend on the
healthcheck nodeport to be available. However, the healtcheck nodeport
will be available asynchronously, so we should wait until it is
available on the tests and not fail hard if it is not.
Change-Id: I595402c070c263f0e7855ee8d5662ae975dbd1d3
The existing termination period of 600 seconds for pods on the
e2e test causes that those pods are kept running after the
test has finished. 100 seconds is a good compromise to avoid
leaving pods lingering and more than enought for the test to finish.
Change-Id: I993162a77125345df1829044dc2514e03b13a407
The default scheduler configuration must be based on the v1 API where the
plugin is enabled by default. Then if (and only if) the
DynamicResourceAllocation feature gate for a test is set, the corresponding
API group also gets enabled.
The normal dynamic resource claim controller is started if needed to create
ResourceClaims from ResourceClaimTemplates.
Without the upcoming optimizations in the scheduler, scheduling with dynamic
resources is fairly slow. The new test cases take around 15 minutes wall clock
time on my desktop.
add a new functionality to the agnhost image to run a server that
closes the connections received by sending a RST.
If a TCP servers closes the connection before all the socket is read,
it sends a RST. This implementations just reads only one byte from the
connection and closes it after that, that means that in order for this
to work the client has to send at least 2 bytes of data.
Using a simple curl is enough to trigger a RST:
curl http://127.0.0.1:8080
curl: (56) Recv failure: Connection reset by peer
Change-Id: I238fba0f790f2c92b37c732f51910a8b125f65db
image_list.go is one of the files included in the non-test variant Go build list, but its getSampleDevicePluginPod function references readDaemonSetV1OrDie function defined in device_plugin_test.go which is included in the test variant Go build list only. (The file name is *_test.go).
As a result, "go build" fails with the undefined reference error.
In practice, that may not be an issue since k8s project contributors aren't meant to run go build on this package. However, tools that depend on go build to operate - e.g., gopls or govulncheck ./... - will report this as an error.
Fix this error and make test/e2e package pass go build by moving this file to also test-only source code.
Sometimes, the kube-proxy endpoints take longer than 10 seconds to
be ready, so the initial connection check fails (via `gomega.Eventually`).
This patch uses a separate longer timeout for `gomega.Eventually`.
Signed-off-by: Ionut Balutoiu <ibalutoiu@cloudbasesolutions.com>
* Fix flaky HPA e2e tests by not failing on context cancelled
Consume requests are sent during test execution in a loop in a separate goroutine. Once the test completes, it is expected that a consumption request may be pending. Cancelling the request during cleanup should not cause test failures.
Tests started being flaky since #112923 introduced passing test context that gets cancelled during cleanup.
* Use PollUntilContextTimeout and restructure error ignoring logic
Additional test cases added:
Keeps device plugin assignments across pod and kubelet restarts (no device plugin re-registration)
Keeps device plugin assignments after the device plugin has re-registered (no kubelet or pod restart)
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Add a test suit to simulate node reboot (achieved by removing pods
using CRI API before kubelet is restarted).
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
This test captures that scenario where after kubelet restart,
application pod comes up and the device plugin pod hasn't re-registered
itself, the pod fails with admission error. It is worth noting that
once the device plugin pod has registered itself, another
application pod requesting devices ends up running
successfully.
For the test case where kubelet is restarted and device plugin
has re-registered without involving pod restart, since the
pod after kubelet restart ends up with admission error,
we cannot be certain the device that the second pod (pod2) would
get. As long as, it gets a device we consider the test to pass.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Breakdown of the steps implemented as part of this e2e test is as follows:
1. Create a file `registration` at path `/var/lib/kubelet/device-plugins/sample/`
2. Create sample device plugin with an environment variable with
`REGISTER_CONTROL_FILE=/var/lib/kubelet/device-plugins/sample/registration` that
waits for a client to delete the control file.
3. Trigger plugin registeration by deleting the abovementioned directory.
4. Create a test pod requesting devices exposed by the device plugin.
5. Stop kubelet.
6. Remove pods using CRI to ensure new pods are created after kubelet restart.
7. Restart kubelet.
8. Wait for the sample device plugin pod to be running. In this case,
the registration is not triggered.
9. Ensure that resource capacity/allocatable exported by the device plugin is zero.
10. The test pod should fail with `UnexpectedAdmissionError`
11. Delete the test pod.
12. Delete the sample device plugin pod.
13. Remove `/var/lib/kubelet/device-plugins/sample/` and its content, the directory
created to control registration
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
This commit reuses e2e tests implmented as part of https://github.com/kubernetes/kubernetes/pull/110729.
The commit is borrowed from the aforementioned PR as is to preserve
authorship. Subsequent commit will update the end to end test to
simulate the problem this PR is trying to solve by reproducing
the issue: 109595.
Co-authored-by: Francesco Romani <fromani@redhat.com>
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
This will change when adding dynamic resource allocation test cases. Instead of
changing mustSetupScheduler and StartScheduler for that, let's return the
informer factory and create informers as needed in the test.
Entire test cases and workloads can have labels attached to them. The union of
these must match the label filter which works as in GitHub. The benchmark by
default runs the tests that are labeled "performance", which is the same as
before.
Capture explicitly a test case pertaining to kubelet restart
but with no pod restart and device plugin re-registration.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Based on whether the test case requires pod restart or not, the sleep
interval needs to be updated and we define constants to represent the two
sleep intervals that can be used in the corresponding test cases.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Co-authored-by: Francesco Romani <fromani@redhat.com>
Explicitly state that the test involves kubelet restart and device plugin
re-registration (no pod restart)
We remove the part of the code where we wait for the pod to restart as this
test case should no longer involve pod restart.
In addition to that, we use `waitForNodeReady` instead of `WaitForAllNodesSchedulable`
for ensuring that the node is ready for pods to be scheduled on it.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Co-authored-by: Francesco Romani <fromani@redhat.com>
Rather than testing out for both pod restart and kubelet restart,
we change the tests to just handle pod restart scenario.
Clarify the test purpose and add extra check to tighten the test.
We would be adding additional tests to cover kubelet restart scenarios
in subsequent commits.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
With this change the error message are more helpful and easier
to troubleshoot in case of test failures.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
We rename to make the intent more explicit;
We make it global to be able to reuse the value all across the module
(e.g. to check the node readiness) later on.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Co-authored-by: Francesco Romani <fromani@redhat.com>
Rather than only returning a string forcing us to log failure with
`framework.Fail`, we return a string and error to handle error cases
more conventionally. This enables us to use the `parseLog` function
inside `Eventually` and `Consistently` blocks, or in general to delegate
the error processing and enable better composability.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Co-authored-by: Francesco Romani <fromani@redhat.com>
The previous solution had some shortcomings:
- It was based on the assumption that the goroutine gets woken up at regular
intervals. This is not actually guaranteed. Now the code keeps track of the
actual start and end of an interval and verifies that assumption.
- If no pod was scheduled (unlikely, but could happen), then
"0 pods/s" got recorded. In such a case, the metric was always either
zero or >= 1. A better solution is to extend the interval
until some pod gets scheduled. With the larger time interval
it is then possible to also track, for example, 0.5 pods/s.
This allows us to return with a timeout error as soon as the
context is canceled. Previously in cases where the mount will
never succeed pods can get stuck deleting for 2 minutes.
In the Sync*Pod methods that call VolumeManager.WaitFor*, we
must filter out wait.Interrupted errors from being logged as
they are part of control flow, not runtime problems. Any
early interruption should result in exiting the Sync*Pod method
as quickly as possible without logging intermediate errors.
Copied and modified RemoveString function from
k/k/pkg/util/slice/slice.go to e2e/framework/pod/pod_client.go
This is the last dependency from e2e framework to k/k/pkg/util
Copied and modified pod format function from
k/k/pkg/kubelet/util/format/pod.go to e2e/framework/pod/pod_client.go
This is the last dependency from e2e framework to k/k/pkg/kubelet
It's conceptually wrong to have dependencies to k/k/pkg in
the e2e framework code. They should be moved to corresponding
packages, in this particular case to the test/e2e_node.
* Convert file but warn user with impossible conversions
* Only continuing for NotRegisteredErrors. Using iostreams for warning user instead of stdError
* Formatting, correct tests to use valid DNS-1035.
By generating the unique name in advance, the label also can be set to a
matching value directly in the Create request. This makes test startup in
test/integration/scheduler_perf a bit faster because the extra patching can be
avoided.
It also leads to a better label because previously, the unique label value
didn't match the node name. This is required for simulating dynamic resource
allocation, which relies on the label to track where an allocated claim is
available.
Add a regression test for https://issues.k8s.io/116925. The test
exercises the following:
1) Start a restart never pod which will exit with
`v1.PodSucceeded` phase.
2) Start a graceful deletion of the pod (set a deletion timestamp)
3) Restart the kubelet as soon as the kubelet reports the pod is
terminal (but before the pod is deleted).
4) Verify that after kubelet restart, the pod is deleted.
As of v1.27, there is a delay between the pod being marked terminal
phaes, and the status manager deleting the pod. If the kubelet is
restarted in the middle, after starting up again, the kubelet needs to
ensure the pod will be deleted on the API server.
Signed-off-by: David Porter <david@porter.me>
It makes sense to define a test where, depending on the parameters, some
operation creations zero pods, namespaces or nodes. The validation didn't allow
that previously due to the way how it was implemented although the underlying
code works fine with zero as count.
collector.collect got called without ensuring that collector.run had
terminated, so it could have happened that collector.run adds another sample
while collector.collect is reading them.
test-e2e-node for AWS is out-of-tree so that we won't need to vendor
in AWS related packages. For this to work, some of the scripts/golang
code need to know where the k8s tree is git cloned.
So let's add an option to lookup the env var, so that we can then,
change directory to this specified directory to run some make commands
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
grpclog.SetLoggerV is not thread-safe and may only be called before code starts
using GRPC. Calling RunCustomEtcd multiple times, for example in
k8s.io/kubernetes/test/integration/apiserver.TestWatchCacheUpdatedByEtcd,
causes a data race:
WARNING: DATA RACE
Read at 0x00000c8e8d20 by goroutine 135612:
k8s.io/kubernetes/vendor/google.golang.org/grpc/grpclog.V()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/google.golang.org/grpc/grpclog/grpclog.go:41 +0x30
k8s.io/kubernetes/vendor/google.golang.org/grpc/grpclog.(*componentData).V()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/google.golang.org/grpc/grpclog/component.go:103 +0x4e
k8s.io/kubernetes/vendor/google.golang.org/grpc/internal/transport.(*loopyWriter).run.func1()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/google.golang.org/grpc/internal/transport/controlbuf.go:528 +0xf1
runtime.deferreturn()
/home/prow/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.20.2.linux.amd64/src/runtime/panic.go:476 +0x32
k8s.io/kubernetes/vendor/google.golang.org/grpc/internal/transport.newHTTP2Client.func6()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/google.golang.org/grpc/internal/transport/http2_client.go:442 +0x112
Previous write at 0x00000c8e8d20 by goroutine 140228:
k8s.io/kubernetes/vendor/google.golang.org/grpc/grpclog.SetLoggerV2()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/google.golang.org/grpc/grpclog/loggerv2.go:76 +0xc6a
k8s.io/kubernetes/test/integration/framework.RunCustomEtcd()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/integration/framework/etcd.go:153 +0xb89
k8s.io/kubernetes/test/integration/apiserver.multiEtcdSetup()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/integration/apiserver/watchcache_test.go:40 +0xac
k8s.io/kubernetes/test/integration/apiserver.TestWatchCacheUpdatedByEtcd()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/integration/apiserver/watchcache_test.go:88 +0x4a
testing.tRunner()
/home/prow/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.20.2.linux.amd64/src/testing/testing.go:1576 +0x216
testing.(*T).Run.func1()
/home/prow/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.20.2.linux.amd64/src/testing/testing.go:1629 +0x47
Users can pass resources into `kubectl events` command via `--for` flag,
if they have desire to only get events for the resource they specify.
However, current `kubectl events` does not support passing fully qualified
names(e.g. `replicasets.apps`, `cronjobs.v1.batch`, etc.). This PR adds support
for this.
The newly added `MirrorPodWithGracePeriod when create a mirror pod and
the container runtime is temporarily down during pod termination` test
is currently flaking because in some cases when it is run there are
other pods from other tests that are still in progress of being
terminated. This results in the test failing because it asserts metrics
that assume that there is only one pod running on the node.
To fix the flake, prior to starting the test, verify that no pods exist
in the api server other then the newly created mirror pod.
Signed-off-by: David Porter <david@porter.me>
this check needs to go after any mutations. After the mutating admission chain, rest.BeforeUpdate (which is responsible for reverting updates to immutable timestamp fields, among other things.) is called in the store.Update function. Without moving this check, it will be possible for an object to be written to etcd with only a change to its managed fields timestamp.
when adding a DisruptionTarget condition into a pod that will be deleted
- handle ResourceVersion and Preconditions correctly
- handle DryRun option correctly
Co-authored-by: Jordan Liggitt jordan@liggitt.net
* Slightly changed pod spec to repro issue #116262
* Refactor test to ensure that the static pod is deleted even if the
test fails
Signed-off-by: David Porter <david@porter.me>
We now get structured output using jsonpath for the
name and version fields of the node object and then
compare the outputs.
Signed-off-by: Madhav Jivrajani <madhav.jiv@gmail.com>
* api changes adding match conditions
* feature gate and registry strategy to drop fields
* matchConditions logic for admission webhooks
* feedback
* update test
* import order
* bears.com
* update fail policy ignore behavior
* update docs and matcher to hold fail policy as non-pointer
* update matcher error aggregation, fix early fail failpolicy ignore, update docs
* final cleanup
* openapi gen