Additional test cases added:
Keeps device plugin assignments across pod and kubelet restarts (no device plugin re-registration)
Keeps device plugin assignments after the device plugin has re-registered (no kubelet or pod restart)
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Add a test suit to simulate node reboot (achieved by removing pods
using CRI API before kubelet is restarted).
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
This test captures that scenario where after kubelet restart,
application pod comes up and the device plugin pod hasn't re-registered
itself, the pod fails with admission error. It is worth noting that
once the device plugin pod has registered itself, another
application pod requesting devices ends up running
successfully.
For the test case where kubelet is restarted and device plugin
has re-registered without involving pod restart, since the
pod after kubelet restart ends up with admission error,
we cannot be certain the device that the second pod (pod2) would
get. As long as, it gets a device we consider the test to pass.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Breakdown of the steps implemented as part of this e2e test is as follows:
1. Create a file `registration` at path `/var/lib/kubelet/device-plugins/sample/`
2. Create sample device plugin with an environment variable with
`REGISTER_CONTROL_FILE=/var/lib/kubelet/device-plugins/sample/registration` that
waits for a client to delete the control file.
3. Trigger plugin registeration by deleting the abovementioned directory.
4. Create a test pod requesting devices exposed by the device plugin.
5. Stop kubelet.
6. Remove pods using CRI to ensure new pods are created after kubelet restart.
7. Restart kubelet.
8. Wait for the sample device plugin pod to be running. In this case,
the registration is not triggered.
9. Ensure that resource capacity/allocatable exported by the device plugin is zero.
10. The test pod should fail with `UnexpectedAdmissionError`
11. Delete the test pod.
12. Delete the sample device plugin pod.
13. Remove `/var/lib/kubelet/device-plugins/sample/` and its content, the directory
created to control registration
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
This commit reuses e2e tests implmented as part of https://github.com/kubernetes/kubernetes/pull/110729.
The commit is borrowed from the aforementioned PR as is to preserve
authorship. Subsequent commit will update the end to end test to
simulate the problem this PR is trying to solve by reproducing
the issue: 109595.
Co-authored-by: Francesco Romani <fromani@redhat.com>
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
This will change when adding dynamic resource allocation test cases. Instead of
changing mustSetupScheduler and StartScheduler for that, let's return the
informer factory and create informers as needed in the test.
Entire test cases and workloads can have labels attached to them. The union of
these must match the label filter which works as in GitHub. The benchmark by
default runs the tests that are labeled "performance", which is the same as
before.
Capture explicitly a test case pertaining to kubelet restart
but with no pod restart and device plugin re-registration.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Based on whether the test case requires pod restart or not, the sleep
interval needs to be updated and we define constants to represent the two
sleep intervals that can be used in the corresponding test cases.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Co-authored-by: Francesco Romani <fromani@redhat.com>
Explicitly state that the test involves kubelet restart and device plugin
re-registration (no pod restart)
We remove the part of the code where we wait for the pod to restart as this
test case should no longer involve pod restart.
In addition to that, we use `waitForNodeReady` instead of `WaitForAllNodesSchedulable`
for ensuring that the node is ready for pods to be scheduled on it.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Co-authored-by: Francesco Romani <fromani@redhat.com>
Rather than testing out for both pod restart and kubelet restart,
we change the tests to just handle pod restart scenario.
Clarify the test purpose and add extra check to tighten the test.
We would be adding additional tests to cover kubelet restart scenarios
in subsequent commits.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
With this change the error message are more helpful and easier
to troubleshoot in case of test failures.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
We rename to make the intent more explicit;
We make it global to be able to reuse the value all across the module
(e.g. to check the node readiness) later on.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Co-authored-by: Francesco Romani <fromani@redhat.com>
Rather than only returning a string forcing us to log failure with
`framework.Fail`, we return a string and error to handle error cases
more conventionally. This enables us to use the `parseLog` function
inside `Eventually` and `Consistently` blocks, or in general to delegate
the error processing and enable better composability.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Co-authored-by: Francesco Romani <fromani@redhat.com>
The previous solution had some shortcomings:
- It was based on the assumption that the goroutine gets woken up at regular
intervals. This is not actually guaranteed. Now the code keeps track of the
actual start and end of an interval and verifies that assumption.
- If no pod was scheduled (unlikely, but could happen), then
"0 pods/s" got recorded. In such a case, the metric was always either
zero or >= 1. A better solution is to extend the interval
until some pod gets scheduled. With the larger time interval
it is then possible to also track, for example, 0.5 pods/s.
This allows us to return with a timeout error as soon as the
context is canceled. Previously in cases where the mount will
never succeed pods can get stuck deleting for 2 minutes.
In the Sync*Pod methods that call VolumeManager.WaitFor*, we
must filter out wait.Interrupted errors from being logged as
they are part of control flow, not runtime problems. Any
early interruption should result in exiting the Sync*Pod method
as quickly as possible without logging intermediate errors.
Copied and modified RemoveString function from
k/k/pkg/util/slice/slice.go to e2e/framework/pod/pod_client.go
This is the last dependency from e2e framework to k/k/pkg/util
Copied and modified pod format function from
k/k/pkg/kubelet/util/format/pod.go to e2e/framework/pod/pod_client.go
This is the last dependency from e2e framework to k/k/pkg/kubelet
It's conceptually wrong to have dependencies to k/k/pkg in
the e2e framework code. They should be moved to corresponding
packages, in this particular case to the test/e2e_node.
* Convert file but warn user with impossible conversions
* Only continuing for NotRegisteredErrors. Using iostreams for warning user instead of stdError
* Formatting, correct tests to use valid DNS-1035.
By generating the unique name in advance, the label also can be set to a
matching value directly in the Create request. This makes test startup in
test/integration/scheduler_perf a bit faster because the extra patching can be
avoided.
It also leads to a better label because previously, the unique label value
didn't match the node name. This is required for simulating dynamic resource
allocation, which relies on the label to track where an allocated claim is
available.
Add a regression test for https://issues.k8s.io/116925. The test
exercises the following:
1) Start a restart never pod which will exit with
`v1.PodSucceeded` phase.
2) Start a graceful deletion of the pod (set a deletion timestamp)
3) Restart the kubelet as soon as the kubelet reports the pod is
terminal (but before the pod is deleted).
4) Verify that after kubelet restart, the pod is deleted.
As of v1.27, there is a delay between the pod being marked terminal
phaes, and the status manager deleting the pod. If the kubelet is
restarted in the middle, after starting up again, the kubelet needs to
ensure the pod will be deleted on the API server.
Signed-off-by: David Porter <david@porter.me>
It makes sense to define a test where, depending on the parameters, some
operation creations zero pods, namespaces or nodes. The validation didn't allow
that previously due to the way how it was implemented although the underlying
code works fine with zero as count.
collector.collect got called without ensuring that collector.run had
terminated, so it could have happened that collector.run adds another sample
while collector.collect is reading them.
test-e2e-node for AWS is out-of-tree so that we won't need to vendor
in AWS related packages. For this to work, some of the scripts/golang
code need to know where the k8s tree is git cloned.
So let's add an option to lookup the env var, so that we can then,
change directory to this specified directory to run some make commands
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
grpclog.SetLoggerV is not thread-safe and may only be called before code starts
using GRPC. Calling RunCustomEtcd multiple times, for example in
k8s.io/kubernetes/test/integration/apiserver.TestWatchCacheUpdatedByEtcd,
causes a data race:
WARNING: DATA RACE
Read at 0x00000c8e8d20 by goroutine 135612:
k8s.io/kubernetes/vendor/google.golang.org/grpc/grpclog.V()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/google.golang.org/grpc/grpclog/grpclog.go:41 +0x30
k8s.io/kubernetes/vendor/google.golang.org/grpc/grpclog.(*componentData).V()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/google.golang.org/grpc/grpclog/component.go:103 +0x4e
k8s.io/kubernetes/vendor/google.golang.org/grpc/internal/transport.(*loopyWriter).run.func1()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/google.golang.org/grpc/internal/transport/controlbuf.go:528 +0xf1
runtime.deferreturn()
/home/prow/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.20.2.linux.amd64/src/runtime/panic.go:476 +0x32
k8s.io/kubernetes/vendor/google.golang.org/grpc/internal/transport.newHTTP2Client.func6()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/google.golang.org/grpc/internal/transport/http2_client.go:442 +0x112
Previous write at 0x00000c8e8d20 by goroutine 140228:
k8s.io/kubernetes/vendor/google.golang.org/grpc/grpclog.SetLoggerV2()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/google.golang.org/grpc/grpclog/loggerv2.go:76 +0xc6a
k8s.io/kubernetes/test/integration/framework.RunCustomEtcd()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/integration/framework/etcd.go:153 +0xb89
k8s.io/kubernetes/test/integration/apiserver.multiEtcdSetup()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/integration/apiserver/watchcache_test.go:40 +0xac
k8s.io/kubernetes/test/integration/apiserver.TestWatchCacheUpdatedByEtcd()
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/integration/apiserver/watchcache_test.go:88 +0x4a
testing.tRunner()
/home/prow/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.20.2.linux.amd64/src/testing/testing.go:1576 +0x216
testing.(*T).Run.func1()
/home/prow/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.20.2.linux.amd64/src/testing/testing.go:1629 +0x47
Users can pass resources into `kubectl events` command via `--for` flag,
if they have desire to only get events for the resource they specify.
However, current `kubectl events` does not support passing fully qualified
names(e.g. `replicasets.apps`, `cronjobs.v1.batch`, etc.). This PR adds support
for this.
The newly added `MirrorPodWithGracePeriod when create a mirror pod and
the container runtime is temporarily down during pod termination` test
is currently flaking because in some cases when it is run there are
other pods from other tests that are still in progress of being
terminated. This results in the test failing because it asserts metrics
that assume that there is only one pod running on the node.
To fix the flake, prior to starting the test, verify that no pods exist
in the api server other then the newly created mirror pod.
Signed-off-by: David Porter <david@porter.me>
this check needs to go after any mutations. After the mutating admission chain, rest.BeforeUpdate (which is responsible for reverting updates to immutable timestamp fields, among other things.) is called in the store.Update function. Without moving this check, it will be possible for an object to be written to etcd with only a change to its managed fields timestamp.
when adding a DisruptionTarget condition into a pod that will be deleted
- handle ResourceVersion and Preconditions correctly
- handle DryRun option correctly
Co-authored-by: Jordan Liggitt jordan@liggitt.net
* Slightly changed pod spec to repro issue #116262
* Refactor test to ensure that the static pod is deleted even if the
test fails
Signed-off-by: David Porter <david@porter.me>
We now get structured output using jsonpath for the
name and version fields of the node object and then
compare the outputs.
Signed-off-by: Madhav Jivrajani <madhav.jiv@gmail.com>
* api changes adding match conditions
* feature gate and registry strategy to drop fields
* matchConditions logic for admission webhooks
* feedback
* update test
* import order
* bears.com
* update fail policy ignore behavior
* update docs and matcher to hold fail policy as non-pointer
* update matcher error aggregation, fix early fail failpolicy ignore, update docs
* final cleanup
* openapi gen
This PR makes the NodePrepareResources() and NodeUnprepareResource()
calls of the kubeletplugin API for DynamicResourceAllocation
symmetrical. It wasn't clear how one would use the set of CDIDevices
passed back in the NodeUnprepareResource() of the v1alpha1 API, and the
new API now passes back the full ResourceHandle that was originally
passed to the Prepare() call. Passing the ResourceHandle is strictly
more informative and a plugin could always (re)derive the set of
CDIDevice from it.
This is a breaking change, but this release is scheduled to break
multiple APIs for DynamicResourceAllocation, so it makes sense to do
this now instead of later.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
They contain some nice-to-have improvements (for example, better printing of
errors with gomega/format.Object) but nothing that is critical right now.
"go mod tidy" was run manually in
staging/src/k8s.io/kms/internal/plugins/mock (https://github.com/kubernetes/kubernetes/pull/116613
not merged yet).
This change updates KMS v2 to not create a new DEK for every
encryption. Instead, we re-use the DEK while the key ID is stable.
Specifically:
We no longer use a random 12 byte nonce per encryption. Instead, we
use both a random 4 byte nonce and an 8 byte nonce set via an atomic
counter. Since each DEK is randomly generated and never re-used,
the combination of DEK and counter are always unique. Thus there
can never be a nonce collision. AES GCM strongly encourages the use
of a 12 byte nonce, hence the additional 4 byte random nonce. We
could leave those 4 bytes set to all zeros, but there is no harm in
setting them to random data (it may help in some edge cases such as
live VM migration).
If the plugin is not healthy, the last DEK will be used for
encryption for up to three minutes (there is no difference on the
behavior of reads which have always used the DEK cache). This will
reduce the impact of a short plugin outage while making it easy to
perform storage migration after a key ID change (i.e. simply wait
ten minutes after the key ID change before starting the migration).
The DEK rotation cycle is performed in sync with the KMS v2 status
poll thus we always have the correct information to determine if a
read is stale in regards to storage migration.
Signed-off-by: Monis Khan <mok@microsoft.com>
The name "PodScheduling" was unusual because in contrast to most other names,
it was impossible to put an article in front of it. Now PodSchedulingContext is
used instead.
Annotation
As part of this change, kube-proxy accepts any value for either
annotation that is not "disabled".
Change-Id: Idfc26eb4cc97ff062649dc52ed29823a64fc59a4
The kube-apiserver validation expects the Count of an EventSeries to be
at least 2, otherwise it rejects the Event. There was is discrepancy
between the client and the server since the client was iniatizing an
EventSeries to a count of 1.
According to the original KEP, the first event emitted should have an
EventSeries set to nil and the second isomorphic event should have an
EventSeries with a count of 2. Thus, we should matcht the behavior
define by the KEP and update the client.
Also, as an effort to make the old clients compatible with the servers,
we should allow Events with an EventSeries count of 1 to prevent any
unexpected rejections.
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
Implement DOS prevention wiring a global rate limit for podresources
API. The goal here is not to introduce a general ratelimiting solution
for the kubelet (we need more research and discussion to get there),
but rather to prevent misuse of the API.
Known limitations:
- the rate limits value (QPS, BurstTokens) are hardcoded to
"high enough" values.
Enabling user-configuration would require more discussion
and sweeping changes to the other kubelet endpoints, so it
is postponed for now.
- the rate limiting is global. Malicious clients can starve other
clients consuming the QPS quota.
Add e2e test to exercise the flow, because the wiring itself
is mostly boilerplate and API adaptation.
We have quite a few podresources e2e tests and, as the feature
progresses to GA, we should consider moving them to NodeConformance.
Unfortunately most of them require linux-specific features not in the
test themselves but in the test prelude (fixture) to check or create the
node conditions (e.g. presence or not of devices, online CPUS...) to be
verified in the test proper.
For this reason we promote only a single test for starters.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Since we can't rely on the test runner and hosts under test to
be on the same machine, we write to the terminate log from each
container and concatenate the results.
If a CRI error occurs during the terminating phase after a pod is
force deleted (API or static) then the housekeeping loop will not
deliver updates to the pod worker which prevents the pod's state
machine from progressing. The pod will remain in the terminating
phase but no further attempts to terminate or cleanup will occur
until the kubelet is restarted.
The pod worker now maintains a store of the pods state that it is
attempting to reconcile and uses that to resync unknown pods when
SyncKnownPods() is invoked, so that failures in sync methods for
unknown pods no longer hang forever.
The pod worker's store tracks desired updates and the last update
applied on podSyncStatuses. Each goroutine now synchronizes to
acquire the next work item, context, and whether the pod can start.
This synchronization moves the pending update to the stored last
update, which will ensure third parties accessing pod worker state
don't see updates before the pod worker begins synchronizing them.
As a consequence, the update channel becomes a simple notifier
(struct{}) so that SyncKnownPods can coordinate with the pod worker
to create a synthetic pending update for unknown pods (i.e. no one
besides the pod worker has data about those pods). Otherwise the
pending update info would be hidden inside the channel.
In order to properly track pending updates, we have to be very
careful not to mix RunningPods (which are calculated from the
container runtime and are missing all spec info) and config-
sourced pods. Update the pod worker to avoid using ToAPIPod()
and instead require the pod worker to directly use
update.Options.Pod or update.Options.RunningPod for the
correct methods. Add a new SyncTerminatingRuntimePod to prevent
accidental invocations of runtime only pod data.
Finally, fix SyncKnownPods to replay the last valid update for
undesired pods which drives the pod state machine towards
termination, and alter HandlePodCleanups to:
- terminate runtime pods that aren't known to the pod worker
- launch admitted pods that aren't known to the pod worker
Any started pods receive a replay until they reach the finished
state, and then are removed from the pod worker. When a desired
pod is detected as not being in the worker, the usual cause is
that the pod was deleted and recreated with the same UID (almost
always a static pod since API UID reuse is statistically
unlikely). This simplifies the previous restartable pod support.
We are careful to filter for active pods (those not already
terminal or those which have been previously rejected by
admission). We also force a refresh of the runtime cache to
ensure we don't see an older version of the state.
Future changes will allow other components that need to view the
pod worker's actual state (not the desired state the podManager
represents) to retrieve that info from the pod worker.
Several bugs in pod lifecycle have been undetectable at runtime
because the kubelet does not clearly describe the number of pods
in use. To better report, add the following metrics:
kubelet_desired_pods: Pods the pod manager sees
kubelet_active_pods: "Admitted" pods that gate new pods
kubelet_mirror_pods: Mirror pods the kubelet is tracking
kubelet_working_pods: Breakdown of pods from the last sync in
each phase, orphaned state, and static or not
kubelet_restarted_pods_total: A counter for pods that saw a
CREATE before the previous pod with the same UID was finished
kubelet_orphaned_runtime_pods_total: A counter for pods detected
at runtime that were not known to the kubelet. Will be
populated at Kubelet startup and should never be incremented
after.
Add a metric check to our e2e tests that verifies the values are
captured correctly during a serial test, and then verify them in
detail in unit tests.
Adds 23 series to the kubelet /metrics endpoint.
Add some additional init container tests that work via monitoring
container lifetime based on logs written to a common file. This allows
more easily writing assertions about the container lifetimes with
respect to one another.
6f2cd1b5bd swapped the order of cancel() and
closeFn() so that closeFn got called first when the test was done. This caused
it to block while waiting for goroutines which themselves were waiting for
the context cancellation. The test still shut down, it just took ~86s instead
of ~30s.
The fix is to register the cancel twice: once as soon as the context is
created (to clean up in case of an unexpected panic) and once after
closeFn (because then it'll get called first, as before).
The test creates a Service exposing two protocols on the same port
and a backend that replies on both protocols.
1. Test that Service with works for both protocol
2. Update Service to expose only the TCP port
3. Verify that TCP works and UDP does not work
4. Update Service to expose only the UDP port
5. Verify that TCP does not work and UDP does work
Change-Id: Ic4f3a6509e332aa5694d20dfc3b223d7063a7871
Test 2 scenarios:
- pod can connect to a terminating pods
- terminating pod can connect to other pods
Change-Id: Ia5dc4e7370cc055df452bf7cbaddd9901b4d229d
v1.Container is still changing a log which caused the test to fail each time a
new field was added. To test loading, let's better use something that is
unlikely to change. The runtimev1.VersionResponse gets logged by kubelet and
seems to be stable.
The benchmarks and unit tests were written so that they used custom APIs for
each log format. This made them less realistic because there were subtle
differences between the benchmark and a real Kubernetes component. Now all
logging configuration is done with the official
k8s.io/component-base/logs/api/v1.
To make the different test cases more comparable, "messages/s" is now reported
instead of the generic "ns/op".
When trying again with recent log files from the CI job, it was found that some
JSON messages get split across multiple lines, both in container logs and in
the systemd journal:
2022-12-21T07:09:47.914739996Z stderr F {"ts":1671606587914.691,"caller":"rest/request.go:1169","msg":"Response ...
2022-12-21T07:09:47.914984628Z stderr F 70 72 6f 78 79 10 01 1a 13 53 ... \".|\n","v":8}
Note the different time stamp on the second line. That first line is
long (17384 bytes). This seems to happen because the data must pass through a
stream-oriented pipe and thus may get split up by the Linux kernel.
The implication is that lines must get merged whenever the JSON decoder
encounters an incomplete line. The benchmark loader now supports that. To
simplifies this, stripping the non-JSON line prefixes must be done before using
a log as test data.
The updated README explains how to do that when downloading a CI job
result. The amount of manual work gets reduced by committing symlinks under
data to the expected location under ci-kubernetes-kind-e2e-json-logging and
ignoring them when the data is not there.
Support for symlinks gets removed and path/filepath is used instead of path
because it has better Windows support.
A Service can use multiple EndpointSlices for its backend, when
using custom Endpoint Slices, the data plane should forward traffic
to any of the endpoints in the Endpointslices that belong to the
Service.
Change-Id: I80b42522bf6ab443050697a29b94d8245943526f
* migrating pkg/controller/serviceaccount to contextual logging
Signed-off-by: Naman <namanlakhwani@gmail.com>
* small nit
Signed-off-by: Naman <namanlakhwani@gmail.com>
* capitalising first letter of error
Signed-off-by: Naman <namanlakhwani@gmail.com>
* addressed review comments
Signed-off-by: Naman <namanlakhwani@gmail.com>
* small nit to add key
Signed-off-by: Naman <namanlakhwani@gmail.com>
---------
Signed-off-by: Naman <namanlakhwani@gmail.com>
* migrated controller/replicaset to contextual logging
Signed-off-by: Naman <namanlakhwani@gmail.com>
* small nits
Signed-off-by: Naman <namanlakhwani@gmail.com>
* addressed changes
Signed-off-by: Naman <namanlakhwani@gmail.com>
* small nit
Signed-off-by: Naman <namanlakhwani@gmail.com>
* taking t as input
Signed-off-by: Naman <namanlakhwani@gmail.com>
---------
Signed-off-by: Naman <namanlakhwani@gmail.com>
Bump the timeout as the previous timeout was sometimes too short,
resulting in the pod status update not sent. Also, fixed a typo in
previous refactor.
Signed-off-by: David Porter <david@porter.me>
Breakdown of the steps implemented as part of this e2e test is as follows:
1. Create a file `registration` at path `/var/lib/kubelet/device-plugins/sample/`
2. Create sample device plugin with an environment variable with
`REGISTER_CONTROL_FILE=/var/lib/kubelet/device-plugins/sample/registration` that
waits for a client to delete the control file.
3. Trigger plugin registeration by deleting the abovementioned directory.
4. Create a test pod requesting devices exposed by the device plugin.
5. Stop kubelet.
6. Remove pods using CRI to ensure new pods are created after kubelet restart.
7. Restart kubelet.
8. Wait for the sample device plugin pod to be running. In this case,
the registration is not triggered.
9. Ensure that resource capacity/allocatable exported by the device plugin is zero.
10. The test pod should fail with `UnexpectedAdmissionError`
11. Delete the test pod.
12. Delete the sample device plugin pod.
13. Remove `/var/lib/kubelet/device-plugins/sample/` and its content, the directory
created to control registration
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
This commit reuses e2e tests implmented as part of https://github.com/kubernetes/kubernetes/pull/110729.
The commit is borrowed from the aforementioned PR as is to preserve
authorship. Subsequent commit will update the end to end test to
simulate the problem this PR is trying to solve by reproducing
the issue: 109595.
Co-authored-by: Francesco Romani <fromani@redhat.com>
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Add node e2e test to verify that static pods can be started after a
previous static pod with the same config temporarily failed termination.
The scenario is:
1. Static pod is started
2. Static pod is deleted
3. Static pod termination fails (internally `syncTerminatedPod` fails)
4. At later time, pod termination should succeed
5. New static pod with the same config is (re)-added
6. New static pod is expected to start successfully
To repro this scenario, setup a pod using a NFS mount. The NFS server is
stopped which will result in volumes failing to unmount and
`syncTerminatedPod` to fail. The NFS server is later started, allowing
the volume to unmount successfully.
xref:
1. https://github.com/kubernetes/kubernetes/pull/113145#issuecomment-1289587988
2. https://github.com/kubernetes/kubernetes/pull/113065
3. https://github.com/kubernetes/kubernetes/pull/113093
Signed-off-by: David Porter <david@porter.me>
Add a node e2e to verify that if a static pod is terminated while the
container runtime or CRI returns an error, the pod is eventually
terminated successfully.
This test serves as a regression test for k8s.io/issue/113145 which
fixes an issue where force deleted pods may not be terminated if the
container runtime fails during a `syncTerminatingPod`.
To test this behavior, start a static pod, stop the container runtime,
and later start the container runtime. The static pod is expected to
eventually terminate successfully.
To start and stop the container runtime, we need to find the container
runtime systemd unit name. Introduce a util function
`findContainerRuntimeServiceName` which finds the unit name by getting
the pid of the container runtime from the existing
`ContainerRuntimeProcessName` flag passed into node e2e and using
systemd dbus `GetUnitNameByPID` function to convert the pid of the
container runtime to a unit name. Using the unit name, introduce helper
functions to start and stop the container runtime.
Signed-off-by: David Porter <david@porter.me>
The existing path is incorrect (missing `sample-device-plugin`)
directory and thus causing test failures. The full path should be
`test/e2e/testing-manifests/sample-device-plugin/sample-device-plugin.yaml`.
Signed-off-by: David Porter <david@porter.me>
Update go-jose from v2.2.2 to v2.6.0.
This is to make the kubernetes code compatible with newer go-jose versions that have a small breaking change (`jwt.NewNumericDate()` returns a pointer).
Signed-off-by: Max Goltzsche <max.goltzsche@gmail.com>
This improves performance of the text formatting and ktesting.
Because ktesting no longer buffers messages by default, one unit
test needs to ask for that explicitly.
When running as part of the scheduler_perf benchmark testing, we want to print
less information by default, so we should use V to limit verbosity
Pretty-printing doesn't belong into "application" code. I am moving that into
the ktesting formatting (https://github.com/kubernetes/kubernetes/pull/116180).
Update the sample device plugin to enable the e2e node tests (or any
other entity with full access to the node filesystem) to control the
registration process. We add a new environment variable `REGISTER_CONTROL_FILE`.
The value of this variable must be a file which prevents the plugin
to register itself while it's present. Once removed, the plugin will
go on and complete the registration. The plugin will automatically
detect the parent directory on which the file resides and detect
deletions, unblocking the registration process. If the file is specified
but unaccessible, the plugin will fail. If the file is not specified,
the registration process will progress as usual and never pause.
The plugin will need read access to the parent directory.
This feature is useful because it is not possible to control the order
in which the pods are recovered after node reboot/kubelet restart.
In this approach, the testing environment will create a directory and
then a empty file to pause the registration process of the plugin.
Once pointed to that file, the plugin will start and wait for it to
be deleted. Only after the directory has been deleted,
the plugin would proceed to registration.
This feature is used in #114640 where e2e test is implemented to
simulate scenarios where application pods requesting devices come up before
the device plugin pod on node reboot/ kubelet restart.
Co-authored-by: Francesco Romani <fromani@redhat.com>
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
This replaces the pretty useless us/op metric (useless because it includes
setup and teardown times) with the same values that also get stored in the JSON
file.
The main advantage is that benchstat can be used to analyze and compare
results.
The upstream ktesting has to be very flexible to accommodate different ways of
using it. In Kubernetes, we can be opinionated and make certain choices, like
using klog flags, and only those.
In contrast to EtcdMain, it can be called by individual tests or benchmarks and
each caller will get a fresh etcd instance. However, it uses the same
underlying code and the same port for all instances, so tests cannot run in
parallel.
Add the following ginkgo flags for each node e2e similar to the
existing hack/ginkgo-e2e.sh script.
* --no-color, colors aren't rendered properly in prow and make examining
the log in text editors more difficult, so let's disable them.
`hack/ginkgo-e2e.sh` (used for kind e2e tests) also disables them
already.
* -v, enable verbose logs. This is needed so we get more detailed info
even when the tests pass. This is useful so we can compare successful
runs to failed runs.
Signed-off-by: David Porter <david@porter.me>
When running multiple node e2e with multiple machine images, the tests
are run separately for each node. The final build log has all of the
results for each of the hosts combined together which make debugging the
log difficult. To make it easier, emit a log for each host that was run.
This log will be written to the results directory and uploaded as an
artifact in prow jobs.
Signed-off-by: David Porter <david@porter.me>
This was never being used, the only config that used it was deleted in
https://github.com/kubernetes/test-infra/pull/26017 so we don't need
this anymore, so let's delete it.
Signed-off-by: David Porter <david@porter.me>
1. Scheduler bug-fix + scheduler-focussed E2E tests
2. Add cgroup v2 support for in-place pod resize
3. Enable full E2E pod resize test for containerd>=1.6.9 and EventedPLEG related changes.
Co-Authored-By: Vinay Kulkarni <vskibum@gmail.com>
1. Core Kubelet changes to implement In-place Pod Vertical Scaling.
2. E2E tests for In-place Pod Vertical Scaling.
3. Refactor kubelet code and add missing tests (Derek's kubelet review)
4. Add a new hash over container fields without Resources field to allow feature gate toggling without restarting containers not using the feature.
5. Fix corner-case where resize A->B->A gets ignored
6. Add cgroup v2 support to pod resize E2E test.
KEP: /enhancements/keps/sig-node/1287-in-place-update-pod-resources
Co-authored-by: Chen Wang <Chen.Wang1@ibm.com>
1. Define ContainerResizePolicy and add it to Container struct.
2. Add ResourcesAllocated and Resources fields to ContainerStatus struct.
3. Define ResourcesResizeStatus and add it to PodStatus struct.
4. Add InPlacePodVerticalScaling feature gate and drop disabled fields.
5. ResizePolicy validation & defaulting and Resources mutability for CPU/Memory.
6. Various fixes from code review feedback (originally committed on Apr 12, 2022)
KEP: /enhancements/keps/sig-node/1287-in-place-update-pod-resources
All wrappers except for ExpectNoError are identical to their gomega
counterparts. The only advantage that they have is that their invocations are
shorter.
That advantage does not outweigh their disadvantages:
- cannot be used in combination with gomega.Eventually/Consistently
- not a full replacement for gomega, so we just end up using both
- don't support passing a stack offset and thus cannot be used in helper
functions
- ginkgolinter does not work for them, so sub-optimal calls like this one
are not reported:
framework.ExpectEqual(len(items), 0)
->
gomega.Expect(items).To(gomega.BeEmpty())
- developers try to make do with what's available in the framework, leading
to sub-optimal checks like this:
framework.ExpectEqual(true, strings.Contains(event.Message, expectedEventError), "Event error should indicate non-root policy caused container to not start")
->
gomega.Expect(event.Message).To(gomega.ContainSubstring(expectedEventError), "Event error should indicate non-root policy caused container to not start")
So let's remove these wrappers. As a first step they get marked as deprecated.
This enables stricter
linting (https://github.com/kubernetes/kubernetes/pull/109728), once enabled,
to report new code which uses them.
All of these issues were reported by https://github.com/nunnatsa/ginkgolinter.
Fixing these issues is useful (several expressions get simpler, using
framework.ExpectNoError is better because it has additional support for
failures) and a necessary step for enabling that linter in our golangci-lint
invocation.
* refine: the server-side http Request Body is always non-nil
* revert changes under vendor
* Update staging/src/k8s.io/pod-security-admission/cmd/webhook/server/server.go
Co-authored-by: Jordan Liggitt <jordan@liggitt.net>
* Update main.go
---------
Co-authored-by: Jordan Liggitt <jordan@liggitt.net>
Fix the waiting logic in the e2e test loop to wait
for resources to be reported again instead of making logic on the
timestamp. The idea is that waiting for resource availability
is the canonical way clients should observe the desired state,
and it should also be more robust than comparing timestamps,
especially on CI environments.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Start to consolidate the sample device plugin utility
and constants in a central place, because we need
to use it in different e2e tests.
Having a central dependency is better than a maze of
entangled e2e tests depending on each other helpers.
Signed-off-by: Francesco Romani <fromani@redhat.com>
The podresources e2e tests want to exercise the case on which a device
plugin doesn't report topology affinity. The only known device plugin
which had this requirement and didn't depend on specialized hardware
was the kubevirt device plugin, which was however deprecated after
we started using it.
So the e2e tests are now broken, and in any case they can't depend on
unmaintained and liable to be obsolete code.
To unblock the state and preserve some e2e signal, we switch to the
sample device plugin, which is a stub implementation and which is
managed in-tree, so we can maintain it and ensure it fits the e2e test
usecase.
This is however a regression, because e2e tests should try their hardest
to use real devices and avoid any mocking or faking.
The upside is that using a OS-neutral device plugin for the tests enables
us to run on all the supported platform (windows!) so this could allow
us to transition these tests to conformance.
Signed-off-by: Francesco Romani <fromani@redhat.com>
rename getPodResources for clarity. Allow to return error
(and not use ginkgo expectations), so it can actually be used
as intended inside `Eventually` blocks without blow up at the
first failure.
Signed-off-by: Francesco Romani <fromani@redhat.com>
`f framework.Framework` does not need to be global, it's used only on a few
places.
This fixes vSphereDriver.PrepareTest() in in_tree.go that schedules
ginkgo.DeferCleanup() that uses the global `f` variable, but its value is not
valid at the time of ginkgo cleanup.
we should only use this env var for `arm`, since `arm64` is fully
supported by etcd folks, let us drop this!
(ex - https://github.com/etcd-io/etcd/releases/tag/v3.5.6)
ppc64le comment should be dropped as well
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
The check for "resources available on a node" must treat nodes that are not
listed as "no resources available". The previous logic only worked because all
nodes were listed during E2E testing. The upcoming integration testing is
covering additional scenarios and triggered this broken case.
It provides more readable output and has additional APIs for using it inside a
unit test. goleak.IgnoreCurrent is needed to filter out the goroutine that gets
started when importing go.opencensus.io/stats/view.
In order to handle background goroutines that get created on demand and cannot
be stopped (like the one for LogzHealth), a helper function ensures that those
are running before calling goleak.IgnoreCurrent. Keeping those goroutines
running is not a problem and thus not worth the effort of adding new APIs to
stop them.
Other goroutines are genuine leaks for which no fix is available. Those get
suppressed via IgnoreTopFunction, which works as long as that function
is unique enough.
Example output for the leak fixed in https://github.com/kubernetes/kubernetes/pull/115423:
E0202 09:30:51.641841 74789 etcd.go:205] "EtcdMain goroutine check" err=<
found unexpected goroutines:
[Goroutine 4889 in state chan receive, with k8s.io/apimachinery/pkg/watch.(*Broadcaster).loop on top of the stack:
goroutine 4889 [chan receive]:
k8s.io/apimachinery/pkg/watch.(*Broadcaster).loop(0xc0076183c0)
/nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/watch/mux.go:268 +0x65
created by k8s.io/apimachinery/pkg/watch.NewBroadcaster
/nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/watch/mux.go:77 +0x116
>
- test/e2e/framework/*.go should have very minimal dependencies.
We can enforce that via import-boss.
- What each test/e2e/framework/* sub-package uses is less relevant,
although ideally it also should be as minimal as possible in each case.
Enforcing this via import-boss ensures that new dependencies get flagged as
problem and thus will get additional scrutiny. It might be okay to add them,
but it needs to be considered.
Since the pod names are reused across the test, searching the logs is
currently difficult.
Use a uuid for each pod name to make grepping the logs easier. Also,
always include the pod name and pod namespace in any logs or error
messages to make debugging easier.
Signed-off-by: David Porter <david@porter.me>
The previous approach was based on the observation that some Prow jobs use the
--report-dir parameter instead of the E2E_REPORT_DIR env variable. Parsing the
command line was necessary to use the --json-report and --junit-report
parameters.
But that is complex and can be avoided by triggering the creation of complete
reports in the E2E test suite. The paths are hard-coded and relative to the
report directory to keep the code simple.
There was a report that k8s-triage started processing more data after
6db4b741dd was merged. It's unclear whether
that was because of the new <report-dir>/ginkgo_report.xml file. To avoid
this potential problem, the reports are now in a "ginkgo" sub-directory.
While at it, error checking gets enhanced:
- Create directories at the start of
the suite and bail out early if that fails.
- *All* e2e suites using the framework do this, not just test/e2e.
- Added missing error checking of truncated JUnit report writing.
Currently `kubectl debug` only supports passing names in command line.
However, users might want to pass resources in files by passing `-f` flag like
in all other kubectl commands.
This PR adds this ability.
With the change of the CRI-O jobs to use butane, we now have a
verification for base64 data urls in place. This means that the
following URL is invalid:
```
data:text/plain;base64,GCE_SSH_PUBLIC_KEY_FILE_CONTENT
```
This means we have to pass valid base64 to the URL. To fix that, we now
allow to inject SSH key values with both, the
`GCE_SSH_PUBLIC_KEY_FILE_CONTENT` field and its base64 encoded variant.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
The NPD test checks when NPD started to determine if it is needed to
check the kubelet start event. The current logic requires parsing the
journalctl logs which is quite fragile and is broken now because of
systemd changing the expected log format.
Newer versions of systemd do not print "end at" or "logs begin at" and
instead may print "No entries", which will result in the test panicking.
```
$ journalctl -u foo.service
-- No entries --
```
For units started, it will not print "end at" or "logs begin at":
```
root@ubuntu-jammy:~# journalctl -u foo.service
Feb 08 22:02:19 ubuntu-jammy systemd[1]: Started /usr/bin/sleep 1s.
Feb 08 22:02:20 ubuntu-jammy systemd[1]: foo.service: Deactivated successfully.
```
To avoid relying on log parsing which is fragile, let's instead directly
ask systemd when the NPD service started and parse the resulting
timestamp.
Signed-off-by: David Porter <david@porter.me>
There is a test in wait.sh integration suite which is checking the
given timeout value(passed by user) is equal to actual happened timeout value.
However, this test rarely gets `no matching resources found` error and
causes flakyiness. The reason is we are running wait command, immediately
after applying deployment. In reality, timeout test does not care about
deployment, since it is testing the timeout by passing invalid configurations.
But we need this deployment to not get `no matching resources found` error.
That's why, this PR adds deployment assertion before executing wait command.
The recently introduced failure handling in ExpectNoError depends on error
wrapping: if an error prefix gets added with `fmt.Errorf("foo: %v", err)`, then
ExpectNoError cannot detect that the root cause is an assertion failure and
then will add another useless "unexpected error" prefix and will not dump the
additional failure information (currently the backtrace inside the E2E
framework).
Instead of manually deciding on a case-by-case basis where %w is needed, all
error wrapping was updated automatically with
sed -i "s/fmt.Errorf\(.*\): '*\(%s\|%v\)'*\",\(.* err)\)/fmt.Errorf\1: %w\",\3/" $(git grep -l 'fmt.Errorf' test/e2e*)
This may be unnecessary in some cases, but it's not wrong.
Instead of pod responses being printed to the log each time polling fails, we
get a consolidated failure message with all unexpected pod responses if (and
only if) the check times out or a progress report gets produced.
This renames PodsResponding to WaitForPodsResponding for the sake of
consistency and adds a timeout parameter. That is necessary because some other
users of NewProxyResponseChecker used a much lower timeout (2min vs. 15min).
Besides simplifying some code, it also makes it easier to rewrite
ProxyResponseChecker because it only gets used in WaitForPodsResponding.
WaitForPodToDisappear was always called such that it listed all pods, which
made it less efficient than trying to get just the one pod it was checking for.
Being able to customize the poll interval in practice wasn't useful, therefore
it can be replaced with WaitForPodNotFoundInNamespace.
WaitForPods is now a generic function which lists pods and then checks the pods
that it found against some provided condition. A parameter determines how many
pods must be found resp. match the condition for the check to succeed.
The code becomes simpler (78 insertions, 91 deletions), easier to read (all
code entirely inside WaitForPodsRunningReady, no need to declare and later
overwrite variables) and possibly more correct (if all API calls failed,
the resulting error was ignored when allowedNotReadyPods > 0).
None of the users of the functions passed anything other than nil or an empty
map and the implementation ignore the parameter - it seems like a candidate for
simplification.
When a Gomega failure is converted to an error, the stack at the time when the
failure occurs may be useful: error wrapping provides some bread crumbs that
can be followed to determine where the failure really occurred, but error
wrapping may be missing or ambiguous.
To provide the additional information, a FailureError now includes a full stack
backtrace. The backtrace intentionally makes no attempt to exclude framework
functions besides the gomega support itself because helpers like
e2e/framework/pod may be relevant.
That backtrace is not included in the failure message for the sake of
brevity. Instead, it gets logged as part of the test's output.
gomega.Eventually provides better progress reports: instead of filling up the
log with rather useless one-line messages that are not enough to to understand
the current state, it integrates with Gingko's progress reporting (SIGUSR1,
--poll-progress-after) and then dumps the same complete failure message as
after a timeout. That makes it possible to understand why progress isn't
getting made without having to wait for the timeout.
The other advantage is that the failure message for some unexpected pod state
becomes more readable: instead of encapsulating it as "observed object" inside
an error, it directly gets rendered by gomega.
Calling gomega.Expect/Eventually/Consistently deep inside a helper call chain
has several challenges:
- the stack offset must be tracked correctly, otherwise the callstack
for the failure starts at some helper code, which is often not informative
- augmenting the failure message with additional information from each
caller implies that each caller must pass down a string and/or format
string plus arguments
Both challenges can be solved by returning errors:
- the stacktrace is taken at that level where the error is
treated as a failure instead of passing back an error, i.e.
inside the It callback
- traditional error wrapping can add additional information, if
desirable
What was missing was some easy way to generate an error via a gomega
assertion. The new infrastructure achieves that by mirroring the
Gomega/Assertion/AsyncAssertion interfaces with errors as return values instead
of calling a fail handler.
It is intentionally less flexible than the gomega APIs:
- A context must be passed to Eventually/Consistently as first
parameter because that is needed for proper timeout handling.
- No additional text can be added to the failure through this
API because error wrapping is meant to be used for this.
- No need to adjust the callstack offset because no backtrace
is recorded when a failure occurs.
To avoid the useless "unexpected error" log message when passing back a gomega
failure, ExpectNoError gets extended to recognize such errors and then skips
the logging.
Calling WaitForPodTerminatedInNamespace after testFlexVolume is useless because
the client pod that it waits for always gets deleted by testVolumeClient:
0fcc3dbd55/test/e2e/framework/volume/fixtures.go (L541-L546)
Worse, because WaitForPodTerminatedInNamespace treats "not found" as "must keep
polling", these two tests always kept waiting for 5 minutes:
Kubernetes e2e suite: [It] [sig-storage] Flexvolumes should be mountable
when non-attachable 6m4s
The only reason why these tests passed is that WaitForPodTerminatedInNamespace
used to return the "not found" API error. That is not guaranteed and about to
change.
Unknown pods are pods which are unknown pods to the kubelet, but are still
running in the container runtime. If kubelet detects a pod which is not in
the config (i.e. not present in API-server or static pod), but running as
detected in container runtime, kubelet should aggressively terminate the pod.
This situation can be encountered if a pod is running, then kubelet is
stopped, and while stopped, the manifest is deleted (by force deleting the
API pod or deleting the static pod manifest), and then restarting the
kubelet. Upon restart, kubelet will see the pod as running via the container
runtime, but it will not be present in the config, thus making the pod a
"unknown pod". Kubelet should then proceed to terminate these unknown pods.
Add two tests that ensure that unknown pods will be terminated (1)
static pods and (2) API pods. The test will start a pod, stop the
kubelet, force delete the pod (by deleting the manifest or force
deleting the pod), and then restarting the kubelet. The container
runtime is then queried to ensure the containers are terminated by
kubelet.
Signed-off-by: David Porter <david@porter.me>
This is used across multiple tests, so let's move into the util file.
Also, refactor it a bit to provide a better error message in case of a
failure.
Signed-off-by: David Porter <david@porter.me>
* Add integration tests for StatefulSetStartOrdinal feature
* Move expensive test setup (apiserver and running controller) to be run once in StatefulSetStartOrdinal parameterized tests
Image versions are already explicitly set in our manifests
configuration, so tests should not be setting these values
to ensure we're using the same versions across the board.
Adds integration tests for the following scenarios with
MultiCIDRRangeAllocator enabled:
- ClusterCIDR is released when an associated node is deleted.
- ClusterCIDR delete when a node is associated, validate the finalizer
behavior, make sure that deleted ClusterCIDR is cleaned up after the
associated node is deleted.
- ClusterCIDR marked as terminating due to deletion must not be used for
allocating PodCIDRs to new nodes.
- Tie break behavior when multiple ClusterCIDRs are eligible to
allocate PodCIDRs to a node.
The device plugin test expects that no other pods are running prior to
the test starting. However, it has been observed that in some cases
some resources may still be around from previous tests. This is because
the deletion of resources from other tests is handled by deleting that
test's framework's namespace which is done asynchronously without
waiting for the other test's namespace to be deleted.
As a result, when the node e2e device plugin starts, there may still be
other pods in process of termination. To work around this, add a retry
to the device plugin test to account for the time it takes to delete the
resources from the prior test.
Signed-off-by: David Porter <david@porter.me>
In the `NodeSupportsPreconfiguredRuntimeClassHandler`, update the check
for the runtime handler to return a failure if the
`/etc/containerd/config.toml` or `/etc/crio/crio.conf` config files do
not exist. If an error is returned, then the underlying test will be
skipped.
Test manually with starting a kind cluster and moving the containerd
config file and verifying that the test is skipped:
```
$ docker exec -it kind-worker /bin/bash
root@kind-worker:/# mv /etc/containerd/config.toml /etc/containerd/config.toml.bak
```
```
make WHAT="test/e2e/e2e.test"
$ ./_output/bin/e2e.test -kubeconfig /tmp/kubeconfig_kind -ginkgo.focus=".*should run a Pod requesting a RuntimeClass with a configured handler.*" --num-nodes=1 2>&1 -ginkgo.v=1 | tee -i "/tmp/build-log.txt"
[sig-node] RuntimeClass [It] should run a Pod requesting a RuntimeClass with a configured handler [NodeFeature:RuntimeHandler]
test/e2e/common/node/runtimeclass.go:85
[SKIPPED] Skipping test as node does not have E2E runtime class handler preconfigured in container runtime config: command terminated with exit code 1
```
Signed-off-by: David Porter <david@porter.me>
The recently introduced failure handling in ExpectNoError depends on error
wrapping: if an error prefix gets added with `fmt.Errorf("foo: %v", err)`, then
ExpectNoError cannot detect that the root cause is an assertion failure and
then will add another useless "unexpected error" prefix and will not dump the
additional failure information (currently the backtrace inside the E2E
framework).
Instead of manually deciding on a case-by-case basis where %w is needed, all
error wrapping was updated automatically with
sed -i "s/fmt.Errorf\(.*\): '*\(%s\|%v\)'*\",\(.* err)\)/fmt.Errorf\1: %w\",\3/" $(git grep -l 'fmt.Errorf' test/e2e*)
This may be unnecessary in some cases, but it's not wrong.
Instead of pod responses being printed to the log each time polling fails, we
get a consolidated failure message with all unexpected pod responses if (and
only if) the check times out or a progress report gets produced.
This renames PodsResponding to WaitForPodsResponding for the sake of
consistency and adds a timeout parameter. That is necessary because some other
users of NewProxyResponseChecker used a much lower timeout (2min vs. 15min).
Besides simplifying some code, it also makes it easier to rewrite
ProxyResponseChecker because it only gets used in WaitForPodsResponding.
WaitForPodToDisappear was always called such that it listed all pods, which
made it less efficient than trying to get just the one pod it was checking for.
Being able to customize the poll interval in practice wasn't useful, therefore
it can be replaced with WaitForPodNotFoundInNamespace.
The test validates the following endpoints
- deleteApiregistrationV1CollectionAPIService
- patchApiregistrationV1APIServiceStatus
- replaceApiregistrationV1APIService
- replaceApiregistrationV1APIServiceStatus
WaitForPods is now a generic function which lists pods and then checks the pods
that it found against some provided condition. A parameter determines how many
pods must be found resp. match the condition for the check to succeed.
The code becomes simpler (78 insertions, 91 deletions), easier to read (all
code entirely inside WaitForPodsRunningReady, no need to declare and later
overwrite variables) and possibly more correct (if all API calls failed,
the resulting error was ignored when allowedNotReadyPods > 0).
None of the users of the functions passed anything other than nil or an empty
map and the implementation ignore the parameter - it seems like a candidate for
simplification.
When a Gomega failure is converted to an error, the stack at the time when the
failure occurs may be useful: error wrapping provides some bread crumbs that
can be followed to determine where the failure really occurred, but error
wrapping may be missing or ambiguous.
To provide the additional information, a FailureError now includes a full stack
backtrace. The backtrace intentionally makes no attempt to exclude framework
functions besides the gomega support itself because helpers like
e2e/framework/pod may be relevant.
That backtrace is not included in the failure message for the sake of
brevity. Instead, it gets logged as part of the test's output.
gomega.Eventually provides better progress reports: instead of filling up the
log with rather useless one-line messages that are not enough to to understand
the current state, it integrates with Gingko's progress reporting (SIGUSR1,
--poll-progress-after) and then dumps the same complete failure message as
after a timeout. That makes it possible to understand why progress isn't
getting made without having to wait for the timeout.
The other advantage is that the failure message for some unexpected pod state
becomes more readable: instead of encapsulating it as "observed object" inside
an error, it directly gets rendered by gomega.
Calling gomega.Expect/Eventually/Consistently deep inside a helper call chain
has several challenges:
- the stack offset must be tracked correctly, otherwise the callstack
for the failure starts at some helper code, which is often not informative
- augmenting the failure message with additional information from each
caller implies that each caller must pass down a string and/or format
string plus arguments
Both challenges can be solved by returning errors:
- the stacktrace is taken at that level where the error is
treated as a failure instead of passing back an error, i.e.
inside the It callback
- traditional error wrapping can add additional information, if
desirable
What was missing was some easy way to generate an error via a gomega
assertion. The new infrastructure achieves that by mirroring the
Gomega/Assertion/AsyncAssertion interfaces with errors as return values instead
of calling a fail handler.
It is intentionally less flexible than the gomega APIs:
- A context must be passed to Eventually/Consistently as first
parameter because that is needed for proper timeout handling.
- No additional text can be added to the failure through this
API because error wrapping is meant to be used for this.
- No need to adjust the callstack offset because no backtrace
is recorded when a failure occurs.
To avoid the useless "unexpected error" log message when passing back a gomega
failure, ExpectNoError gets extended to recognize such errors and then skips
the logging.
Calling WaitForPodTerminatedInNamespace after testFlexVolume is useless because
the client pod that it waits for always gets deleted by testVolumeClient:
0fcc3dbd55/test/e2e/framework/volume/fixtures.go (L541-L546)
Worse, because WaitForPodTerminatedInNamespace treats "not found" as "must keep
polling", these two tests always kept waiting for 5 minutes:
Kubernetes e2e suite: [It] [sig-storage] Flexvolumes should be mountable
when non-attachable 6m4s
The only reason why these tests passed is that WaitForPodTerminatedInNamespace
used to return the "not found" API error. That is not guaranteed and about to
change.
The `runPausePod` timeout was 1 minute previously which appears to be
too short and timing out in some tests.
Switch to `f.Timeouts.PodStartShort` which is the common timeout used to wait
for pods to start which defaults to 5min.
Also refactor to remove `runPausePodWithoutTimeout` and instead rely on
`runPausePod` since we do not make the timeout customizable directly
(it can be changed via the test framework if desired).
Signed-off-by: David Porter <david@porter.me>
Adds integration tests for the following scenarios with
MultiCIDRRangeAllocator enabled:
- ClusterCIDR is released when an associated node is deleted.
- ClusterCIDR delete when a node is associated, validate the finalizer
behavior, make sure that deleted ClusterCIDR is cleaned up after the
associated node is deleted.
- ClusterCIDR marked as terminating due to deletion must not be used for
allocating Pod CIDRs to new nodes.
- Tie break behavior when multiple ClusterCIDRs are eligible to
allocate Pod CIDRs to a node.
PVCs using the ReadWriteOncePod access mode can only be referenced by a
single pod. When a pod is scheduled that uses a ReadWriteOncePod PVC,
return "Unschedulable" if the PVC is already in-use in the cluster.
To support preemption, the "VolumeRestrictions" scheduler plugin
computes cycle state during the PreFilter phase. This cycle state
contains the number of references to the ReadWriteOncePod PVCs used by
the pod-to-be-scheduled.
During scheduler simulation (AddPod and RemovePod), we add and remove
reference counts from the cycle state if they use any of these
ReadWriteOncePod PVCs.
In the Filter phase, the scheduler checks if there are any PVC reference
conflicts, and returns "Unschedulable" if there is a conflict.
This is a required feature for the ReadWriteOncePod beta. See for more context:
https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2485-read-write-once-pod-pv-access-mode#beta
Node E2E tests do not run a scheduler, so the host exec pod must have
the `spec.nodeName` set explicitly.
Signed-off-by: David Porter <david@porter.me>
The ClusterIP allocator tries to reserve on part of the ServiceCIDR
to allocate static IPs to the Services.
The heuristic of the allocator to obtain the offset was taking into
account the whole range size, not the IPs available in the range, the
subnet address and the broadcast address for IPv4 are not available.
This caused that for CIDRs with 16 hosts, /28 for IPv4 and /124 for
IPv6, the offset calculated was higher than the max number of available
addresses on the allocator, causing this to panic.
Change-Id: I6c6f527b0a600b3612be37769e405b8fb3dd33a8
There are two runtime class tests which required the container runtime
config to include explicit configuration for `test-handler`. The current
logic skips these tests in non GCE environments. This skip is too strict
since the test is skipped in node e2e environments and in other
environments such as kind, which support running the test and also
configure `test-handler`.
Instead of skipping based on provider, add a new function
`NodeSupportsPreconfiguredRuntimeClassHandler` which examines the
underlying container runtime config and checks if the config includes
`test-handler`. The check is a bit brittle since it assumes container
runtime config paths, but it is a net improvement over skipping the test
entirely on non GCE environments.
This results in the test working in the common test environments, namely
GCE kube-up, node e2e, and kind.
Signed-off-by: David Porter <david@porter.me>
When the e2e_node/checkpoint_container.go test was introduced no CRI
implementation supported the new CheckpointContainer RPC yet.
With the release of CRI-O 1.25 the CheckpointContainer is implemented
and the test has been extended to see if the content of the checkpoint
is as expected.
The test is skipped if the ContainerCheckpoint feature gate is disabled
or if the CRI implementation does not support the CheckpointContainer
RPC.
Signed-off-by: Adrian Reber <areber@redhat.com>
we are saving this information in an env variable `KUBE_INTEGRATION_ETCD_URL`
So just pick it up from there when needed. Currently when someone uses
framework.RunCustomEtcd directly, the global variable is *not* set and the
code that uses `GetEtcdURL` returns empty string.
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
Since we need to gather kubelet metrics for CPU Manager and Topology
Manager, renaming this function to a more generic name.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
The GlogSetter method is used by three components to change verbosity at
runtime through HTTP APIs. This used to work only for text output with klog
calls, but not for text output through the klog logger or for JSON output.
Now loggers can also provide a callback for changing their verbosity at
runtime. Implementing that implies that the Create factory method has to be
extended, which is an API break for the Go package, but not an API break for
the configuration file and command line flags, which is what matters for the
"api/v1" component API.
The condition methods will eventually all take a context. Since we
have been provided one, alter the accepted condition type and
change the four references in tree.
Collers of ExponentialBackoffWithContext should use a condition
aware function (ConditionWithContextFunc). If the context can be
ignored the helper ConditionFunc.WithContext can be used to convert
an existing function to the new type.
Using WaitTimeoutForPodRunningInNamespace followed by ExpectError was not very
precise (any error passed the check, not just the expected timeout) and
hard to read. Now the test's expectation is spelled out explicitly: the pod
must stay in pending.
These helper functions can be used in combination with
omega.Eventually/Consistently to implement polling of objects that is aware of
Kubernetes apiserver conventions:
- retry on certain errors instead of giving up,
with "not found" handling decided by the caller (may or may not
be fatal, depending on the test)
- sleep if requested by apiserver
The E2E framework contains several functions which only differ in how they get
name and namespace: from an API object (WaitForPodRunningInNamespace) or as
separate parameters (WaitTimeoutForPodRunningInNamespace).
NamespacedName and the NamedObject interface enable writing helper functions
that can be called with both an API object (like *v1.Pod, which implements
Object and thus NamedObject) and name+namespace string (via
NamespacedName).
The other advantage of NamespacedName is that the order of name and namespace
parameter cannot be mixed up.
NamespacedName was derived from k8s.io/apimachinery/pkg/types.NamespacedName. A
separate type in the framework package was chosen a) to avoid additional
imports in test code and b) because the interface might not be suitable for
k8s.io/apimachinery.
* Improving the output of tests in case of error
* Better error message
Also, the condition in the second case was reversed
* Fixing 2 tests whose condition was inverted
* Again I got the conditions wrong
* Sorry for the confusion
* Improved error messages on failures
This significantly reduces the surface area of the fieldmanager package
by hiding all the private "managers" objects, as well as the interface
that was made specifically for these. There is no reason to configure
these.
Primarily this protects against accidentally polling with the default interval
of 10ms. Setting these defaults may also make some tests simpler because they
don't need to override the defaults.
Various different tests all have their own poll intervals. As a start towards
consolidating that, the interval from test/e2e/framework/pod (as one of the
most common cases for polling) is moved into the framework.
Changing other helper packages and tests needs to follow.
This consolidates timeout handling. In the future, configuration of all
timeouts via a configuration file might get added. For now, the same three
legacy command line flags for the timeouts that get moved continue to be
supported.
All usage of builder pattern is convertible to cpuset.New()
with the same or fewer lines of code.
Migrate Builder.Add to a private method of CPUSet, with a comment
that it is only intended for internal use to preserve immutable
propoerty of the exported interface.
This also removes 'require' library dependency, which avoids
non-standard library usage.
In 'set', conversions to slice are done also, but with different names:
ToSliceNoSort() -> UnsortedList()
ToSlice() -> List()
Reimplement List() in terms of UnsortedList to save some duplication.
Removes exit/fatal from cpuset library.
Usage in podresources test was not necessary.
Library reference in cpu_manager_test was moved to a local function, and
converted to use e2e test framework error catching.
The terminationGracePeriodSeconds field was capitalized wrongly, leading
to unknown field warning when running deployment lifecycle e2e:
STEP: patching the Deployment @ 01/06/23 10:59:25.969
W0106 10:59:25.978292 433473 warnings.go:70] unknown field "spec.template.spec.TerminationGracePeriodSeconds"
The issue was undiscovered because the patched value didn't change, so
the test would succeed anyway.
Signed-off-by: Quan Tian <qtian@vmware.com>
Before, in RunPostFilterPlugins, we didn't distinguish between unschedulable and unresolvable
because we only have one postFilterPlugin by default, now, we have at least two, we should
make sure that once a postFilterPlugin returns unresolvable, we'll return directly
Signed-off-by: Kante Yin <kerthcet@gmail.com>
If we were to add new fields in TimeoutContext, the current users of
NewFrameworkWithCustomTimeouts might run into failures unless they get modified
to also set those new fields. This is error-prone.
A better approach is to let users of NewFrameworkWithCustomTimeouts override
fields by setting just those and use the normal defaults for the others.
Ginkgo relies on all workers defining all tests in exactly the same order. This
wasn't guaranteed for these tests, with the result that some tests might have
been executed more than once and others not at all when running in parallel.
This was noticed when some of these tests started to flake and then were
reported both as failure and success, as if they had been retried.
In v1.26.0, these tests only used the timeout context while waiting for a CSI
call. This restores that behavior, just in case that it is relevant. No test
flakes are known because of this.
The intend of timeout handling (for the entire "It" and not just a few calls)
becomes more obvious and simpler when using ginkgo.NodeTimeout as decorator.
It doesn't make sense for the E2E framework to have command line options that
don't do anything because then all test suites built with the framework inherit
those options.
For -list-images and -list-conformance-tests the solution is to move the
implementation into the framework (-list-images) respectively move the flag
into test/e2e (-list-conformance-tests).
The placement was decided based on the observation that image patching is
common functionality while conformance testing is specific to one test suite.
The "[sig-network] DNS HostNetwork should resolve DNS of partial qualified
names for services on hostNetwork pods with dnsPolicy:
ClusterFirstWithHostNet" test assumes that a service named "kube-dns"
exists in the "kube-system" namespace. This assumption is valid if the
cluster was configured using kubeadm, but the assumption may be invalid
otherwise.
As the test uses dnsPolicy: ClusterFirst (as opposed to dnsPolicy: None),
it does not need to specify the name server in dnsConfig. Omitting
dnsConfig.nameservers obviates the need to look up the service.
Follow-up to commit add4652352.
* test/e2e/network/dns.go: Don't look up or use the kube-dns cluster IP
address as it might not exist on clusters that were not configured using
kubeadm.
Bring back the number of test spec which was dropped earlier.
It's now available in the reporting node of `ReportBeforeSuite` by extracting
the number from report.PreRunStats.SpecsThatWillBeRun.
Signed-off-by: Dave Chen <dave.chen@arm.com>
The old tests were no longer passing with Ginkgo v2.5.0. Instead of keeping the
old approach of checking recorded spec results, now the tests actually cover
what we care about most: the results recorded in JUnit.
This also gets rid of having to repeat the stack backtrace twice (once as part
of the output, once for the separate backtrace field).
All information that we want will be written into the failure XML element's
data. We don't need the message tag and don't want it because our
tools (kettle, testgrid, spyglass) would then just concatenate the two strings.
This gets implemented for us by Ginkgo. However, truncating the failure message
is not supported there at the moment. It's unclear how important that is,
therefore this (recently added feature) gets removed.
The NodePort functionality can be tested within the cluster.
Testing from outside the cluster assumes that there is connectivity
between the e2e.test binary and the cluster under test, that is not
always true, and in some cases is exposed to external factors or
misconfigurations like wrong routes or firewall rules that impact
on the test.
Change-Id: Ie2fc8929723e80273c0933dbaeb6a42729c819d0
* Wire generic context to better handle timeout
* Add integration test for wait timeout
* kubectl wait: Fix integration test always passing issue
Currently, `kubectl wait` integration test always passes even if
it gets an error. Problem is object check is done after errexit is
turned off.
This PR redirects error to output and correctly assures that
object is expected status and if it is not, test should fail.
The background goroutine was started with the context from ginkgo.BeforeEach,
which then led to "context canceled" errors. While at it, the entire goroutine
start/stop gets moved into the BeforeEach and simplified.
These tests create their own pods, however, they were inside a block
that was always creating a pod, that was not used later on the tests,
but can influence on the conditions asserted to succeed the test.
Change-Id: I3bb9a0f123fb0766d75934ef8e197f92e3f5f3b8
Using the ctx of the ginkgo.BeforeEach in callbacks that are invoked after the
BeforeEach is done causes "context canceled" errors. Previously, this code used
context.TODO(). The best solution is to create a new context and cancel it
during test cleanup, then that context can be used for the API calls and as
stop channel.
After adding error checking in df5d84ae81, the "[sig-cli] Kubectl client
Simple pod should return command exit codes [Slow] running a failing command
without --restart=Never, but with --rm" test was found to time out.
Doubling the timeout might help. Alternatively, the entire
WaitForPodToDisappear could get removed, which would make this scenario similar
to the others which also don't wait.
the test uses a BeforeEach block to create a pod with the name defined
but the simplePodName variables, however, some test use the value of
the variable directly.
To avoid future problems use the variable name instead of the value on
all tests.
Change-Id: I21a01019d91fe5ae7e35566184420001978ce355
All code must use the context from Ginkgo when doing API calls or polling for a
change, otherwise the code would not return immediately when the test gets
aborted.
* Add tracker types and tests
* Modify ResourceEventHandler interface's OnAdd member
* Add additional ResourceEventHandlerDetailedFuncs struct
* Fix SharedInformer to let users track HasSynced for their handlers
* Fix in-tree controllers which weren't computing HasSynced correctly
* Deprecate the cache.Pop function
The Feature:SCTPConnectivity tests cannot run at the same time as the
"X doesn't cause sctp.ko to be loaded" tests, since they may cause
sctp.ko to be loaded. We had dealt with this in the past by marking
them [Disruptive], but this isn't really fair; the problem is more
with the sctp.ko-checking tests than it is with the SCTPConnectivity
tests. So make them not [Disruptive] and instead make the
sctp.ko-checking tests be [Serial].
There were two SCTP tests grouped together in
test/e2e/network/service.go, but one of them wasn't a service test...
so move the SCTP service test to be grouped with the other service
tests, and the SCTP hostport tests to be grouped with other
non-service tests.
The SCTP HostPort test was checking that creating a pod with an SCTP
HostPort would create a certain iptables rule, but the handling of
HostPorts is now up to CRI, not kubelet, so kubernetes e2e cannot
assume it will implement the feature in any specific way.
(The test still ensures that (a) the apiserver accepts SCTP HostPorts,
and (b) neither kubelet nor the runtime causes the SCTP kernel module
to be loaded as part of creating a pod with an SCTP HostPort.)
We had a test that creating a Service with an SCTP port would create
an iptables rule with "-p sctp" in it, which let us test that
kube-proxy was doing vaguely the right thing with SCTP even if the e2e
environment didn't have SCTP support. But this would really make much
more sense as a unit test.
Windows ComputerNames cannot exceed 15 characters. This causes a few tests to fail
when the node names exceed that limit. Additionally, the checks should be case
insensitive.
The Gingo v2 time suffix is hh:mm:ss without the .xyz sub-second details if the
time stamp happens to land exactly on a second.
This change fixes test flakes like the following:
-STEP: Building a namespace api object, basename test-namespace
+STEP: Building a namespace api object, basename test-namespace 12/13/22 11:43:53
--- FAIL: TestCleanup (36.79s)
ginkgo.DeferCleanup has multiple advantages:
- The cleanup operation can get registered if and only if needed.
- No need to return a cleanup function that the caller must invoke.
- Automatically determines whether a context is needed, which will
simplify the introduction of context parameters.
- Ginkgo's timeline shows when it executes the cleanup operation.
- use `ginkgo.DeferCleanup` instead of clean up in the AfterEach block
- encourage use of ginkgo by not extending expect.go
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Add a test case with a DaemonSet behind a simple load balancer whose
address is being constantly hit via HTTP requests.
The test passes if there are no errors when doing HTTP requests to the
load balancer address, during DaemonSet `RollingUpdate` operations.
Signed-off-by: Ionut Balutoiu <ibalutoiu@cloudbasesolutions.com>
Some node e2e tests check for expected number of pods running
on the node to verify the correct state of that node after running
test scenarios. An example of such a check is in the device plugin
end to end test here: [1].
If the node is not left in a clean state after an e2e test finishes
running, it can lead to flaky tests because the node might have
unexpected pods running on the node.
In order to avoid that, we make sure that the test pods are
cleaned up after the test runs.
[1]: https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/device_plugin_test.go#L189-L190
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
wait.Until catches panics and logs them, which leads to confusing
output. Besides, the test is written so that failures must get reported to the
main goroutine.
Looking up the expected nodes in the goroutine raced with the test making
changes to the configuration. When doing (unrelated?) changes, the test started
to fail:
Oct 23 15:47:03.092: INFO: Unexpected error:
<*errors.errorString | 0xc001154c70>: {
s: "no subset of available IP address found for the endpoint test-rolling-update-with-lb within timeout 2m0s",
}
Oct 23 15:47:03.092: FAIL: no subset of available IP address found for the endpoint test-rolling-update-with-lb within timeout 2m0s
Now that everything is connected to a per-test context, the gRPC server might
encounter an error before it gets shut down normally. We must not panic in that
case because it would kill the entire Ginkgo worker process. This is not even
an error, so just log it as info message.
The wrapper can be used in combination with ginkgo.DeferCleanup to ignore
harmless "not found" errors during delete operations.
Original code suggested by Onsi Fakhouri.
It is set in all of the test/e2e* suites, but not in the ginkgo output
tests. This check is needed before adding a test case there which would trigger
this nil pointer access.
Adding the "context" import in the previous commit must get compensated by
removing one of the blank lines in the output unit tests, otherwise the stack
backtrace don't match expectations.
Adding "ctx" as parameter in the previous commit led to some linter errors
about code that overwrites "ctx" without using it.
This gets fixed by replacing context.Background or context.TODO in those code
lines with the new ctx parameter.
Two context.WithCancel calls can get removed completely because the context
automatically gets cancelled by Ginkgo when the test returns.
Every ginkgo callback should return immediately when a timeout occurs or the
test run manually gets aborted with CTRL-C. To do that, they must take a ctx
parameter and pass it through to all code which might block.
This is a first automated step towards that: the additional parameter got added
with
sed -i 's/\(framework.ConformanceIt\|ginkgo.It\)\(.*\)func() {$/\1\2func(ctx context.Context) {/' \
$(git grep -l -e framework.ConformanceIt -e ginkgo.It )
$GOPATH/bin/goimports -w $(git status | grep modified: | sed -e 's/.* //')
log_test.go was left unchanged.
Endpoints generated by the endpoints controller are in the canonical
form, however, custom endpoints can not be in canonical format
(there was a time they were canonicalized in the apiserver, but this
caused performance issues because the endpoint controller kept
updating them since the created endpoint were different than the
stored one due to the canonicalization)
There are cases where a custom endpoint may generate multiple slices
due to the controller, per example, when the same address is present
in different subsets.
The endpointslice mirroring controller should canonicalize the
endpoints subsets before start processing them to be consistent
on the slices generated, there is no risk of hotlooping because
the endpoint is only used as input.
Change-Id: I2a8cd53c658a640aea559a88ce33e857fa98cc5c
This ensures that the daemonset controller updates daemonset statuses in
a best-effort manner even if syncDaemonSet fails.
In order to add an integration test, this also replaces
`cmd/kube-apiserver/app/testing.StartTestServer` with
`test/integration/framework.StartTestServer` and adds
`setupWithServerSetup` to configure the admission control of the
apiserver.
Currently, if user executes `kubectl scale --dry-run`, output has no
indicator showing that this is not applied in reality.
This PR adds dry run suffix to the output as well as more integration
tests to verify it.
`kubectl scale` calls visitor two times. Second call fails when
the piped input is passed by returning an
`error: no objects passed to scale` error.
This PR uses the result of first visitor and fixes that piped
input problem. In addition to that, this PR also adds new
scale test to verify.
`kubectl exec` command supports getting files as inputs. However,
if the file contains multiple resources, it returns unclear error message;
`cannot attach to *v1.List: selector for *v1.List not implemented`.
Since `exec` command does not support multi resources, this PR
handles that and returns descriptive error message earlier.
One of the cpumanager tests doesn't remove the pod
that got created during the test.
This causes pollution of other tests and failures
from time to time (depends on the test execution order).
In order to defalke the tests, we should delete the pod
and wait for it to be completely remove.
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
This introduces `singularNameProvider`. This provider will be used
by core types to have their singular names are defined in discovery
endpoint. Thanks to that, core resources singular name always have
higher precedence than CRDs shortcuts or singular names.
This adds new integration tests to test shortnames and
singular names are expanding to correct resources.
In this case, core types have always higher precendence than
CRDs.
This change will leverage the new PreFilterResult
to reduce down the list of eligible nodes for pod
using Bound Local PVs during PreFilter stage so
that only the node(s) which local PV node affinity
matches will be cosnidered in subsequent scheduling
stages.
Today, the NodeAffinity check is done during Filter
which means all nodes will be considered even though
there may be a large number of nodes that are not
eligible due to not matching the pod's bound local
PV(s)' node affinity requirement. Here we can
reduce down the node list in PreFilter to ensure that
during Filter we are only considering the reduced
list and thus can provide a more clear message to
users when node(s) are not available for scheduling
since the list only contains relevant nodes.
If error is encountered (e.g. PV cache read error) or
if node list reduction cannot be done (e.g. pod uses
no local PVs), then we will still proceed to consider
all nodes for the rest of scheduling stages.
Signed-off-by: Yibo Zhuang <yibzhuang@gmail.com>
In the Dynamic Resource allocation example specs, the claim
parameter name specified was inconsistent.
This commit fixes that with a better/more consistent name,
which is used to define the configmap and referenced in
the `ResourceClaimTemplate` spec.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
A recent PR [1] updated the image versions we use for E2E tests. However, the ``windows-nanoserver`` image is meant to be in a private authenticated registry: ``gcr.io/authenticated-image-pulling/windows-nanoserver``, which requires credentials to pull images from it. This image is required by the ``[sig-node] Container Runtime blackbox test when running a container with a new image should be able to pull from private registry with secret [NodeConformance]`` test for Windows. The ``v3`` image does not exist, there's no automatic promotion process for that registry. Previously, it was built and pushed manually.
Because of this, the https://testgrid.k8s.io/sig-windows-signal#capz-windows-containerd-master jobs have started to fail.
Reverts the image version to ``v1``.
[1] https://github.com/kubernetes/kubernetes/pull/113900
Many clusters block direct requests from internal resources to the nodes
external IPs as best practice. All accesses from internal resources that
want to access resources running on nodes go through load balancers,
nodes being on private or public subnets. Let's prefer internal IPs
first, so the tests can work even when there are security group rules
present blocking requests to the external IPs.
We should not require ExternalIP for Conformance, but should keep
testing ExternalIPs in sig network.
Signed-off-by: Rafael Fonseca <r4f4rfs@gmail.com>
These instructions bring up a kind cluster with containerd 34d078e99, the
latest commit from the main branch. This version of containerd has
support for CDI.
The driver can be used manually against a cluster started with
local-up-cluster.sh and is also used for E2E testing. Because the tests proxy
connections from the nodes into the e2e.test binary and create/delete files via
the equivalent of "kubectl exec dd/rm", they can be run against arbitrary
clusters. Each test gets its own driver instance and resource class, therefore
they can run in parallel.
Add volumePath parameter to all disruptive checks, so subpath tests can use
"/test-volume" and disruptive tests can use "/mnt/volume1" for their
respective Pods.
This adds a new resource.k8s.io API group with v1alpha1 as version. It contains
four new types: resource.ResourceClaim, resource.ResourceClass, resource.ResourceClaimTemplate, and
resource.PodScheduling.
This removes WaitTimeoutForPodNoLongerRunningOrNotFoundInNamespace
introduced in f2b9479f8e and changes
the test to use goroutines to speed up the cleanups.
Most CI jobs run an OS that does not support SELinux, therefore tests that
need it should be skipped by default.
* [Feature:SELinux] marks tests that need SELinux (for any feature)
* [Feature:SELinuxMountReadWriteOncePod] marks tests that need
SELinuxMountReadWriteOncePod alpha gate enabled.
Currently, all SELinux tests have both, but it will change in the future.
Also make some design changes exposed in testing and review.
Do not remove the ambiguous old metric
`apiserver_flowcontrol_request_concurrency_limit` because reviewers
though it is too early. This creates a problem, that metric can not
keep both of its old meanings. I chose the configured concurrency
limit.
Testing has revealed a design flaw, which concerns the initialization
of the seat demand state tracking. The current design in the KEP is
as follows.
> Adjustment is also done on configuration change … For a newly
> introduced priority level, we set HighSeatDemand, AvgSeatDemand, and
> SmoothSeatDemand to NominalCL-LendableSD/2 and StDevSeatDemand to
> zero.
But this does not work out well at server startup. As part of its
construction, the APF controller does a configuration change with zero
objects read, to initialize its request-handling state. As always,
the two mandatory priority levels are implicitly added whenever they
are not read. So this initial reconfig has one non-exempt priority
level, the mandatory one called catch-all --- and it gets its
SmoothSeatDemand initialized to the whole server concurrency limit.
From there it decays slowly, as per the regular design. So for a
fairly long time, it appears to have a high demand and competes
strongly with the other priority levels. Its Target is higher than
all the others, once they start to show up. It properly gets a low
NominalCL once other levels show up, which actually makes it compete
harder for borrowing: it has an exceptionally high Target and a rather
low NominalCL.
I have considered the following fix. The idea is that the designed
initialization is not appropriate before all the default objects are
read. So the fix is to have a mode bit in the controller. In the
initial state, those seat demand tracking variables are set to zero.
Once the config-producing controller detects that all the default
objects are pre-existing, it flips the mode bit. In the later mode,
the seat demand tracking variables are initialized as originally
designed.
However, that still gives preferential treatment to the default
PriorityLevelConfiguration objects, over any that may be added later.
So I have made a universal and simpler fix: always initialize those
seat demand tracking variables to zero. Even if a lot of load shows
up quickly, remember that adjustments are frequent (every 10 sec) and
the very next one will fully respond to that load.
Also: revise logging logic, to log at numerically lower V level when
there is a change.
Also: bug fix in float64close.
Also, separate imports in some file
Co-authored-by: Han Kang <hankang@google.com>
This change enables hot reload of encryption config file when api server
flag --encryption-provider-config-automatic-reload is set to true. This
allows the user to change the encryption config file without restarting
kube-apiserver. The change is detected by polling the file and is done
by using fsnotify watcher. When file is updated it's process to generate
new set of transformers and close the old ones.
Signed-off-by: Nilekh Chaudhari <1626598+nilekhc@users.noreply.github.com>
This change adds a flag --encryption-provider-config-automatic-reload
which will be used to drive automatic reloading of the encryption
config at runtime. While this flag is set to true, or when KMS v2
plugins are used without KMS v1 plugins, the /healthz endpoints
associated with said plugins are collapsed into a single endpoint at
/healthz/kms-providers - in this state, it is not possible to
configure exclusions for specific KMS providers while including the
remaining ones - ex: using /readyz?exclude=kms-provider-1 to exclude
a particular KMS is not possible. This single healthz check handles
checking all configured KMS providers. When reloading is enabled
but no KMS providers are configured, it is a no-op.
k8s.io/apiserver does not support dynamic addition and removal of
healthz checks at runtime. Reloading will instead have a single
static healthz check and swap the underlying implementation at
runtime when a config change occurs.
Signed-off-by: Monis Khan <mok@microsoft.com>
so that it explicitly describe group information defined in the
container image will be kept. This also adds e2e test case of
SupplementalGroups with pre-defined groups in the container
image to make the behaivier clearer.
- New API field .spec.schedulingGates
- Validation and drop disabled fields
- Disallow binding a Pod carrying non-nil schedulingGates
- Disallow creating a Pod with non-nil nodeName and non-nil schedulingGates
- Adds a {type:PodScheduled, reason:WaitingForGates} condition if necessary
- New literal SchedulingGated in the STATUS column of `k get pod`
In the PR https://github.com/kubernetes/kubernetes/pull/86139, two more lifecycle hook tests (poststart / prestop)
were added using HTTPS. They are similar with the existing HTTP tests.
However, this causes failures on Windows due to how networking
works there. We previously fixed this in the HTTP tests via f9e4a015e2.
This commit applies the same fix on the lifecycle hook HTTPS tests.
Signed-off-by: Ionut Balutoiu <ibalutoiu@cloudbasesolutions.com>
The cloud-provider and the e2e test were racing on deleting the
cloud resources.
Also, the cloud-provider should not leave orphan resources, that will
be detected by the job and fail, thus we should not have additional
logic to cleanup masking these errors.
Removed the unit tests that test the cases when the MixedProtocolLBService feature flag was false - the feature flag is locked to true with GA
Added an integration test to test whether the API server accepts an LB Service with different protocols.
Added an e2e test to test whether a service which is exposed by a multi-protocol LB Service is accessible via both ports.
Removed the conditional validation that compared the new and the old Service definitions during an update - the feature flag is locked to true with GA.
The original intention (adding more information for later analysis)
is probably obsolete because there is no code which does anything
with the extended error.
The code in upgrade_suite.go collected it in an in-memory JUnit report, but
then didn't do anything with that field. The code also wouldn't work for
failures detected by Ginkgo itself, like the upcoming timeout handling. If the
upgrade suite needs the information, it probably should get it from Gingko with
a ReportAfterSuite call instead of depending in some fragile interception
mechanism.
Tests scheduler enforcement of the ReadWriteOncePod PVC access mode.
- Creates a pod using a PVC with ReadWriteOncePod
- Creates a second pod using the same PVC
- Observes the second pod fails to schedule because PVC is in-use
- Deletes the first pod
- Observes the second pod successfully schedules
Some of our API types contain fields that get rendered very poorly by
gomega.format.Object because they contain lots of internal information, for
example CreationTimestamp. As a result, dumping full API object typically gets
truncated.
What we want is a representation that is a) multi-line (in contrast to the
stringer implemented by our types) and b) drops empty fields where it
was defined that this is okay.
The normal YAML representation fits that requirement. We just need to teach
gomega how and when to do that. This cannot be done for each type through a
generated GomegaString method (lots of code, additional dependency in public
API on YAML encoder), but it can be done inside tests by adding a formatting
handler (new gomega feature).
Our tooling cannot handle very long failure messages well:
- when unfolding a test in the spyglass UI, it fills the entire screen
- failure correlation for http://go.k8s.io/triage has resource constraints
We cannot enforce that all tests only produce short failure messages and even
if we could, depending on the test failure, including more information may be
useful to understand it.
To achieve both goals (summary for correlation and overview, all details
available when digging deeper), too longer failure messages now get truncated,
with the full message guaranteed to be captured in the test output.
"Too long" is arbitrarily chosen to be similar to the gomega.MaxLength because
that has been a limit for failure message size in the past.
When gomega.format exceeds the default size of 4000, it truncates and prints:
Gomega truncated this representation as it exceeds 'format.MaxLength'.
Consider having the object provide a custom 'GomegaStringer' representation
or adjust the parameters in Gomega's 'format' package.
Learn more here: https://onsi.github.io/gomega/#adjusting-output
These instructions don't help the user of the e2e.test binary unless we provide
a command line flag.
Commit 99e9096034 was only supposed to remove the
FailfWithOffset function, but it also changed the behavior by skipping one
additional stack frame. That makes no sense and is inconsistent with Fail, which
also logs the direct caller.
The Windows Server Core images are quite large (~2GB each), and pulling
it for multiple build jobs / E2E images is inefficient, especially if
have to build for multiple OS versions.
The windows-servercore-cache image is meant to simply cache the Windows files
we need from the Windows Server core images, so we can pull the small cache image
instead of the entire image. It is never meant to be a promotable image,
the version is not meant to be bumped.
The other images (e.g.: agnhost) rely on the version 1.0 images.
In the `should correctly account for terminated pods after restart`, the
test first creates a set of `restartNever` pods, followed by a set of
`restartAlways` pods. Both the `restartNever` and `restartAlways` pods
request an entire CPU. As a result, the `restartAlways` pods will not be
admitted, if the `restartNever` pods did not terminate yet.
Depending on the timing/how fast the pods terminate, the test can pass
sometimes fail which results in flakes. To de-flake the test, the test
should wait until the `restartNever` pods enter a terminal `Succeeded`
phase, before creating the `restartAlways` pods.
To do this, generalize the function `waitForPods` to accept a pod
condition (`testutils.PodRunningReadyOrSucceeded`, or
`testutils.PodSucceeded`). Also introduce a new "Succeeded" pod
condition, so the test can explicitly wait until the pods enter the
Succeeded phase.
Signed-off-by: David Porter <david@porter.me>
Currently, when running node e2e it's not possible to use the ginkgo `--repeat`
flag to run the test suite multiple times. This is useful when debugging tests
and ensuring they are not flaky by re-running them several times. Currently if
using `--repeat` ginkgo flag, the 2nd run of the test will fail due to kubelet
not starting with message like:
```
Failed to start transient service unit: Unit kubelet-20221020T040841.service already exists.
```
This is because during the test startup, kubelet is started as a transient unit
file via `systemd-run`. The unit is started with the `--remain-after-exit` flag
to ensure that the unit will remain even if the kubelet is restarted. The test
suite currently uses `systemd kill` command to stop kubelet. This works fine for
stopping the kubelet, but on the second run, when `systemd-run` is used to start
systemd unit again it will fail because the unit already exists. This is because
`systemd kill` will not delete the systemd unit, only send SIGTERM signal to it.
To fix this, add `unitName` as a field to the `server` struct. When
kubelet server is constructed, set the unit name. As part of e2e test
termination, in `E2EServices.Stop()``, stop the kubelet systemd unit. By
stopping the kubelet systemd unit, systemd will delete the systemd
transient unit, allowing it to be created and started again in a
subsequent e2e run.
Signed-off-by: David Porter <david@porter.me>
The original intention was to address "frustration of end users running the e2e
suite is that they take a significant amount of time and it is difficult to
gauge progress".
But Ginkgo's output is different now than it was in Kubernetes 1.19. If users
want to see progress, then "ginkgo --progress" might provide enough
information.
Printing to os.Stdout doesn't work as intended anyway when output redirection
is enabled (the default for parallel runs) and causes these JSON snippets to
appear as "show stdout" for each failed test in a Prow job, which is
distracting.
Tests should accept a context from Ginkgo and pass it through to all functions
which may block for a longer period of time. In particular all Kubernetes API
calls through client-go should use that context. Then if a timeout occurs,
the test returns immediately because everything that it could block on will
return.
Cleanup code then needs to run in a separate Ginkgo node, typically
DeferCleanup, which ensures that it gets a separate context which has not timed
out yet.
Align the behavior of HTTP-based lifecycle handlers and HTTP-based
probers, converging on the probers implementation. This fixes multiple
deficiencies in the current implementation of lifecycle handlers
surrounding what functionality is available.
The functionality is gated by the features.ConsistentHTTPGetHandlers feature gate.
The device plugin test in https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-serial-containerd
has been flaky for a while now when it runs on the test infrastructure.
Locally running this test resulted in test passing without issues.
Based on the existing logs, it is not clear why podresource
API endpoint is returning 3 pods rather than the expected
two pods (device plugin pod and the test pod requesting
devices). For more clarity and debugaability on why an
addtional pod seems to be appearing we expose the output
from podresource API endpoint.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
* add superuser fallback to authorizer
* change the order of authorizers
* change the order of authorizers
* remove the duplicate superuser authorizer
* add integration test for superuser permissions
e2e test validates the following 3 endpoints
- patchCoreV1NamespacedResourceQuotaStatus
- readCoreV1NamespacedResourceQuotaStatus
- replaceCoreV1ResourceQuotaForAllNamespacesStatus
This addresses a problem caused by
https://github.com/kubernetes/kubernetes/pull/112043: because the AfterEach
which invokes AllNodesReady always runs, including tests that skipped early,
those tests ran into a nil pointer access. This increased the size of log
files. The tests still worked.
Adds two tests for the enforcement of the ReadWriteOncePod
PersistentVolume access mode.
1. Tests that when two Pods are scheduled that reference the same
ReadWriteOncePod PVC, the latter-scheduled Pod will be marked
unschedulable because the PVC is in-use.
2. Tests that when two Pods are scheduled on the same node (setting
Pod.Spec.NodeName to bypass scheduling for the second Pod), the
latter Pod will fail to start because the PVC is already mounted on
the Node.
Included are changes to update the hostpath CSI driver to accept new CSI
access modes. Its sidecar containers are already at supported versions
for ReadWriteOncePod and don't need updating. The GCP PD CSI driver does
not yet support the new CSI access modes, but its sidecar containers are
at supported versions and so the feature will work.
To support ReadWriteOncePod, the following CSI sidecars must be updated
to these versions or greater:
- csi-provisioner:v3.0.0+
- csi-attacher:v3.3.0+
- csi-resizer:v1.3.0+
For more details, see:
https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/2485-read-write-once-pod-pv-access-mode/README.md
The reason for the issue is that the metrics were bumped before the
final job status update. In case the update failed the path was
repeated by the next syncJob leading to double-counting of the metrics.
The solution is to delay recording metrics and broadcasting events
after the job status update succeeds.
This change updates the API server code to load the encryption
config once at start up instead of multiple times. Previously the
code would set up the storage transformers and the etcd healthz
checks in separate parse steps. This is problematic for KMS v2 key
ID based staleness checks which need to be able to assert that the
API server has a single view into the KMS plugin's current key ID.
Signed-off-by: Monis Khan <mok@microsoft.com>
The change made in https://github.com/kubernetes/kubernetes/pull/112644
resulted in an update to the rejection message. In the memory manager
node e2e test, we still checked against the old expected error message
giving the impression that the pod succeeded to run even though it failed
as expected mainly because the check wasn't performed correctly.
In this patch, we update to the correct rejection message to make sure
that the memory manager is no longer failing.
NOTE: This test is supposed to run on multi NUMA systems and if the
underlying node does not have multi NUMA nodes, the test is skipped
which is what happens in upstream test infrastructure as it is mainly
composed of single NUMA nodes. Because of this, this test failure
wasn't evident via testgrid.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
It turned out to be unreliable (see
https://github.com/kubernetes/kubernetes/issues/112834). Encoding the data
inside the command as input for base64 is a workaround that is fine for small
amounts of data. It becomes less efficient and/or unusable for large amounts.
Mark remotecommand.Executor as deprecated and related modifications.
Handle crash when streamer.stream panics
Add a test to verify if stream is closed after connection being closed
Remove blank line and update waiting time to 1s to avoid test flakes in CI.
Refine the tests of StreamExecutor according to comments.
Remove the comment of context controlling the negotiation progress and misc.
Signed-off-by: arkbriar <arkbriar@gmail.com>
add tests to ensure the podresources metrics are exposed,
and basic sanity tests for their values.
Signed-off-by: Francesco Romani <fromani@redhat.com>
This adds test coverage for NewFrameworkExtensions and shows better how
BeforeEach callbacks are invokved. The unit test is not strictly about just the
cleanup operations anymore, but that's okay(ish).
The "todo" packages were necessary while moving code around to avoid hitting
cyclic dependencies. Now that any sub package can depend on the framework, they
are no longer needed and the code can be moved into the normal sub packages.
This helps getting rid of the ssh dependency. The same init package as for
dumping namespaces takes care of adding the functionality back to framework
instances.
This helps getting rid of the ssh dependency. The same init package as for
dumping namespaces takes care of adding the functionality back to framework
instances.
This reduces the size of the test/e2e/framework itself. Because it does not
gather metrics data anymore by default, E2E test suites must set their
callbacks function or set the original one by importing
"k8s.io/kubernetes/test/e2e/framework/todo/metrics/init".
This reduces the size of the test/e2e/framework itself. Because it does not
check nodes anymore by default, E2E test suites must set their own check
function or set the original one by importing
"k8s.io/kubernetes/test/e2e/framework/todo/node/init".
This reduces the size of the test/e2e/framework itself. Because it does not
dump anything anymore by default, E2E test suites must set their own dump
function or set the original one by importing
"k8s.io/kubernetes/test/e2e/framework/debug/init".
This will be used as mechanism for invoking some of the code which is currently
hard-coded in the framework once that code is placed in optional packages.
During the September 29th, 2022 SIG-Network meeting we decided to
demote the two affinity timeout conformance tests. This was because:
(a) there is no documented correct behavior for these tests other than
"what kube-proxy does"
(b) even the kube-proxy behavior differs depending on the backend implementation
of iptables, IPVS, or [win]userspace (and winkernel doesn't at all)
(c) iptables uses only srcip matching, while userspace and IPVS use srcip+srcport
(d) IPVS and iptables have different minimum timeouts and we had
to hack up the test itself to make IPVS pass
(e) popular 3rd party network plugins also vary in their implementation
Our plan is to deprecate the current affinity options and re-add specific
options for various behaviors so it's clear exactly what plugins support
and which behavior (if any) we want to require for conformance in the future.
Signed-off-by: Dan Williams <dcbw@redhat.com>
e2e test validates the following 3 endpoints
- listCoreV1LimitRangeForAllNamespaces
- patchCoreV1NamespacedLimitRange
- deleteCoreV1CollectionNamespacedLimitRange
* Update Url string to have only one slash
Signed-off-by: Akanksha Kumari <akankshakumari393@gmail.com>
* Trim / from Right in hostname
Signed-off-by: Akanksha Kumari <akankshakumari393@gmail.com>
Considering we have removed the glusterfs driver from the
repo, we no longer need the test server for CI. This commit
remove the same
Ref# https://github.com/kubernetes/kubernetes/pull/112015
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
- Moves kms proto apis to the staging repo
- Updates generate and verify kms proto scripts to check staging repo
Signed-off-by: Anish Ramasekar <anish.ramasekar@gmail.com>
* Add e2e tests for events command
* Run events tests as normal e2e instead conformance
Conformance tests are only for GA features. Since `kubectl events`
currently is in alpha stage, e2e tests for this command should be
run as standard e2e.
The kubectl delete -f commands in stop-kubemark.sh may get stuck if the
cluster setup is removed concurrently. It is not important for these
commands to succeed so a timeout of 5m is set.
Since we have upgraded to snapshot controller version to v6, the
snapshot tests looks to be failing in the testgrid. It has been mainly
because the latest version of snapshot controller stopped serving
v1beta1 APIs. The sidecar image versions in the tests also has to be
updated to make sure these are compatible.
This commit add missing RBAC rules for the controller as per the
latest version.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
When waiting for the default service account in a new namespace, not finding
one was reported as "unexpected error: timed out waiting for the condition"...
etcd only fully supports linux && amd64, the other architectures
and OS are only guaranteed to build, see:
https://etcd.io/docs/v3.5/op-guide/supported-platform/#support-tiers
Skip the test that use etcd on not well supported environment to
guarantee the stability of the test.
Getting a message about "pod ran to completion" is confusing when the pod
hasn't been able to start at all. The failed state now has a different message.
To address the previous ambiguity, the success state is described as "ran to
completion successfully".
When Ginkgo shows a BeforeEach/AfterEach/DeferCleanup, then it can only show
the source code where the callback was registered because there is no
description parameter. This can be improved by passing a custom CodeLocation.
Because a description like "set up framework" might not be enough, the source
code is still shown, too.
This is useful for running a driver on a subset of all ready nodes:
- use e2enode.GetBoundedReadySchedulableNodes with a suitable
maximum number of nodes to determine how much nodes are available
for a test
- define pod anti-affinity in the PodTemplate:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/instance: xxxxxxx
topologyKey: kubernetes.io/hostname
- set the ReplicaSetSpec.Replicas value to the number of nodes
If the control plane emits anything at the time when the test runs, for example
"unable to sync kubernetes service", the test breaks because that additional
output is unexpected.
Pulling the CreateKubeConfig function from the expensive to build
test/utils/apiserver package had a considerable impact on the overall build
time because that package depends on a lot of other packages.
Because only that one function is needed by the framework, that extra build
time can be avoided by moving it into its own package.
The framework.AddCleanupAction API was a workaround for Ginkgo v1 not invoking
AfterEach callbacks after a test failure. Ginkgo v2 not only fixed that, but
also added a DeferCleanup API which can be used to run some code if (and only
if!) the corresponding setup code ran. In several cases that makes the test
cleanup simpler.
This covers multiple facets of the current framework and of Ginkgo:
- Ginkgo output is verbose and includes detailed progress
messages (BeforeEach/AfterEach tracing).
- Namespace creation.
- Order of callback invocation.
This runs etcd and an apiserver using it inside the test process. The caller
can either use the ClientSet or the config file. More options might get added
in the future.
Co-author: Antonio Ojea <antonio.ojea.garcia@gmail.com>
When using By or some other Ginkgo output functions, Ginkgo v2 now adds a time
stamp at the end of the line that we need to ignore. Will become relevant when
testing more complete output.
For cleanup purposes the ginkgo.DeferCleanup is a better replacement for
f.AddAfterEach:
- the cleanup only gets executed when the corresponding setup code ran
and can use the same local variables
- the callback runs after the test and before the framework
deletes namespaces (as before)
- if one callback fails, the others still get executed
For the original purpose (https://github.com/kubernetes/kubernetes/pull/86177 "This is
very useful for custom gathering scripts.") it is now possible to use
ginkgo.AfterEach because it will always get executed. Just beware that its
callbacks run in first-in-first-out order.
In contrast to ginkgo.AfterEach, ginkgo.DeferCleanup runs the callback in
first-in-last-out order. Using it makes the following test code work as
expected:
f := framework.NewDefaultFramework("some test")
ginkgo.AfterEach(func() {
// do something with f.ClientSet
})
Previously, f.ClientSet was already set to nil by the framework's cleanup code.
After updating gRPC in node-driver-registrar from v1.40.0 to v1.47.0 the
behavior of gRPC change in a way such that it no longer detected the
single-sided closing of the stream as a loss of connection. This caused gRPC in
the e2e.test to get stuck, possibly in a Read or Write for the HTTP stream
because those have neither a context nor a timeout.
Changing the connection handling so that all active connections are tracking in
the listener and closing them when the listener gets closed fixed this problem.
Some scripts and tools still relied on the deprecated flags, the ones
which are about to be removed.
This is intentionally not a complete removal of all those flags in the entire
repo. This would lead to much more code churn also in places where commands
still accept the flags because they use klog directly.
The custom progress reporter gets invoked via ginkgo.ReportAfterEach after each
test. The problem was that the e2e framework unconditionally enables Ginkgo's
-progress output which shows execution of all nodes, including this
ReportAfterEach. The effect were over 1000 lines of useless output at the start
of a test run while skipping disabled tests.
The solution is to tell Ginkgo that the ReportAfterEach isn't meant to be
reported.
This change updates TestAggregatedAPIServer and the related test
server wiring to exercise the full network path between the Kube API
server and the aggregated API server. We now assert that the wardle
API service and Kube API server discovery endpoints are fully healthy.
CRUD operations are performed through the Kube API server to the
wardle API server.
Signed-off-by: Monis Khan <mok@microsoft.com>
Contextual logging cannot be enabled manually because there is no feature gate
flag. Enabling the feature unconditionally:
- should be low risk for E2E testing
- if it fails, we want to know
- is useful to get better log output from code which already supports it
We don't want klog to print to anything other than GinkgoWriter, but it still
used os.Stderr in addition to GinkgoWriter when printing log entries with
severity >= error. Changing "stderrthreshold" fixes that.
The unit test for framework output handling didn't test klog behavior. Now it
does:
- os.Stderr is redirected, should be empty
- a new test invokes klog
The WaitFor* refactoring in 07c34eb400 had an oversight what timeout parameter
is used for calling WaitForAllPodsCondition() in WaitForPodsWithLabelRunningReady()
so the calls to WaitForPodsWithLabelRunningReady() ended up ignoring the user
provided timeout. Fix that.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Followup on https://github.com/kubernetes/kubernetes/pull/111846. This
particular test was left out from that PR because once it was enabled it
started failing. It was desired to merge
https://github.com/kubernetes/kubernetes/pull/111846 irrespective of
this particular test.
The failure in the test was caused due to the
`createFSGroupRequestPreHook` mock CSI driver hook function assuming
that the request object passed to it is an instance of the respective
struct, but it's actually a pointer instead. This resulted in the hook
function not fulfilling its purpose, and the so the test failed.
Fixes instances of #98213 (to ultimately complete #98213 linting is
required).
This commit fixes a few instances of a common mistake done when writing
parallel subtests or Ginkgo tests (basically any test in which the test
closure is dynamically created in a loop and the loop doesn't wait for
the test closure to complete).
I'm developing a very specific linter that detects this king of mistake
and these are the only violations of it it found in this repo (it's not
airtight so there may be more).
In the case of Ginkgo tests, without this fix, only the last entry in
the loop iteratee is actually tested. In the case of Parallel tests I
think it's the same problem but maybe a bit different, iiuc it depends
on the execution speed.
Waiting for the CI to confirm the tests are still passing, even after
this fix - since it's likely it's the first time those test cases are
executed - they may be buggy or testing code that is buggy.
Another instance of this is in `test/e2e/storage/csi_mock_volume.go` and
is still failing so it has been left out of this commit and will be
addressed in a separate one
The main purpose of this change is to update the e2e Netpol tests to use
the srandard CreateNamespace function from the Framework. Before this
change, a custom Namespace creation function was used, with the
following consequences:
* Pod security admission settings had to be enforced locally (not using
the centralized mechanism)
* the custom function was brittle, not waiting for default Namespace
ServiceAccount creation, causing tests to fail in some infrastructures
* tests were not benefiting from standard framework capabilities:
Namespace name generation, automatic Namespace deletion, etc.
As part of this change, we also do the following:
* clearly decouple responsibilities between the Model, which defines the
K8s objects to be created, and the KubeManager, which has access to
runtime information (actual Namespace names after their creation by
the framework, Service IPs, etc.)
* simplify / clean-up tests and remove as much unneeded logic / funtions
as possible for easier long-term maintenance
* remove the useFixedNamespaces compile-time constant switch, which
aimed at re-using existing K8s resources across test cases. The
reasons: a) it is currently broken as setting it to true causes most
tests to panic on the master branch, b) it is not a good idea to have
some switch like this which changes the behavior of the tests and is
never exercised in CI, c) it cannot possibly work as different test
cases have different Model requirements (e.g., the protocols list can
differ) and hence different K8s resource requirements.
For #108298
Signed-off-by: Antonin Bas <abas@vmware.com>
Introduce networking/v1alpha1 api group.
Add `ClusterCIDR` type to networking/v1alpha1 api group, this type
will enable the NodeIPAM controller to support multiple ClusterCIDRs.
Updates predicate to check for a length >=2 to avoid
the index out of bounds panic.
Signed-off-by: Edwin Xie <exie@vmware.com>
Co-authored-by: Tyler Schultz <tschultz@vmware.com>
- add feature gate
- add encrypted object and run generated_files
- generate protobuf for encrypted object and add unit tests
- move parse endpoint to util and refactor
- refactor interface and remove unused interceptor
- add protobuf generate to update-generated-kms.sh
- add integration tests
- add defaulting for apiVersion in kmsConfiguration
- handle v1/v2 and default in encryption config parsing
- move metrics to own pkg and reuse for v2
- use Marshal and Unmarshal instead of serializer
- add context for all service methods
- check version and keyid for healthz
Signed-off-by: Anish Ramasekar <anish.ramasekar@gmail.com>
Including the full information for successful tests makes the resulting XML
file too large for the 200GB limit in Spyglass when running large jobs (like
scale testing).
The original solution from https://github.com/kubernetes/kubernetes/pull/111627
broke JUnit reporting in other test suites, in particular
test/e2e_node. Keeping the code inside the framework ensures that all test
suites continue to have the JUnit reporting.
AfterReadingAllFlags is a good place to set this up because all test suites
using the test context are expected to call it before running tests and after
parsing flags.
Removing the ReportEntries added by ginkgo.By from all test reports usually
avoids the `system-err` part in the JUnit file, which in Spyglass avoids
the extra "open stdout" button.
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
Co-authored-by: Dave Chen <dave.chen@arm.com>
It is used to request that a pod runs in a unique user namespace.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Co-authored-by: Rodrigo Campos <rodrigoca@microsoft.com>
Including the full information for successful tests makes the resulting XML
file too large for the 200GB limit in Spyglass when running large jobs (like
scale testing).
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
Co-authored-by: Dave Chen <dave.chen@arm.com>
- PreemptionByKubeScheduler (Pod preempted by kube-scheduler)
- DeletionByTaintManager (Pod deleted by taint manager due to NoExecute taint)
- EvictionByEvictionAPI (Pod evicted by Eviction API)
- DeletionByPodGC (an orphaned Pod deleted by PodGC)PreemptedByScheduler (Pod preempted by kube-scheduler)
We now partly drop the support for seccomp annotations which is planned
for v1.25 as part of the KEP:
https://github.com/kubernetes/enhancements/issues/135
Pod security policies are not touched by this change and therefore we
have to keep the annotation key constants.
This means we only allow the usage of the annotations for backwards
compatibility reasons while the synchronization of the field to
annotation is no longer supported. Using the annotations for static pods
is also not supported any more.
Making the annotations fully non-functional will be deferred to a
future release.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>