Add volumePath parameter to all disruptive checks, so subpath tests can use
"/test-volume" and disruptive tests can use "/mnt/volume1" for their
respective Pods.
This adds a new resource.k8s.io API group with v1alpha1 as version. It contains
four new types: resource.ResourceClaim, resource.ResourceClass, resource.ResourceClaimTemplate, and
resource.PodScheduling.
This removes WaitTimeoutForPodNoLongerRunningOrNotFoundInNamespace
introduced in f2b9479f8e and changes
the test to use goroutines to speed up the cleanups.
Most CI jobs run an OS that does not support SELinux, therefore tests that
need it should be skipped by default.
* [Feature:SELinux] marks tests that need SELinux (for any feature)
* [Feature:SELinuxMountReadWriteOncePod] marks tests that need
SELinuxMountReadWriteOncePod alpha gate enabled.
Currently, all SELinux tests have both, but it will change in the future.
Also make some design changes exposed in testing and review.
Do not remove the ambiguous old metric
`apiserver_flowcontrol_request_concurrency_limit` because reviewers
thought it was too early. This creates a problem: that metric cannot
keep both of its old meanings. I chose the configured concurrency
limit.
Testing has revealed a design flaw, which concerns the initialization
of the seat demand state tracking. The current design in the KEP is
as follows.
> Adjustment is also done on configuration change … For a newly
> introduced priority level, we set HighSeatDemand, AvgSeatDemand, and
> SmoothSeatDemand to NominalCL-LendableSD/2 and StDevSeatDemand to
> zero.
But this does not work out well at server startup. As part of its
construction, the APF controller does a configuration change with zero
objects read, to initialize its request-handling state. As always,
the two mandatory priority levels are implicitly added whenever they
are not read. So this initial reconfig has one non-exempt priority
level, the mandatory one called catch-all --- and it gets its
SmoothSeatDemand initialized to the whole server concurrency limit.
From there it decays slowly, as per the regular design. So for a
fairly long time, it appears to have a high demand and competes
strongly with the other priority levels. Its Target is higher than
all the others, once they start to show up. It properly gets a low
NominalCL once other levels show up, which actually makes it compete
harder for borrowing: it has an exceptionally high Target and a rather
low NominalCL.
I have considered the following fix. The idea is that the designed
initialization is not appropriate before all the default objects are
read. So the fix is to have a mode bit in the controller. In the
initial state, those seat demand tracking variables are set to zero.
Once the config-producing controller detects that all the default
objects are pre-existing, it flips the mode bit. In the later mode,
the seat demand tracking variables are initialized as originally
designed.
However, that still gives preferential treatment to the default
PriorityLevelConfiguration objects, over any that may be added later.
So I have made a universal and simpler fix: always initialize those
seat demand tracking variables to zero. Even if a lot of load shows
up quickly, remember that adjustments are frequent (every 10 sec) and
the very next one will fully respond to that load.
Also: revise logging logic, to log at numerically lower V level when
there is a change.
Also: bug fix in float64close.
Also, separate imports in some file
Co-authored-by: Han Kang <hankang@google.com>
This change enables hot reload of encryption config file when api server
flag --encryption-provider-config-automatic-reload is set to true. This
allows the user to change the encryption config file without restarting
kube-apiserver. The change is detected by polling the file and by using an
fsnotify watcher. When the file is updated, it is processed to generate
a new set of transformers and the old ones are closed.
Signed-off-by: Nilekh Chaudhari <1626598+nilekhc@users.noreply.github.com>
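As a rough illustration of the watch-and-reload pattern described above (a minimal sketch, not the actual kube-apiserver code; the reload callback, logging, and error handling are placeholders):
```
package example

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

// watchEncryptionConfig invokes reload whenever the config file is written,
// so new transformers can be built and the old ones closed.
func watchEncryptionConfig(path string, reload func(path string) error) error {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer watcher.Close()
	if err := watcher.Add(path); err != nil {
		return err
	}
	for {
		select {
		case event, ok := <-watcher.Events:
			if !ok {
				return nil
			}
			// A write or create event means the content may have changed.
			if event.Op&(fsnotify.Write|fsnotify.Create) != 0 {
				if err := reload(path); err != nil {
					log.Printf("reloading encryption config: %v", err)
				}
			}
		case err, ok := <-watcher.Errors:
			if !ok {
				return nil
			}
			log.Printf("watch error: %v", err)
		}
	}
}
```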
This change adds a flag --encryption-provider-config-automatic-reload
which will be used to drive automatic reloading of the encryption
config at runtime. While this flag is set to true, or when KMS v2
plugins are used without KMS v1 plugins, the /healthz endpoints
associated with said plugins are collapsed into a single endpoint at
/healthz/kms-providers - in this state, it is not possible to
configure exclusions for specific KMS providers while including the
remaining ones - ex: using /readyz?exclude=kms-provider-1 to exclude
a particular KMS is not possible. This single healthz check handles
checking all configured KMS providers. When reloading is enabled
but no KMS providers are configured, it is a no-op.
k8s.io/apiserver does not support dynamic addition and removal of
healthz checks at runtime. Reloading will instead have a single
static healthz check and swap the underlying implementation at
runtime when a config change occurs.
Signed-off-by: Monis Khan <mok@microsoft.com>
so that it explicitly describes that group information defined in the
container image will be kept. This also adds an e2e test case for
SupplementalGroups with pre-defined groups in the container
image to make the behavior clearer.
- New API field .spec.schedulingGates
- Validation and drop disabled fields
- Disallow binding a Pod carrying non-nil schedulingGates
- Disallow creating a Pod with non-nil nodeName and non-nil schedulingGates
- Adds a {type:PodScheduled, reason:WaitingForGates} condition if necessary
- New literal SchedulingGated in the STATUS column of `k get pod`
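As a hedged illustration of the new field (not code from the PR; the gate name and image are made up), a gated Pod built with the Go client types looks roughly like this:
```
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// gatedPod stays unscheduled (PodScheduled=False) until every scheduling
// gate has been removed by its controller.
func gatedPod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "gated-pod"},
		Spec: corev1.PodSpec{
			SchedulingGates: []corev1.PodSchedulingGate{
				{Name: "example.com/provisioning"}, // hypothetical gate name
			},
			Containers: []corev1.Container{
				{Name: "app", Image: "registry.k8s.io/pause:3.9"},
			},
		},
	}
}
```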
In the PR https://github.com/kubernetes/kubernetes/pull/86139, two more lifecycle hook tests (poststart / prestop)
were added using HTTPS. They are similar to the existing HTTP tests.
However, this causes failures on Windows due to how networking
works there. We previously fixed this in the HTTP tests via f9e4a015e2.
This commit applies the same fix on the lifecycle hook HTTPS tests.
Signed-off-by: Ionut Balutoiu <ibalutoiu@cloudbasesolutions.com>
The cloud-provider and the e2e test were racing on deleting the
cloud resources.
Also, the cloud-provider should not leave orphaned resources; those will
be detected by the job and cause it to fail, so we should not have additional
cleanup logic masking these errors.
Removed the unit tests that test the cases when the MixedProtocolLBService feature flag was false - the feature flag is locked to true with GA
Added an integration test to test whether the API server accepts an LB Service with different protocols.
Added an e2e test to test whether a service which is exposed by a multi-protocol LB Service is accessible via both ports.
Removed the conditional validation that compared the new and the old Service definitions during an update - the feature flag is locked to true with GA.
The original intention (adding more information for later analysis)
is probably obsolete because there is no code which does anything
with the extended error.
The code in upgrade_suite.go collected it in an in-memory JUnit report, but
then didn't do anything with that field. The code also wouldn't work for
failures detected by Ginkgo itself, like the upcoming timeout handling. If the
upgrade suite needs the information, it probably should get it from Ginkgo with
a ReportAfterSuite call instead of depending on some fragile interception
mechanism.
Tests scheduler enforcement of the ReadWriteOncePod PVC access mode.
- Creates a pod using a PVC with ReadWriteOncePod
- Creates a second pod using the same PVC
- Observes the second pod fails to schedule because PVC is in-use
- Deletes the first pod
- Observes the second pod successfully schedules
Some of our API types contain fields that get rendered very poorly by
gomega.format.Object because they contain lots of internal information, for
example CreationTimestamp. As a result, dumping a full API object typically gets
truncated.
What we want is a representation that is a) multi-line (in contrast to the
stringer implemented by our types) and b) drops empty fields where it
was defined that this is okay.
The normal YAML representation fits that requirement. We just need to teach
gomega how and when to do that. This cannot be done for each type through a
generated GomegaString method (lots of code, additional dependency in public
API on YAML encoder), but it can be done inside tests by adding a formatting
handler (new gomega feature).
Our tooling cannot handle very long failure messages well:
- when unfolding a test in the spyglass UI, it fills the entire screen
- failure correlation for http://go.k8s.io/triage has resource constraints
We cannot enforce that all tests only produce short failure messages and even
if we could, depending on the test failure, including more information may be
useful to understand it.
To achieve both goals (summary for correlation and overview, all details
available when digging deeper), overly long failure messages now get truncated,
with the full message guaranteed to be captured in the test output.
"Too long" is arbitrarily chosen to be similar to the gomega.MaxLength because
that has been a limit for failure message size in the past.
When gomega.format exceeds the default size of 4000, it truncates and prints:
Gomega truncated this representation as it exceeds 'format.MaxLength'.
Consider having the object provide a custom 'GomegaStringer' representation
or adjust the parameters in Gomega's 'format' package.
Learn more here: https://onsi.github.io/gomega/#adjusting-output
These instructions don't help the user of the e2e.test binary unless we provide
a command line flag.
Commit 99e9096034 was only supposed to remove the
FailfWithOffset function, but it also changed the behavior by skipping one
additional stack frame. That makes no sense and is inconsistent with Fail, which
also logs the direct caller.
The Windows Server Core images are quite large (~2GB each), and pulling
them for multiple build jobs / E2E images is inefficient, especially if we
have to build for multiple OS versions.
The windows-servercore-cache image is meant to simply cache the Windows files
we need from the Windows Server core images, so we can pull the small cache image
instead of the entire image. It is never meant to be a promotable image,
and its version is not meant to be bumped.
The other images (e.g.: agnhost) rely on the version 1.0 images.
In the `should correctly account for terminated pods after restart` test, the
test first creates a set of `restartNever` pods, followed by a set of
`restartAlways` pods. Both the `restartNever` and `restartAlways` pods
request an entire CPU. As a result, the `restartAlways` pods will not be
admitted, if the `restartNever` pods did not terminate yet.
Depending on the timing/how fast the pods terminate, the test can sometimes
pass and sometimes fail, which results in flakes. To de-flake the test, the test
should wait until the `restartNever` pods enter a terminal `Succeeded`
phase, before creating the `restartAlways` pods.
To do this, generalize the function `waitForPods` to accept a pod
condition (`testutils.PodRunningReadyOrSucceeded`, or
`testutils.PodSucceeded`). Also introduce a new "Succeeded" pod
condition, so the test can explicitly wait until the pods enter the
Succeeded phase.
Signed-off-by: David Porter <david@porter.me>
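The shape of such a pod condition, sketched with illustrative names (the real helpers live in test/utils and may differ):
```
package example

import (
	v1 "k8s.io/api/core/v1"
)

// podCondition inspects a Pod and reports whether it reached the desired state.
type podCondition func(pod *v1.Pod) (bool, error)

// podSucceeded only accepts Pods in the terminal Succeeded phase, which is
// what the test waits for before creating the restartAlways pods.
func podSucceeded(pod *v1.Pod) (bool, error) {
	return pod.Status.Phase == v1.PodSucceeded, nil
}

// podRunningReadyOrSucceeded also accepts running Pods (the readiness check
// is omitted here for brevity).
func podRunningReadyOrSucceeded(pod *v1.Pod) (bool, error) {
	return pod.Status.Phase == v1.PodRunning || pod.Status.Phase == v1.PodSucceeded, nil
}
```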
Currently, when running node e2e it's not possible to use the ginkgo `--repeat`
flag to run the test suite multiple times. This is useful when debugging tests
and ensuring they are not flaky by re-running them several times. Currently if
using the `--repeat` ginkgo flag, the 2nd run of the test will fail due to the kubelet
not starting, with a message like:
```
Failed to start transient service unit: Unit kubelet-20221020T040841.service already exists.
```
This is because during the test startup, kubelet is started as a transient unit
file via `systemd-run`. The unit is started with the `--remain-after-exit` flag
to ensure that the unit will remain even if the kubelet is restarted. The test
suite currently uses the `systemctl kill` command to stop kubelet. This works fine for
stopping the kubelet, but on the second run, when `systemd-run` is used to start the
systemd unit again, it will fail because the unit already exists. This is because
`systemctl kill` will not delete the systemd unit, it only sends a SIGTERM signal to it.
To fix this, add `unitName` as a field to the `server` struct. When
kubelet server is constructed, set the unit name. As part of e2e test
termination, in `E2EServices.Stop()`, stop the kubelet systemd unit. By
stopping the kubelet systemd unit, systemd will delete the systemd
transient unit, allowing it to be created and started again in a
subsequent e2e run.
Signed-off-by: David Porter <david@porter.me>
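Roughly, the cleanup step boils down to the following sketch (the unitName value and helper are illustrative, not the exact node e2e code):
```
package example

import (
	"fmt"
	"os/exec"
)

type server struct {
	unitName string // recorded when kubelet is started via systemd-run
}

// Stop stops the transient unit. Unlike `systemctl kill`, stopping the unit
// lets systemd remove it, so a later systemd-run can create it again.
func (s *server) Stop() error {
	out, err := exec.Command("systemctl", "stop", s.unitName).CombinedOutput()
	if err != nil {
		return fmt.Errorf("failed to stop unit %q: %v, output: %s", s.unitName, err, out)
	}
	return nil
}
```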
The original intention was to address "frustration of end users running the e2e
suite is that they take a significant amount of time and it is difficult to
gauge progress".
But Ginkgo's output is different now than it was in Kubernetes 1.19. If users
want to see progress, then "ginkgo --progress" might provide enough
information.
Printing to os.Stdout doesn't work as intended anyway when output redirection
is enabled (the default for parallel runs) and causes these JSON snippets to
appear as "show stdout" for each failed test in a Prow job, which is
distracting.
Tests should accept a context from Ginkgo and pass it through to all functions
which may block for a longer period of time. In particular all Kubernetes API
calls through client-go should use that context. Then if a timeout occurs,
the test returns immediately because everything that it could block on will
return.
Cleanup code then needs to run in a separate Ginkgo node, typically
DeferCleanup, which ensures that it gets a separate context which has not timed
out yet.
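A minimal sketch of the pattern (illustrative, not a real test): the spec takes the Ginkgo-provided context, uses it for client-go calls, and registers cleanup with DeferCleanup so the cleanup gets its own, not-yet-expired context.
```
package example

import (
	"context"

	"github.com/onsi/ginkgo/v2"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func registerSpecs(client kubernetes.Interface) {
	ginkgo.It("creates and cleans up a namespace", func(ctx context.Context) {
		// API calls use the Ginkgo context, so they return promptly when the
		// spec times out or is interrupted.
		ns, err := client.CoreV1().Namespaces().Create(ctx,
			&corev1.Namespace{ObjectMeta: metav1.ObjectMeta{GenerateName: "demo-"}},
			metav1.CreateOptions{})
		if err != nil {
			ginkgo.Fail(err.Error())
		}
		// Cleanup runs in its own node with a fresh context that has not
		// timed out, even if the spec body above was interrupted.
		ginkgo.DeferCleanup(func(ctx context.Context) error {
			return client.CoreV1().Namespaces().Delete(ctx, ns.Name, metav1.DeleteOptions{})
		})
	})
}
```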
Align the behavior of HTTP-based lifecycle handlers and HTTP-based
probers, converging on the probers implementation. This fixes multiple
deficiencies in the current implementation of lifecycle handlers
surrounding what functionality is available.
The functionality is gated by the features.ConsistentHTTPGetHandlers feature gate.
The device plugin test in https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-serial-containerd
has been flaky for a while now when it runs on the test infrastructure.
Locally, running this test resulted in the test passing without issues.
Based on the existing logs, it is not clear why the podresources
API endpoint is returning 3 pods rather than the expected
two pods (the device plugin pod and the test pod requesting
devices). For more clarity and debuggability on why an
additional pod seems to be appearing, we expose the output
from the podresources API endpoint.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
* add superuser fallback to authorizer
* change the order of authorizers
* change the order of authorizers
* remove the duplicate superuser authorizer
* add integration test for superuser permissions
e2e test validates the following 3 endpoints
- patchCoreV1NamespacedResourceQuotaStatus
- readCoreV1NamespacedResourceQuotaStatus
- replaceCoreV1ResourceQuotaForAllNamespacesStatus
This addresses a problem caused by
https://github.com/kubernetes/kubernetes/pull/112043: because the AfterEach
which invokes AllNodesReady always runs, including for tests that skipped early,
those tests ran into a nil pointer access. This increased the size of log
files. The tests still worked.
Adds two tests for the enforcement of the ReadWriteOncePod
PersistentVolume access mode.
1. Tests that when two Pods are scheduled that reference the same
ReadWriteOncePod PVC, the latter-scheduled Pod will be marked
unschedulable because the PVC is in-use.
2. Tests that when two Pods are scheduled on the same node (setting
Pod.Spec.NodeName to bypass scheduling for the second Pod), the
latter Pod will fail to start because the PVC is already mounted on
the Node.
Included are changes to update the hostpath CSI driver to accept new CSI
access modes. Its sidecar containers are already at supported versions
for ReadWriteOncePod and don't need updating. The GCP PD CSI driver does
not yet support the new CSI access modes, but its sidecar containers are
at supported versions and so the feature will work.
To support ReadWriteOncePod, the following CSI sidecars must be updated
to these versions or greater:
- csi-provisioner:v3.0.0+
- csi-attacher:v3.3.0+
- csi-resizer:v1.3.0+
For more details, see:
https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/2485-read-write-once-pod-pv-access-mode/README.md
The reason for the issue is that the metrics were bumped before the
final job status update. In case the update failed the path was
repeated by the next syncJob, leading to double-counting of the metrics.
The solution is to delay recording metrics and broadcasting events
until after the job status update succeeds.
This change updates the API server code to load the encryption
config once at start up instead of multiple times. Previously the
code would set up the storage transformers and the etcd healthz
checks in separate parse steps. This is problematic for KMS v2 key
ID based staleness checks which need to be able to assert that the
API server has a single view into the KMS plugin's current key ID.
Signed-off-by: Monis Khan <mok@microsoft.com>
The change made in https://github.com/kubernetes/kubernetes/pull/112644
resulted in an update to the rejection message. In the memory manager
node e2e test, we still checked against the old expected error message,
giving the impression that the pod succeeded in running even though it failed
as expected, mainly because the check wasn't performed correctly.
In this patch, we update to the correct rejection message to make sure
that the memory manager is no longer failing.
NOTE: This test is supposed to run on multi NUMA systems and if the
underlying node does not have multi NUMA nodes, the test is skipped
which is what happens in upstream test infrastructure as it is mainly
composed of single NUMA nodes. Because of this, this test failure
wasn't evident via testgrid.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
It turned out to be unreliable (see
https://github.com/kubernetes/kubernetes/issues/112834). Encoding the data
inside the command as input for base64 is a workaround that is fine for small
amounts of data. It becomes less efficient and/or unusable for large amounts.
Mark remotecommand.Executor as deprecated and related modifications.
Handle crash when streamer.stream panics
Add a test to verify if stream is closed after connection being closed
Remove blank line and update waiting time to 1s to avoid test flakes in CI.
Refine the tests of StreamExecutor according to comments.
Remove the comment of context controlling the negotiation progress and misc.
Signed-off-by: arkbriar <arkbriar@gmail.com>
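For orientation, a hedged sketch of the context-aware call path (assumes a client-go version that exposes StreamWithContext; the timeout is arbitrary):
```
package example

import (
	"context"
	"net/url"
	"os"
	"time"

	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

func execWithTimeout(config *rest.Config, reqURL *url.URL) error {
	exec, err := remotecommand.NewSPDYExecutor(config, "POST", reqURL)
	if err != nil {
		return err
	}
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	// StreamWithContext returns when the command finishes, the connection is
	// closed, or the context expires, instead of hanging indefinitely.
	return exec.StreamWithContext(ctx, remotecommand.StreamOptions{
		Stdout: os.Stdout,
		Stderr: os.Stderr,
	})
}
```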
add tests to ensure the podresources metrics are exposed,
and basic sanity tests for their values.
Signed-off-by: Francesco Romani <fromani@redhat.com>
This adds test coverage for NewFrameworkExtensions and shows better how
BeforeEach callbacks are invoked. The unit test is not strictly about just the
cleanup operations anymore, but that's okay(ish).
The "todo" packages were necessary while moving code around to avoid hitting
cyclic dependencies. Now that any sub package can depend on the framework, they
are no longer needed and the code can be moved into the normal sub packages.
This helps getting rid of the ssh dependency. The same init package as for
dumping namespaces takes care of adding the functionality back to framework
instances.
This helps getting rid of the ssh dependency. The same init package as for
dumping namespaces takes care of adding the functionality back to framework
instances.
This reduces the size of the test/e2e/framework itself. Because it does not
gather metrics data anymore by default, E2E test suites must set their
own callback function or set the original one by importing
"k8s.io/kubernetes/test/e2e/framework/todo/metrics/init".
This reduces the size of the test/e2e/framework itself. Because it does not
check nodes anymore by default, E2E test suites must set their own check
function or set the original one by importing
"k8s.io/kubernetes/test/e2e/framework/todo/node/init".
This reduces the size of the test/e2e/framework itself. Because it does not
dump anything anymore by default, E2E test suites must set their own dump
function or set the original one by importing
"k8s.io/kubernetes/test/e2e/framework/debug/init".
This will be used as mechanism for invoking some of the code which is currently
hard-coded in the framework once that code is placed in optional packages.
During the September 29th, 2022 SIG-Network meeting we decided to
demote the two affinity timeout conformance tests. This was because:
(a) there is no documented correct behavior for these tests other than
"what kube-proxy does"
(b) even the kube-proxy behavior differs depending on the backend implementation
of iptables, IPVS, or [win]userspace (and winkernel doesn't at all)
(c) iptables uses only srcip matching, while userspace and IPVS use srcip+srcport
(d) IPVS and iptables have different minimum timeouts and we had
to hack up the test itself to make IPVS pass
(e) popular 3rd party network plugins also vary in their implementation
Our plan is to deprecate the current affinity options and re-add specific
options for various behaviors so it's clear exactly what plugins support
and which behavior (if any) we want to require for conformance in the future.
Signed-off-by: Dan Williams <dcbw@redhat.com>
e2e test validates the following 3 endpoints
- listCoreV1LimitRangeForAllNamespaces
- patchCoreV1NamespacedLimitRange
- deleteCoreV1CollectionNamespacedLimitRange
* Update Url string to have only one slash
Signed-off-by: Akanksha Kumari <akankshakumari393@gmail.com>
* Trim / from Right in hostname
Signed-off-by: Akanksha Kumari <akankshakumari393@gmail.com>
Considering we have removed the glusterfs driver from the
repo, we no longer need the test server for CI. This commit
removes it.
Ref# https://github.com/kubernetes/kubernetes/pull/112015
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
- Moves kms proto apis to the staging repo
- Updates generate and verify kms proto scripts to check staging repo
Signed-off-by: Anish Ramasekar <anish.ramasekar@gmail.com>
* Add e2e tests for events command
* Run events tests as normal e2e instead of conformance
Conformance tests are only for GA features. Since `kubectl events`
currently is in alpha stage, e2e tests for this command should be
run as standard e2e.
The kubectl delete -f commands in stop-kubemark.sh may get stuck if the
cluster setup is removed concurrently. It is not important for these
commands to succeed so a timeout of 5m is set.
Since we have upgraded the snapshot controller to version v6, the
snapshot tests look to be failing in the testgrid. This has been mainly
because the latest version of the snapshot controller stopped serving
v1beta1 APIs. The sidecar image versions in the tests also have to be
updated to make sure these are compatible.
This commit adds missing RBAC rules for the controller as per the
latest version.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
When waiting for the default service account in a new namespace, not finding
one was reported as "unexpected error: timed out waiting for the condition"...
etcd only fully supports linux && amd64; the other architectures
and OSes are only guaranteed to build, see:
https://etcd.io/docs/v3.5/op-guide/supported-platform/#support-tiers
Skip the tests that use etcd on not-well-supported environments to
guarantee the stability of the test.
Getting a message about "pod ran to completion" is confusing when the pod
hasn't been able to start at all. The failed state now has a different message.
To address the previous ambiguity, the success state is described as "ran to
completion successfully".
When Ginkgo shows a BeforeEach/AfterEach/DeferCleanup, then it can only show
the source code where the callback was registered because there is no
description parameter. This can be improved by passing a custom CodeLocation.
Because a description like "set up framework" might not be enough, the source
code is still shown, too.
This is useful for running a driver on a subset of all ready nodes:
- use e2enode.GetBoundedReadySchedulableNodes with a suitable
maximum number of nodes to determine how many nodes are available
for a test
- define pod anti-affinity in the PodTemplate:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/instance: xxxxxxx
topologyKey: kubernetes.io/hostname
- set the ReplicaSetSpec.Replicas value to the number of nodes
If the control plane emits anything at the time when the test runs, for example
"unable to sync kubernetes service", the test breaks because that additional
output is unexpected.
Pulling the CreateKubeConfig function from the expensive to build
test/utils/apiserver package had a considerable impact on the overall build
time because that package depends on a lot of other packages.
Because only that one function is needed by the framework, that extra build
time can be avoided by moving it into its own package.
The framework.AddCleanupAction API was a workaround for Ginkgo v1 not invoking
AfterEach callbacks after a test failure. Ginkgo v2 not only fixed that, but
also added a DeferCleanup API which can be used to run some code if (and only
if!) the corresponding setup code ran. In several cases that makes the test
cleanup simpler.
This covers multiple facets of the current framework and of Ginkgo:
- Ginkgo output is verbose and includes detailed progress
messages (BeforeEach/AfterEach tracing).
- Namespace creation.
- Order of callback invocation.
This runs etcd and an apiserver using it inside the test process. The caller
can either use the ClientSet or the config file. More options might get added
in the future.
Co-author: Antonio Ojea <antonio.ojea.garcia@gmail.com>
When using By or some other Ginkgo output functions, Ginkgo v2 now adds a time
stamp at the end of the line that we need to ignore. Will become relevant when
testing more complete output.
For cleanup purposes the ginkgo.DeferCleanup is a better replacement for
f.AddAfterEach:
- the cleanup only gets executed when the corresponding setup code ran
and can use the same local variables
- the callback runs after the test and before the framework
deletes namespaces (as before)
- if one callback fails, the others still get executed
For the original purpose (https://github.com/kubernetes/kubernetes/pull/86177 "This is
very useful for custom gathering scripts.") it is now possible to use
ginkgo.AfterEach because it will always get executed. Just beware that its
callbacks run in first-in-first-out order.
In contrast to ginkgo.AfterEach, ginkgo.DeferCleanup runs the callback in
first-in-last-out order. Using it makes the following test code work as
expected:
f := framework.NewDefaultFramework("some test")
ginkgo.AfterEach(func() {
// do something with f.ClientSet
})
Previously, f.ClientSet was already set to nil by the framework's cleanup code.
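For comparison, the DeferCleanup variant looks roughly like this (a fragment mirroring the snippet above, not complete code):
```
f := framework.NewDefaultFramework("some test")
ginkgo.It("does something", func() {
	// Registered inside the spec, so it only runs if this body ran. It still
	// sees a valid f.ClientSet because, in first-in-last-out order, it runs
	// before the framework's own cleanup that was registered earlier.
	ginkgo.DeferCleanup(func() {
		// do something with f.ClientSet
	})
})
```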
After updating gRPC in node-driver-registrar from v1.40.0 to v1.47.0, the
behavior of gRPC changed in a way such that it no longer detected the
single-sided closing of the stream as a loss of connection. This caused gRPC in
the e2e.test to get stuck, possibly in a Read or Write for the HTTP stream
because those have neither a context nor a timeout.
Changing the connection handling so that all active connections are tracked in
the listener and closed when the listener gets closed fixed this problem.
Some scripts and tools still relied on the deprecated flags, the ones
which are about to be removed.
This is intentionally not a complete removal of all those flags in the entire
repo. This would lead to much more code churn also in places where commands
still accept the flags because they use klog directly.
The custom progress reporter gets invoked via ginkgo.ReportAfterEach after each
test. The problem was that the e2e framework unconditionally enables Ginkgo's
-progress output which shows execution of all nodes, including this
ReportAfterEach. The effect was over 1000 lines of useless output at the start
of a test run while skipping disabled tests.
The solution is to tell Ginkgo that the ReportAfterEach isn't meant to be
reported.
This change updates TestAggregatedAPIServer and the related test
server wiring to exercise the full network path between the Kube API
server and the aggregated API server. We now assert that the wardle
API service and Kube API server discovery endpoints are fully healthy.
CRUD operations are performed through the Kube API server to the
wardle API server.
Signed-off-by: Monis Khan <mok@microsoft.com>
Contextual logging cannot be enabled manually because there is no feature gate
flag. Enabling the feature unconditionally:
- should be low risk for E2E testing
- if it fails, we want to know
- is useful to get better log output from code which already supports it
We don't want klog to print to anything other than GinkgoWriter, but it still
used os.Stderr in addition to GinkgoWriter when printing log entries with
severity >= error. Changing "stderrthreshold" fixes that.
The unit test for framework output handling didn't test klog behavior. Now it
does:
- os.Stderr is redirected, should be empty
- a new test invokes klog
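The gist of the setup, as a hedged sketch (the framework's actual flag values may differ):
```
package example

import (
	"flag"

	"github.com/onsi/ginkgo/v2"
	"k8s.io/klog/v2"
)

func routeKlogToGinkgo() {
	// Send all klog output to GinkgoWriter.
	klog.SetOutput(ginkgo.GinkgoWriter)
	fs := flag.NewFlagSet("klog", flag.ContinueOnError)
	klog.InitFlags(fs)
	_ = fs.Set("logtostderr", "false")
	_ = fs.Set("alsologtostderr", "false")
	// By default klog still copies entries with severity >= error to
	// os.Stderr; raising the threshold keeps them in GinkgoWriter only.
	_ = fs.Set("stderrthreshold", "FATAL")
}
```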
The WaitFor* refactoring in 07c34eb400 had an oversight regarding which timeout parameter
is used for calling WaitForAllPodsCondition() in WaitForPodsWithLabelRunningReady()
so the calls to WaitForPodsWithLabelRunningReady() ended up ignoring the user
provided timeout. Fix that.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Followup on https://github.com/kubernetes/kubernetes/pull/111846. This
particular test was left out from that PR because once it was enabled it
started failing. It was desired to merge
https://github.com/kubernetes/kubernetes/pull/111846 irrespective of
this particular test.
The failure in the test was caused by the
`createFSGroupRequestPreHook` mock CSI driver hook function assuming
that the request object passed to it is an instance of the respective
struct, but it's actually a pointer instead. This resulted in the hook
function not fulfilling its purpose, and so the test failed.
Fixes instances of #98213 (to ultimately complete #98213 linting is
required).
This commit fixes a few instances of a common mistake made when writing
parallel subtests or Ginkgo tests (basically any test in which the test
closure is dynamically created in a loop and the loop doesn't wait for
the test closure to complete).
I'm developing a very specific linter that detects this kind of mistake,
and these are the only violations of it that it found in this repo (it's not
airtight so there may be more).
In the case of Ginkgo tests, without this fix, only the last entry of
the loop is actually tested. In the case of Parallel tests I
think it's the same problem but maybe a bit different; if I understand
correctly, it depends on the execution speed.
Waiting for the CI to confirm the tests are still passing, even after
this fix - since it's likely it's the first time those test cases are
executed - they may be buggy or testing code that is buggy.
Another instance of this is in `test/e2e/storage/csi_mock_volume.go` and
is still failing so it has been left out of this commit and will be
addressed in a separate one
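The mistake class, illustrated generically (not one of the affected Kubernetes tests): without re-binding the loop variable, every parallel subtest closure ends up observing the last entry.
```
package example

import (
	"strings"
	"testing"
)

func TestUpper(t *testing.T) {
	cases := []struct{ name, in, want string }{
		{"lower", "a", "A"},
		{"already upper", "B", "B"},
	}
	for _, tc := range cases {
		tc := tc // re-bind: parallel subtests outlive the loop iteration
		t.Run(tc.name, func(t *testing.T) {
			t.Parallel()
			if got := strings.ToUpper(tc.in); got != tc.want {
				t.Errorf("got %q, want %q", got, tc.want)
			}
		})
	}
}
```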
The main purpose of this change is to update the e2e Netpol tests to use
the standard CreateNamespace function from the Framework. Before this
change, a custom Namespace creation function was used, with the
following consequences:
* Pod security admission settings had to be enforced locally (not using
the centralized mechanism)
* the custom function was brittle, not waiting for default Namespace
ServiceAccount creation, causing tests to fail in some infrastructures
* tests were not benefiting from standard framework capabilities:
Namespace name generation, automatic Namespace deletion, etc.
As part of this change, we also do the following:
* clearly decouple responsibilities between the Model, which defines the
K8s objects to be created, and the KubeManager, which has access to
runtime information (actual Namespace names after their creation by
the framework, Service IPs, etc.)
* simplify / clean up tests and remove as much unneeded logic / functions
as possible for easier long-term maintenance
* remove the useFixedNamespaces compile-time constant switch, which
aimed at re-using existing K8s resources across test cases. The
reasons: a) it is currently broken as setting it to true causes most
tests to panic on the master branch, b) it is not a good idea to have
some switch like this which changes the behavior of the tests and is
never exercised in CI, c) it cannot possibly work as different test
cases have different Model requirements (e.g., the protocols list can
differ) and hence different K8s resource requirements.
For #108298
Signed-off-by: Antonin Bas <abas@vmware.com>
Introduce networking/v1alpha1 api group.
Add `ClusterCIDR` type to networking/v1alpha1 api group, this type
will enable the NodeIPAM controller to support multiple ClusterCIDRs.
Updates predicate to check for a length >=2 to avoid
the index out of bounds panic.
Signed-off-by: Edwin Xie <exie@vmware.com>
Co-authored-by: Tyler Schultz <tschultz@vmware.com>
- add feature gate
- add encrypted object and run generated_files
- generate protobuf for encrypted object and add unit tests
- move parse endpoint to util and refactor
- refactor interface and remove unused interceptor
- add protobuf generate to update-generated-kms.sh
- add integration tests
- add defaulting for apiVersion in kmsConfiguration
- handle v1/v2 and default in encryption config parsing
- move metrics to own pkg and reuse for v2
- use Marshal and Unmarshal instead of serializer
- add context for all service methods
- check version and keyid for healthz
Signed-off-by: Anish Ramasekar <anish.ramasekar@gmail.com>
Including the full information for successful tests makes the resulting XML
file too large for the 200GB limit in Spyglass when running large jobs (like
scale testing).
The original solution from https://github.com/kubernetes/kubernetes/pull/111627
broke JUnit reporting in other test suites, in particular
test/e2e_node. Keeping the code inside the framework ensures that all test
suites continue to have the JUnit reporting.
AfterReadingAllFlags is a good place to set this up because all test suites
using the test context are expected to call it before running tests and after
parsing flags.
Removing the ReportEntries added by ginkgo.By from all test reports usually
avoids the `system-err` part in the JUnit file, which in Spyglass avoids
the extra "open stdout" button.
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
Co-authored-by: Dave Chen <dave.chen@arm.com>
It is used to request that a pod runs in a unique user namespace.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Co-authored-by: Rodrigo Campos <rodrigoca@microsoft.com>
- PreemptionByKubeScheduler (Pod preempted by kube-scheduler)
- DeletionByTaintManager (Pod deleted by taint manager due to NoExecute taint)
- EvictionByEvictionAPI (Pod evicted by Eviction API)
- DeletionByPodGC (an orphaned Pod deleted by PodGC)
We now partly drop the support for seccomp annotations which is planned
for v1.25 as part of the KEP:
https://github.com/kubernetes/enhancements/issues/135
Pod security policies are not touched by this change and therefore we
have to keep the annotation key constants.
This means we only allow the usage of the annotations for backwards
compatibility reasons while the synchronization of the field to
annotation is no longer supported. Using the annotations for static pods
is also not supported any more.
Making the annotations fully non-functional will be deferred to a
future release.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
this is specifically so that we have more structured information when
the metric is parsed and stored as a stable metric. This change does not
change the name of the actual metrics.
Change-Id: I861482401ad9a0ae12306b93abf91d6f76d7a407
In order to promote kubectl alpha events to beta,
it should at least support the flags that are already
supported by kubectl get events, as well as new flags.
This PR adds:
--output: json|yaml support, and does essential refactoring to
integrate other printing options more easily in the future.
--no-headers: kubectl get events can hide headers when this flag is set for default printing.
This adds the same ability to hide headers to kubectl alpha events.
This flag has no effect when output is json or yaml, for both commands.
--types: This will be used to filter certain events to be printed and
discard others (the default behavior is the same as with --event=Normal,Warning).
Also add entrypoints for verifying and updating a test file for easier
debugging. This is considerably faster than running the stability checks
against the entire Kubernetes codebase.
Change-Id: I5d5e5b3abf396ebf1317a44130f20771a09afb7f
- Run hack/update-codegen.sh
- Run hack/update-generated-device-plugin.sh
- Run hack/update-generated-protobuf.sh
- Run hack/update-generated-runtime.sh
- Run hack/update-generated-swagger-docs.sh
- Run hack/update-openapi-spec.sh
- Run hack/update-gofmt.sh
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
The "pause" pods that are being run in the scheduling tests are
sometimes launched in system namespaces. Therefore even if a test
is considered to be running on a "baseline" Pod Security admission
level, its "baseline" pods would fail to run if the global PSa
enforcement policy is set to "restricted" - the system namespaces
have no PSa labels.
The "pause" pods run by this test can actually easily run with
"restricted" security context, and so this patch turns them
into just that.
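Sketch of a pause pod whose security context satisfies the "restricted" profile (illustrative helper, not the exact test code):
```
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func restrictedPausePod(name, image string) *corev1.Pod {
	runAsNonRoot := true
	allowPrivilegeEscalation := false
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: corev1.PodSpec{
			SecurityContext: &corev1.PodSecurityContext{
				RunAsNonRoot:   &runAsNonRoot,
				SeccompProfile: &corev1.SeccompProfile{Type: corev1.SeccompProfileTypeRuntimeDefault},
			},
			Containers: []corev1.Container{{
				Name:  "pause",
				Image: image,
				SecurityContext: &corev1.SecurityContext{
					AllowPrivilegeEscalation: &allowPrivilegeEscalation,
					Capabilities:             &corev1.Capabilities{Drop: []corev1.Capability{"ALL"}},
				},
			}},
		},
	}
}
```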
The test breaks the controllers that depend on api services being resolvable,
for example the namespace controller, which is heavily used by the e2e framework to clean up the environment.
NPD can run in "standalone" or "daemonset" mode in cluster GCE tests.
For standalone mode, NPD runs as a systemd unit; in daemonset mode it runs
as a pod.
If NPD is running in standalone mode, as soon as we detect that the npd
service does not exist with `systemctl status`, we should skip the rest
of the checks such as looking at the journalctl logs or pid.
Signed-off-by: David Porter <david@porter.me>
Fix rollout history bug where the latest revision was
always shown when requesting a specific revision and
specifying an output.
Add unit and integration tests for rollout history.
As suggested by the Ginkgo migration guide, the `Measure` node was
deprecated and replaced with `It` node which creates `gmeasure.Experiment`.
Signed-off-by: Dave Chen <dave.chen@arm.com>
Some operations with Azure Disk volumes can be arbitrarily slow
sometimes. This causes CI flakes that are hard to debug.
Bumping the timeouts has been shown to improve or eliminate this issue.
Ginkgo is now writing the JUnit file itself. The -report-dir parameter is used
as fallback for enabling JUnit output in case that users haven't migrated to
the new -junit-report parameter.
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
Signed-off-by: Dave Chen <dave.chen@arm.com>
Besides, using the method might lead to a `concurrent map writes`
issue per the discussion here: https://github.com/onsi/ginkgo/issues/970
Signed-off-by: Dave Chen <dave.chen@arm.com>
Default timeout setting has been reduced from `24h` down to `1h` in
Ginkgo V2, but for some long running test this is too short.
How long before the test was aborted was controlled by the Linux command `timeout`
in V1, e.g. `timeout -k 30s 150m ...`, and is configured in files
like `sig-network-misc.yaml`.
Set the timeout manually for Ginkgo V2 to avoid aborting early.
Signed-off-by: Dave Chen <dave.chen@arm.com>
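One way to do that in the suite bootstrap (a sketch; the 3h value is just an example, and the --timeout CLI flag works as well):
```
package example

import (
	"testing"
	"time"

	"github.com/onsi/ginkgo/v2"
	"github.com/onsi/gomega"
)

func TestE2E(t *testing.T) {
	gomega.RegisterFailHandler(ginkgo.Fail)
	suiteConfig, reporterConfig := ginkgo.GinkgoConfiguration()
	// Ginkgo V2 defaults the suite timeout to 1h; long-running suites that
	// used to rely on the external `timeout` command need a larger value.
	suiteConfig.Timeout = 3 * time.Hour
	ginkgo.RunSpecs(t, "example suite", suiteConfig, reporterConfig)
}
```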
`FullStackTrace` is not available in v2 if no exception is found
during test execution.
The change is needed for conformance test's spec validation.
pls see: https://github.com/onsi/ginkgo/issues/960 for details.
Signed-off-by: Dave Chen <dave.chen@arm.com>
The change is needed to verify the conformance test spec, as this
is verified in `verify-conformance-yaml.sh`.
Signed-off-by: Dave Chen <dave.chen@arm.com>
Full stack traces are on by default. The approach for collecting results is
different. Tests run in their own goroutine, therefore runTests is no longer
part of their callstack. To cover stack traces with more than one entry, a new
test case gets added with a separate helper function.
Gomega object formatting now includes the type.
This removes the last remaining reference to Ginkgo v1.
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
Signed-off-by: Dave Chen <dave.chen@arm.com>
- update all the import statements
- run hack/pin-dependency.sh to change pinned dependency versions
- run hack/update-vendor.sh to update go.mod files and the vendor directory
- update the method signatures for custom reporters
Signed-off-by: Dave Chen <dave.chen@arm.com>
Seemingly, on slow connections, if the response to the /configz request was
chunked, the kubectl proxy was terminated before the response body was
received and read, causing unexpected EOF errors. This patch changes the
configz polling code so that the whole response body is read before
closing the proxy connection.
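The essence of the fix, sketched (URL path and helper are illustrative): drain the body completely before the proxy connection is torn down.
```
package example

import (
	"io"
	"net/http"
)

func fetchConfigz(proxyBase, node string) ([]byte, error) {
	resp, err := http.Get(proxyBase + "/api/v1/nodes/" + node + "/proxy/configz")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	// io.ReadAll reads every chunk of the response before we return and the
	// kubectl proxy is closed, avoiding truncated bodies and EOF errors.
	return io.ReadAll(resp.Body)
}
```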
The test validates the following endpoints
- createAppsV1NamespacedControllerRevision
- deleteAppsV1CollectionNamespacedControllerRevision
- deleteAppsV1NamespacedControllerRevision
- listAppsV1ControllerRevisionForAllNamespaces
- patchAppsV1NamespacedControllerRevision
- readAppsV1NamespacedControllerRevision
- replaceAppsV1NamespacedControllerRevision
As outlined in the KEP, we now graduate the Kubelet feature to beta
which means that it is enabled by default. The corresponding Kubelet
flag still defaults to `false`, but we now have the chance to e2e test
the feature by using a new serial test case.
KEP: https://github.com/kubernetes/enhancements/issues/2413
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
e2e test validates the following 3 extra endpoints
- deleteCoreV1CollectionNamespacedResourceQuota
- listCoreV1ResourceQuotaForAllNamespaces
- patchCoreV1NamespacedResourceQuota
As described in 8c76845b03 ("test/e2e/network: fix a bug in the hostport e2e
test") if we have two pods with the same hostPort, hostIP, but different
protocols, a CNI may be buggy and decide to forward all traffic only to one of
these pods. Add a check that we are receiving requests from different pods.
Co-authored-by: Antonio Ojea <antonio.ojea.garcia@gmail.com>
Making the LoggingConfiguration part of the versioned component-base/config API
had the theoretic advantage that components could have offered different
configuration APIs with experimental features limited to alpha versions (for
example, sanitization offered only in a v1alpha1.KubeletConfiguration). Some
components could have decided to only use stable logging options.
In practice, this wasn't done. Furthermore, we don't want different components
to make different choices regarding which logging features they offer to
users. It should always be the same everywhere, for the sake of consistency.
This can be achieved with a saner Go API by dropping the distinction between
internal and external LoggingConfiguration types. Different stability levels of
individual fields have to be covered by documentation (done) and potentially
feature gates (not currently done).
Advantages:
- everything related to logging is under component-base/logs;
previously this was scattered across different packages and
different files under "logs" (why some code was in logs/config.go
vs. logs/options.go vs. logs/logs.go always confused me again
and again when coming back to the code):
- long-term config and command line API are clearly separated
into the "api" package underneath that
- logs/logs.go itself only deals with legacy global flags and
logging configuration
- removal of separate Go APIs like logs.BindLoggingFlags and
logs.Options
- LogRegistry becomes an implementation detail, with less code
and less exported functionality (only registration needs to
be exported, querying is internal)
The hostport e2e test (sonobuoy run --e2e-focus 'validates that there is no
conflict between pods with same hostPort but different hostIP and protocol')
checks, in particular, that two pods with the same hostPort, the same hostIP,
but different L4 protocols can coexist on one node.
In order to do this, the test creates two pods with the same hostIP:hostPort,
one TCP-based, another UDP-based. However, both pods listen on both protocols:
netexec --http-port=8080 --udp-port=8080
It can happen that a CNI which doesn't distinguish between TCP and UDP
hostPorts forwards all traffic, TCP or UDP, to the same pod. As this pod
listens on both protocols it will reply to both requests, and the test
will think that everything works properly while the second pod is indeed
disconnected. Fix this by executing different commands in different pods:
TCP: netexec --http-port=8080 --udp-port=-1
UDP: netexec --http-port=8008 --udp-port=8080
The TCP pod now doesn't listen on UDP, and the UDP pod doesn't listen on TCP on
the target hostPort. The UDP pod still needs to listen on TCP on another port
so that a pod readiness check can be made.
Use a watch to detect invalid pod status updates in graceful node
shutdown node e2e test. By using a watch, all pod updates will be
captured while the previous logic required polling the api-server which
could miss some intermediate updates.
Signed-off-by: David Porter <david@porter.me>
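Roughly, the watch-based observation looks like this (illustrative sketch, not the node e2e helper):
```
package example

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// watchPodUpdates prints every status update for one pod. Unlike polling,
// the watch delivers all updates, so short-lived intermediate states are
// not missed.
func watchPodUpdates(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	w, err := client.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + name,
	})
	if err != nil {
		return err
	}
	defer w.Stop()
	for event := range w.ResultChan() {
		pod, ok := event.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		fmt.Printf("pod %s phase=%s reason=%s\n", pod.Name, pod.Status.Phase, pod.Status.Reason)
	}
	return nil
}
```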
This is useful in environments where the Deployment image is replaced
by another image from an internal registry (via fixture). In that case,
the populator running in populate mode should use the same image as the
populator running in controller mode.
The test is about SCTP and the accessed service only forwarded SCTP
traffic to the server Pod but the client Pod used TCP protocol, so the
test traffic never reached the server Pod and the test NetworkPolicy
was never enforced, which lead to test success even if the default-deny
policy was implemented wrongly. In some cases it may got failure result
if there was an external server having same IP as the cluster IP and
listening to TCP 80 port.
Signed-off-by: Quan Tian <qtian@vmware.com>
The advantage is that the extra error information is guaranteed to be printed
directly before the failure and we avoid one extra log line that would have to
be correlated with the failure.
The failure message from Gomega was hard to read because explanation and error
text were separated by the error dump. In many cases, that error dump doesn't
add any relevant information.
Now the error is dumped first as info message (just in case that it is
relevant) and then a shorter failure message is created from explanation and
error text.
The test validates the following endpoints
- deleteApiregistrationV1CollectionAPIService
- patchApiregistrationV1APIServiceStatus
- replaceApiregistrationV1APIService
- replaceApiregistrationV1APIServiceStatus
- don't require clusterconfiguration.kubernetesVersion
- use the helper functions GetConfigMap, ExpectRole, ExpectRoleBinding
(these were used before UnversionedKubeletConfig went Beta)
The updated klog provides a reusable test suite for output handling.
Using it increases our test coverage without having to copy the test cases from
there into some JSON specific test suite.
The tests in nodes_test.go check if the Node objects
in a kubeadm cluster are annotated with a CRI socket
path. It is used by kubeadm to store a CRI socket per node.
Add a new test condition to verify if the CRI socket path
is prefixed with URL scheme "unix://".
Remove namespace from manifest to fix the error: the namespace from
the provided object "my-ns" does not match the namespace
"kubectl-8939". You must pass '--namespace=my-ns' to perform this
operation.
Since ClientCAs are provided by "client-ca::kube-system::extension-apiserver-authentication::client-ca-file" controller
we need to wait until it picks up the configmap (via a lister) before checking the CAs, otherwise the response might contain an empty result.
the job controller used by the tests must wait for the caches to sync;
since the tests don't check /readyz there is no way
the tests can tell it is safe to call the server and that requests won't be rejected
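The general pattern, sketched with generic informer wiring (names are illustrative, not the integration test code):
```
package example

import (
	"context"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

func startInformersForTest(ctx context.Context, client kubernetes.Interface) informers.SharedInformerFactory {
	factory := informers.NewSharedInformerFactory(client, 0)
	// ... the controller under test registers its informers here ...
	factory.Start(ctx.Done())
	// Block until every registered informer finished its initial list, so the
	// controller does not act on an empty cache.
	factory.WaitForCacheSync(ctx.Done())
	return factory
}
```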
- iniconfiguration.go: stop applying the "master" taint
for new clusters; update related unit tests in _test.go
- apply.go: Remove logic related to cleanup of the "master" label
during upgrade
- apply.go: Add cleanup of the "master" taint on CP nodes
during upgrade
- controlplane_nodes_test.go: remove test for old "master" taint
on nodes (this needs backport to 1.24, because we have a kubeadm
1.25 vs kubernetes test suite 1.24 e2e test)
Previously, the e2e test was overriding the plugins socket directory to
"/var/lib/kubelet/plugins_registry". This seems wrong, and with that
setting the e2e test was already failing, because the registration
process was timing out, in turn because the kubelet was trying to call
back the device plugin in the wrong place (see below for details).
I can't explain why it worked before - or if it worked at all - but
it really seems that `pluginapi.DevicePluginPath` is the right
setting here.
+++
In a nutshell, the device plugin registration process works like this:
1. The kubelet runs and creates the device plugin socket registration
endpoint:
KubeletSocket = DevicePluginPath + "kubelet.sock"
DevicePluginPath = "/var/lib/kubelet/device-plugins/"
2. Each device plugin will listen on an ENDPOINT the kubelet will connect
back to. In other words, the kubelet will act like a client to each device plugin,
to perform allocation requests (and more).
Each device plugin will serve from an endpoint.
The endpoint name is plugin-specific, but they all must be inside a
well-known directory: pluginapi.DevicePluginPath
3. The kubelet creates the device plugin pod, like any other pod
4. During the startup, each device plugin wants to register itself in the
kubelet. So it sends a request through
the registration endpoint. Key details:
grpc.Dial(kubelet registration socket)
registration request
reqt := &pluginapi.RegisterRequest{
Version: pluginapi.Version,
Endpoint: endpointSocket, <- socket relative to pluginapi.DevicePluginPath
ResourceName: resourceName, <- resource name to be exposed
}
5. While handling the registration request, the kubelet dials back the
device plugin on socketDir + req.Endpoint.
But socketDir is hardcoded in the device manager code to
pluginapi.KubeletSocket
Signed-off-by: Francesco Romani <fromani@redhat.com>
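For reference, step 4 above sketched in Go (a hedged sketch using the v1beta1 device plugin API; dialer options and error handling are illustrative):
```
package example

import (
	"context"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// register dials the kubelet registration socket and announces this plugin's
// endpoint (relative to pluginapi.DevicePluginPath) and resource name.
func register(ctx context.Context, endpointSocket, resourceName string) error {
	conn, err := grpc.Dial(pluginapi.KubeletSocket,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			// The registration socket is a unix domain socket.
			return (&net.Dialer{Timeout: 5 * time.Second}).DialContext(ctx, "unix", addr)
		}))
	if err != nil {
		return err
	}
	defer conn.Close()
	// The kubelet later dials back DevicePluginPath + Endpoint, which is why
	// the plugin must serve from that well-known directory.
	_, err = pluginapi.NewRegistrationClient(conn).Register(ctx, &pluginapi.RegisterRequest{
		Version:      pluginapi.Version,
		Endpoint:     endpointSocket, // e.g. "my-plugin.sock" (hypothetical)
		ResourceName: resourceName,
	})
	return err
}
```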
In the AfterEach check of the e2e node device plugin tests,
the tests want very badly to clean up after themselves:
- delete the sample device plugin
- restart again the kubelet
- ensure that after the restart, no stale sample devices
(provided by the sample device plugin) are reported anymore.
We observed that in the AfterEach block of these e2e tests
we quite reliably see a flip/flop of the kubelet readiness
state, possibly related to a race with a slow runtime/PLEG check.
What happens is that the kubelet readiness state is true,
but goes false for a quick interval and then goes true again
and it's pretty stable after that (observed by adding more logs
to the check loop).
The key factor here is the function `getLocalNode` aborts the
test (as in `framework.ExpectNoError`) if the node state is
not ready. So any occurrence of this scenario, even if it
is transient, will cause a test failure. I believe this will
make the e2e test unnecessarily fragile without making it more
correct.
For the purpose of the test we can tolerate this kind of glitches,
with the kubelet flip/flopping the ready state, provided that we eventually
meet the final desired condition in which the node reports
ready AND reports no sample devices present - which was the condition
the code was trying to check.
So, we add a variant of `getLocalNode`, which just fetches the
node object the e2e_node framework created, alongside a flag
reporting the node readiness. The new helper does not implicitly
make the test abort if the node is not ready; it just bubbles
up this information.
Signed-off-by: Francesco Romani <fromani@redhat.com>