Node E2E tests do not run a scheduler, so the host exec pod must have
the `spec.nodeName` set explicitly.
Signed-off-by: David Porter <david@porter.me>
The ClusterIP allocator tries to reserve one part of the ServiceCIDR
to allocate static IPs to Services.
The allocator's heuristic for obtaining the offset took into account
the whole range size, not the IPs actually available in the range; the
subnet address and the broadcast address are not available for IPv4.
As a result, for CIDRs with 16 hosts (/28 for IPv4 and /124 for IPv6),
the calculated offset was higher than the maximum number of available
addresses on the allocator, causing it to panic.
Change-Id: I6c6f527b0a600b3612be37769e405b8fb3dd33a8
There are two runtime class tests which require the container runtime
config to include explicit configuration for `test-handler`. The current
logic skips these tests in non-GCE environments. This skip is too strict,
since the tests are skipped in node e2e environments and in other
environments such as kind, which support running the tests and also
configure `test-handler`.
Instead of skipping based on provider, add a new function
`NodeSupportsPreconfiguredRuntimeClassHandler` which examines the
underlying container runtime config and checks whether the config includes
`test-handler`. The check is a bit brittle since it assumes container
runtime config paths, but it is a net improvement over skipping the tests
entirely in non-GCE environments.
This results in the test working in the common test environments, namely
GCE kube-up, node e2e, and kind.
Signed-off-by: David Porter <david@porter.me>
When the e2e_node/checkpoint_container.go test was introduced, no CRI
implementation supported the new CheckpointContainer RPC yet.
With the release of CRI-O 1.25, CheckpointContainer is implemented,
and the test has been extended to verify that the content of the
checkpoint is as expected.
The test is skipped if the ContainerCheckpoint feature gate is disabled
or if the CRI implementation does not support the CheckpointContainer
RPC.
Signed-off-by: Adrian Reber <areber@redhat.com>
We are saving this information in the env variable `KUBE_INTEGRATION_ETCD_URL`,
so just pick it up from there when needed. Currently, when someone uses
framework.RunCustomEtcd directly, the global variable is *not* set and the
code that uses `GetEtcdURL` returns an empty string.
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
Since we need to gather kubelet metrics for CPU Manager and Topology
Manager, renaming this function to a more generic name.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
The GlogSetter method is used by three components to change verbosity at
runtime through HTTP APIs. This used to work only for text output with klog
calls, but not for text output through the klog logger or for JSON output.
Now loggers can also provide a callback for changing their verbosity at
runtime. Implementing that implies that the Create factory method has to be
extended, which is an API break for the Go package, but not an API break for
the configuration file and command line flags, which is what matters for the
"api/v1" component API.
The condition methods will eventually all take a context. Since we
have been provided one, alter the accepted condition type and
change the four references in tree.
Callers of ExponentialBackoffWithContext should use a context-aware
function (ConditionWithContextFunc). If the context can be
ignored, the helper ConditionFunc.WithContext can be used to convert
an existing function to the new type.
Using WaitTimeoutForPodRunningInNamespace followed by ExpectError was not very
precise (any error passed the check, not just the expected timeout) and
hard to read. Now the test's expectation is spelled out explicitly: the pod
must stay in pending.
These helper functions can be used in combination with
omega.Eventually/Consistently to implement polling of objects that is aware of
Kubernetes apiserver conventions:
- retry on certain errors instead of giving up,
with "not found" handling decided by the caller (may or may not
be fatal, depending on the test)
- sleep if requested by apiserver
The E2E framework contains several functions which only differ in how they get
name and namespace: from an API object (WaitForPodRunningInNamespace) or as
separate parameters (WaitTimeoutForPodRunningInNamespace).
NamespacedName and the NamedObject interface enable writing helper functions
that can be called with both an API object (like *v1.Pod, which implements
Object and thus NamedObject) and name+namespace string (via
NamespacedName).
The other advantage of NamespacedName is that the order of name and namespace
parameter cannot be mixed up.
NamespacedName was derived from k8s.io/apimachinery/pkg/types.NamespacedName. A
separate type in the framework package was chosen a) to avoid additional
imports in test code and b) because the interface might not be suitable for
k8s.io/apimachinery.
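A minimal sketch of the pattern; the describePod helper is hypothetical and
the framework definitions may differ in detail:

```
package example

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// NamedObject is satisfied both by API objects (via metav1.ObjectMeta)
// and by NamespacedName below.
type NamedObject interface {
	GetName() string
	GetNamespace() string
}

// NamespacedName allows passing name+namespace without an API object.
type NamespacedName struct {
	Namespace string
	Name      string
}

func (n NamespacedName) GetName() string      { return n.Name }
func (n NamespacedName) GetNamespace() string { return n.Namespace }

// describePod is a hypothetical helper that accepts either form.
func describePod(obj NamedObject) string {
	return fmt.Sprintf("%s/%s", obj.GetNamespace(), obj.GetName())
}

func usage() {
	pod := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "dns-test"}}
	_ = describePod(pod)                                                    // API object
	_ = describePod(NamespacedName{Namespace: "default", Name: "dns-test"}) // name+namespace
}
```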
* Improve the output of tests in case of error
* Better error messages; also, the condition in the second case was reversed
* Fix two tests whose condition was inverted
* Fix the conditions again, which were still wrong
* Improved error messages on failures
This significantly reduces the surface area of the fieldmanager package
by hiding all the private "managers" objects, as well as the interface
that was made specifically for these. There is no reason to configure
these.
Primarily this protects against accidentally polling with the default interval
of 10ms. Setting these defaults may also make some tests simpler because they
don't need to override the defaults.
Various different tests all have their own poll intervals. As a start towards
consolidating that, the interval from test/e2e/framework/pod (as one of the
most common cases for polling) is moved into the framework.
Changing other helper packages and tests needs to follow.
This consolidates timeout handling. In the future, configuration of all
timeouts via a configuration file might get added. For now, the same three
legacy command line flags for the timeouts that get moved continue to be
supported.
All usage of the builder pattern is convertible to cpuset.New()
with the same or fewer lines of code.
Migrate Builder.Add to a private method of CPUSet, with a comment
that it is only intended for internal use, to preserve the immutability
of the exported interface.
This also removes the 'require' library dependency, which avoids
non-standard library usage.
In 'set', conversions to a slice are also provided, but under different names:
ToSliceNoSort() -> UnsortedList()
ToSlice() -> List()
Reimplement List() in terms of UnsortedList to save some duplication.
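A before/after sketch, assuming the cpuset package exposes New, List and
UnsortedList as described above (the import path is an assumption):

```
package example

import (
	"k8s.io/utils/cpuset"
)

func example() ([]int, []int) {
	// Before (builder pattern):
	//   b := cpuset.NewBuilder()
	//   b.Add(0, 1, 2)
	//   s := b.Result()
	//
	// After:
	s := cpuset.New(0, 1, 2)

	// List() returns a sorted slice (was ToSlice()),
	// UnsortedList() an unsorted one (was ToSliceNoSort()).
	return s.List(), s.UnsortedList()
}
```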
Removes exit/fatal from the cpuset library.
The usage in the podresources test was not necessary.
The library reference in cpu_manager_test was moved to a local function and
converted to use the e2e test framework's error handling.
Before, RunPostFilterPlugins didn't distinguish between unschedulable and
unresolvable because we only had one postFilterPlugin by default. Now that we
have at least two, we should make sure that once a postFilterPlugin returns
unresolvable, we return directly.
Signed-off-by: Kante Yin <kerthcet@gmail.com>
If we were to add new fields in TimeoutContext, the current users of
NewFrameworkWithCustomTimeouts might run into failures unless they get modified
to also set those new fields. This is error-prone.
A better approach is to let users of NewFrameworkWithCustomTimeouts override
fields by setting just those and use the normal defaults for the others.
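A minimal sketch of the intended usage; the NewTimeoutContext constructor
and the PodStart field name are assumptions:

```
package example

import (
	"time"

	"k8s.io/kubernetes/test/e2e/framework"
)

func newSuiteFramework() *framework.Framework {
	// Start from the framework defaults and override only what this suite needs.
	timeouts := framework.NewTimeoutContext() // assumed default constructor
	timeouts.PodStart = 15 * time.Minute      // assumed field name

	return framework.NewFrameworkWithCustomTimeouts("my-suite", timeouts)
}
```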
Ginkgo relies on all workers defining all tests in exactly the same order. This
wasn't guaranteed for these tests, with the result that some tests might have
been executed more than once and others not at all when running in parallel.
This was noticed when some of these tests started to flake and then were
reported both as failure and success, as if they had been retried.
In v1.26.0, these tests only used the timeout context while waiting for a CSI
call. This restores that behavior, just in case that it is relevant. No test
flakes are known because of this.
The intent of the timeout handling (for the entire "It" and not just a few calls)
becomes more obvious and simpler when using ginkgo.NodeTimeout as a decorator.
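A minimal sketch of the decorator usage with Ginkgo v2; the waitForCSICall
helper is a placeholder:

```
package example

import (
	"context"
	"time"

	"github.com/onsi/ginkgo/v2"
)

// waitForCSICall is a placeholder for code that blocks until the CSI call is seen.
func waitForCSICall(ctx context.Context) error { return ctx.Err() }

var _ = ginkgo.It("waits for the CSI call", ginkgo.NodeTimeout(5*time.Minute), func(ctx context.Context) {
	// ctx is cancelled when the node timeout expires, so every blocking
	// call inside the spec returns promptly.
	if err := waitForCSICall(ctx); err != nil {
		ginkgo.Fail(err.Error())
	}
})
```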
It doesn't make sense for the E2E framework to have command line options that
don't do anything because then all test suites built with the framework inherit
those options.
For -list-images and -list-conformance-tests, the solution is to move the
implementation into the framework (-list-images) and to move the flag
into test/e2e (-list-conformance-tests), respectively.
The placement was decided based on the observation that image patching is
common functionality while conformance testing is specific to one test suite.
The "[sig-network] DNS HostNetwork should resolve DNS of partial qualified
names for services on hostNetwork pods with dnsPolicy:
ClusterFirstWithHostNet" test assumes that a service named "kube-dns"
exists in the "kube-system" namespace. This assumption is valid if the
cluster was configured using kubeadm, but the assumption may be invalid
otherwise.
As the test uses dnsPolicy: ClusterFirst (as opposed to dnsPolicy: None),
it does not need to specify the name server in dnsConfig. Omitting
dnsConfig.nameservers obviates the need to look up the service.
Follow-up to commit add4652352.
* test/e2e/network/dns.go: Don't look up or use the kube-dns cluster IP
address as it might not exist on clusters that were not configured using
kubeadm.
Bring back the number of test specs, which was dropped earlier.
It's now available in the `ReportBeforeSuite` reporting node by extracting
the number from report.PreRunStats.SpecsThatWillBeRun.
Signed-off-by: Dave Chen <dave.chen@arm.com>
The old tests were no longer passing with Ginkgo v2.5.0. Instead of keeping the
old approach of checking recorded spec results, now the tests actually cover
what we care about most: the results recorded in JUnit.
This also gets rid of having to repeat the stack backtrace twice (once as part
of the output, once for the separate backtrace field).
All information that we want will be written into the failure XML element's
data. We don't need the message tag and don't want it because our
tools (kettle, testgrid, spyglass) would then just concatenate the two strings.
This gets implemented for us by Ginkgo. However, truncating the failure message
is not supported there at the moment. It's unclear how important that is,
therefore this recently added feature gets removed.
The NodePort functionality can be tested within the cluster.
Testing from outside the cluster assumes that there is connectivity
between the e2e.test binary and the cluster under test. That is not
always true, and in some cases the test is exposed to external factors
or misconfigurations, like wrong routes or firewall rules, that impact
the test.
Change-Id: Ie2fc8929723e80273c0933dbaeb6a42729c819d0
* Wire generic context to better handle timeout
* Add integration test for wait timeout
* kubectl wait: Fix integration test always passing issue
Currently, the `kubectl wait` integration test always passes even if
it gets an error. The problem is that the object check is done after
errexit is turned off.
This PR redirects the error to the output and verifies that the
object is in the expected status; if it is not, the test fails.
The background goroutine was started with the context from ginkgo.BeforeEach,
which then led to "context canceled" errors. While at it, the entire goroutine
start/stop gets moved into the BeforeEach and simplified.
These tests create their own pods; however, they were inside a block
that always created an additional pod that was not used later in the tests,
but could influence the conditions asserted for the tests to succeed.
Change-Id: I3bb9a0f123fb0766d75934ef8e197f92e3f5f3b8
Using the ctx of the ginkgo.BeforeEach in callbacks that are invoked after the
BeforeEach is done causes "context canceled" errors. Previously, this code used
context.TODO(). The best solution is to create a new context and cancel it
during test cleanup, then that context can be used for the API calls and as
stop channel.
After adding error checking in df5d84ae81, the "[sig-cli] Kubectl client
Simple pod should return command exit codes [Slow] running a failing command
without --restart=Never, but with --rm" test was found to time out.
Doubling the timeout might help. Alternatively, the entire
WaitForPodToDisappear could get removed, which would make this scenario similar
to the others which also don't wait.
The test uses a BeforeEach block to create a pod with the name defined
by the simplePodName variable; however, some tests use the value of
the variable directly.
To avoid future problems, use the variable name instead of the value in
all tests.
Change-Id: I21a01019d91fe5ae7e35566184420001978ce355
All code must use the context from Ginkgo when doing API calls or polling for a
change, otherwise the code would not return immediately when the test gets
aborted.
* Add tracker types and tests
* Modify ResourceEventHandler interface's OnAdd member
* Add additional ResourceEventHandlerDetailedFuncs struct
* Fix SharedInformer to let users track HasSynced for their handlers
* Fix in-tree controllers which weren't computing HasSynced correctly
* Deprecate the cache.Pop function
The Feature:SCTPConnectivity tests cannot run at the same time as the
"X doesn't cause sctp.ko to be loaded" tests, since they may cause
sctp.ko to be loaded. We had dealt with this in the past by marking
them [Disruptive], but this isn't really fair; the problem is more
with the sctp.ko-checking tests than it is with the SCTPConnectivity
tests. So make them not [Disruptive] and instead make the
sctp.ko-checking tests be [Serial].
There were two SCTP tests grouped together in
test/e2e/network/service.go, but one of them wasn't a service test...
so move the SCTP service test to be grouped with the other service
tests, and the SCTP hostport tests to be grouped with other
non-service tests.
The SCTP HostPort test was checking that creating a pod with an SCTP
HostPort would create a certain iptables rule, but the handling of
HostPorts is now up to CRI, not kubelet, so kubernetes e2e cannot
assume it will implement the feature in any specific way.
(The test still ensures that (a) the apiserver accepts SCTP HostPorts,
and (b) neither kubelet nor the runtime causes the SCTP kernel module
to be loaded as part of creating a pod with an SCTP HostPort.)
We had a test that creating a Service with an SCTP port would create
an iptables rule with "-p sctp" in it, which let us test that
kube-proxy was doing vaguely the right thing with SCTP even if the e2e
environment didn't have SCTP support. But this would really make much
more sense as a unit test.
Windows ComputerNames cannot exceed 15 characters. This causes a few tests to fail
when the node names exceed that limit. Additionally, the checks should be case
insensitive.
The Ginkgo v2 time suffix is hh:mm:ss without the .xyz sub-second details if the
time stamp happens to land exactly on a second.
This change fixes test flakes like the following:
-STEP: Building a namespace api object, basename test-namespace
+STEP: Building a namespace api object, basename test-namespace 12/13/22 11:43:53
--- FAIL: TestCleanup (36.79s)
ginkgo.DeferCleanup has multiple advantages:
- The cleanup operation can get registered if and only if needed.
- No need to return a cleanup function that the caller must invoke.
- Automatically determines whether a context is needed, which will
simplify the introduction of context parameters.
- Ginkgo's timeline shows when it executes the cleanup operation.
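A minimal sketch of the pattern; the client and namespace variables are
placeholders:

```
package example

import (
	"context"

	"github.com/onsi/ginkgo/v2"
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func createTestPod(ctx context.Context, client kubernetes.Interface, ns string, pod *v1.Pod) *v1.Pod {
	created, err := client.CoreV1().Pods(ns).Create(ctx, pod, metav1.CreateOptions{})
	if err != nil {
		ginkgo.Fail(err.Error())
	}
	// Registered only because creation succeeded. Ginkgo runs this during
	// the cleanup phase of the timeline and injects a fresh context.
	ginkgo.DeferCleanup(func(ctx context.Context) {
		// Errors are ignored in this sketch.
		_ = client.CoreV1().Pods(ns).Delete(ctx, created.Name, metav1.DeleteOptions{})
	})
	return created
}
```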
- use `ginkgo.DeferCleanup` instead of clean up in the AfterEach block
- encourage use of ginkgo by not extending expect.go
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Add a test case with a DaemonSet behind a simple load balancer whose
address is being constantly hit via HTTP requests.
The test passes if there are no errors when doing HTTP requests to the
load balancer address, during DaemonSet `RollingUpdate` operations.
Signed-off-by: Ionut Balutoiu <ibalutoiu@cloudbasesolutions.com>
Some node e2e tests check for expected number of pods running
on the node to verify the correct state of that node after running
test scenarios. An example of such a check is in the device plugin
end to end test here: [1].
If the node is not left in a clean state after an e2e test finishes
running, it can lead to flaky tests because the node might have
unexpected pods running on the node.
In order to avoid that, we make sure that the test pods are
cleaned up after the test runs.
[1]: https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/device_plugin_test.go#L189-L190
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
wait.Until catches panics and logs them, which leads to confusing
output. Besides, the test is written so that failures must get reported to the
main goroutine.
Looking up the expected nodes in the goroutine raced with the test making
changes to the configuration. When doing (unrelated?) changes, the test started
to fail:
Oct 23 15:47:03.092: INFO: Unexpected error:
<*errors.errorString | 0xc001154c70>: {
s: "no subset of available IP address found for the endpoint test-rolling-update-with-lb within timeout 2m0s",
}
Oct 23 15:47:03.092: FAIL: no subset of available IP address found for the endpoint test-rolling-update-with-lb within timeout 2m0s
Now that everything is connected to a per-test context, the gRPC server might
encounter an error before it gets shut down normally. We must not panic in that
case because it would kill the entire Ginkgo worker process. This is not even
an error, so just log it as info message.
The wrapper can be used in combination with ginkgo.DeferCleanup to ignore
harmless "not found" errors during delete operations.
Original code suggested by Onsi Fakhouri.
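A hedged sketch of such a wrapper; the actual framework helper may be more
generic than this typed version:

```
package example

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type deleteFunc func(ctx context.Context, name string, opts metav1.DeleteOptions) error

// ignoreNotFound wraps a typed delete call so that "not found" errors,
// which are harmless during cleanup, are swallowed.
func ignoreNotFound(del deleteFunc) deleteFunc {
	return func(ctx context.Context, name string, opts metav1.DeleteOptions) error {
		if err := del(ctx, name, opts); err != nil && !apierrors.IsNotFound(err) {
			return err
		}
		return nil
	}
}
```

With this, a cleanup registered via ginkgo.DeferCleanup does not fail just
because the test already deleted the object itself.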
It is set in all of the test/e2e* suites, but not in the ginkgo output
tests. This check is needed before adding a test case there which would trigger
this nil pointer access.
Adding the "context" import in the previous commit must get compensated by
removing one of the blank lines in the output unit tests, otherwise the stack
backtrace don't match expectations.
Adding "ctx" as parameter in the previous commit led to some linter errors
about code that overwrites "ctx" without using it.
This gets fixed by replacing context.Background or context.TODO in those code
lines with the new ctx parameter.
Two context.WithCancel calls can get removed completely because the context
automatically gets cancelled by Ginkgo when the test returns.
Every ginkgo callback should return immediately when a timeout occurs or the
test run manually gets aborted with CTRL-C. To do that, they must take a ctx
parameter and pass it through to all code which might block.
This is a first automated step towards that: the additional parameter got added
with
sed -i 's/\(framework.ConformanceIt\|ginkgo.It\)\(.*\)func() {$/\1\2func(ctx context.Context) {/' \
$(git grep -l -e framework.ConformanceIt -e ginkgo.It )
$GOPATH/bin/goimports -w $(git status | grep modified: | sed -e 's/.* //')
log_test.go was left unchanged.
Endpoints generated by the endpoints controller are in the canonical
form; however, custom endpoints may not be in canonical format
(there was a time when they were canonicalized in the apiserver, but this
caused performance issues because the endpoint controller kept
updating them, since the created endpoints were different from the
stored ones due to the canonicalization).
There are cases where a custom endpoint may generate multiple slices
in the controller, for example, when the same address is present
in different subsets.
The endpointslice mirroring controller should canonicalize the
endpoint subsets before it starts processing them, to be consistent
in the slices generated; there is no risk of hot-looping because
the endpoint is only used as input.
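A simplified sketch of the kind of canonicalization meant here, sorting
addresses and ports within each subset (the real helper also merges and
repacks subsets):

```
package example

import (
	"sort"

	v1 "k8s.io/api/core/v1"
)

// canonicalizeSubsets sorts addresses and ports in place so that two
// semantically equal Endpoints inputs produce the same mirrored slices.
func canonicalizeSubsets(subsets []v1.EndpointSubset) []v1.EndpointSubset {
	for i := range subsets {
		sort.Slice(subsets[i].Addresses, func(a, b int) bool {
			return subsets[i].Addresses[a].IP < subsets[i].Addresses[b].IP
		})
		sort.Slice(subsets[i].NotReadyAddresses, func(a, b int) bool {
			return subsets[i].NotReadyAddresses[a].IP < subsets[i].NotReadyAddresses[b].IP
		})
		sort.Slice(subsets[i].Ports, func(a, b int) bool {
			return subsets[i].Ports[a].Name < subsets[i].Ports[b].Name
		})
	}
	return subsets
}
```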
Change-Id: I2a8cd53c658a640aea559a88ce33e857fa98cc5c
This ensures that the daemonset controller updates daemonset statuses in
a best-effort manner even if syncDaemonSet fails.
In order to add an integration test, this also replaces
`cmd/kube-apiserver/app/testing.StartTestServer` with
`test/integration/framework.StartTestServer` and adds
`setupWithServerSetup` to configure the admission control of the
apiserver.
Currently, if the user executes `kubectl scale --dry-run`, the output has no
indicator showing that the change is not actually applied.
This PR adds a dry run suffix to the output, as well as more integration
tests to verify it.
`kubectl scale` calls the visitor two times. The second call fails when
piped input is passed, returning an
`error: no objects passed to scale` error.
This PR uses the result of the first visitor and fixes that piped
input problem. In addition, this PR also adds a new
scale test to verify the fix.
The `kubectl exec` command supports getting files as inputs. However,
if the file contains multiple resources, it returns an unclear error message:
`cannot attach to *v1.List: selector for *v1.List not implemented`.
Since the `exec` command does not support multiple resources, this PR
handles that case and returns a descriptive error message earlier.
One of the cpumanager tests doesn't remove the pod
that got created during the test.
This causes pollution of other tests and failures
from time to time (depending on the test execution order).
In order to deflake the tests, we should delete the pod
and wait for it to be completely removed.
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
This introduces `singularNameProvider`. This provider will be used
by core types to have their singular names defined in the discovery
endpoint. Thanks to that, core resources' singular names always have
higher precedence than CRD shortcuts or singular names.
This adds new integration tests to verify that shortnames and
singular names expand to the correct resources. In this case, core
types always have higher precedence than CRDs.
This change will leverage the new PreFilterResult
to reduce the list of eligible nodes for a pod
using bound local PVs during the PreFilter stage, so
that only the node(s) whose local PV node affinity
matches will be considered in subsequent scheduling
stages.
Today, the NodeAffinity check is done during Filter
which means all nodes will be considered even though
there may be a large number of nodes that are not
eligible due to not matching the pod's bound local
PV(s)' node affinity requirement. Here we can
reduce down the node list in PreFilter to ensure that
during Filter we are only considering the reduced
list and thus can provide a more clear message to
users when node(s) are not available for scheduling
since the list only contains relevant nodes.
If error is encountered (e.g. PV cache read error) or
if node list reduction cannot be done (e.g. pod uses
no local PVs), then we will still proceed to consider
all nodes for the rest of scheduling stages.
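A hedged sketch of the node-list reduction idea; the pvMatchesNodeAffinity
helper is hypothetical, and the real code lives in the volumebinding plugin,
which returns the resulting set via PreFilterResult:

```
package example

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/sets"
)

// pvMatchesNodeAffinity is a hypothetical stand-in for the real affinity check.
func pvMatchesNodeAffinity(pv *v1.PersistentVolume, node *v1.Node) bool { return true }

// eligibleNodesForLocalPVs intersects, per bound local PV, the set of nodes
// whose affinity matches. A nil result means "no reduction, consider all nodes".
func eligibleNodesForLocalPVs(boundLocalPVs []*v1.PersistentVolume, allNodes []*v1.Node) sets.String {
	var eligible sets.String
	for _, pv := range boundLocalPVs {
		matching := sets.NewString()
		for _, node := range allNodes {
			if pvMatchesNodeAffinity(pv, node) {
				matching.Insert(node.Name)
			}
		}
		if eligible == nil {
			eligible = matching
		} else {
			eligible = eligible.Intersection(matching)
		}
	}
	return eligible
}
```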
Signed-off-by: Yibo Zhuang <yibzhuang@gmail.com>
In the Dynamic Resource allocation example specs, the claim
parameter name specified was inconsistent.
This commit fixes that with a better/more consistent name,
which is used to define the configmap and referenced in
the `ResourceClaimTemplate` spec.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
A recent PR [1] updated the image versions we use for E2E tests. However, the ``windows-nanoserver`` image is meant to be in a private authenticated registry: ``gcr.io/authenticated-image-pulling/windows-nanoserver``, which requires credentials to pull images from it. This image is required by the ``[sig-node] Container Runtime blackbox test when running a container with a new image should be able to pull from private registry with secret [NodeConformance]`` test for Windows. The ``v3`` image does not exist, there's no automatic promotion process for that registry. Previously, it was built and pushed manually.
Because of this, the https://testgrid.k8s.io/sig-windows-signal#capz-windows-containerd-master jobs have started to fail.
Reverts the image version to ``v1``.
[1] https://github.com/kubernetes/kubernetes/pull/113900
Many clusters block direct requests from internal resources to the nodes'
external IPs as a best practice. All accesses from internal resources that
want to reach resources running on nodes go through load balancers, whether
the nodes are on private or public subnets. Let's prefer internal IPs
first, so the tests can work even when there are security group rules
present that block requests to the external IPs.
We should not require ExternalIP for Conformance, but should keep
testing ExternalIPs in sig network.
Signed-off-by: Rafael Fonseca <r4f4rfs@gmail.com>
These instructions bring up a kind cluster with containerd 34d078e99, the
latest commit from the main branch. This version of containerd has
support for CDI.
The driver can be used manually against a cluster started with
local-up-cluster.sh and is also used for E2E testing. Because the tests proxy
connections from the nodes into the e2e.test binary and create/delete files via
the equivalent of "kubectl exec dd/rm", they can be run against arbitrary
clusters. Each test gets its own driver instance and resource class, therefore
they can run in parallel.
Add volumePath parameter to all disruptive checks, so subpath tests can use
"/test-volume" and disruptive tests can use "/mnt/volume1" for their
respective Pods.
This adds a new resource.k8s.io API group with v1alpha1 as version. It contains
four new types: resource.ResourceClaim, resource.ResourceClass, resource.ResourceClaimTemplate, and
resource.PodScheduling.
This removes WaitTimeoutForPodNoLongerRunningOrNotFoundInNamespace
introduced in f2b9479f8e and changes
the test to use goroutines to speed up the cleanups.
Most CI jobs run an OS that does not support SELinux, therefore tests that
need it should be skipped by default.
* [Feature:SELinux] marks tests that need SELinux (for any feature)
* [Feature:SELinuxMountReadWriteOncePod] marks tests that need
SELinuxMountReadWriteOncePod alpha gate enabled.
Currently, all SELinux tests have both, but it will change in the future.
Also make some design changes exposed in testing and review.
Do not remove the ambiguous old metric
`apiserver_flowcontrol_request_concurrency_limit` because reviewers
thought it was too early. This creates a problem: that metric cannot
keep both of its old meanings. I chose the configured concurrency
limit.
Testing has revealed a design flaw, which concerns the initialization
of the seat demand state tracking. The current design in the KEP is
as follows.
> Adjustment is also done on configuration change … For a newly
> introduced priority level, we set HighSeatDemand, AvgSeatDemand, and
> SmoothSeatDemand to NominalCL-LendableSD/2 and StDevSeatDemand to
> zero.
But this does not work out well at server startup. As part of its
construction, the APF controller does a configuration change with zero
objects read, to initialize its request-handling state. As always,
the two mandatory priority levels are implicitly added whenever they
are not read. So this initial reconfig has one non-exempt priority
level, the mandatory one called catch-all --- and it gets its
SmoothSeatDemand initialized to the whole server concurrency limit.
From there it decays slowly, as per the regular design. So for a
fairly long time, it appears to have a high demand and competes
strongly with the other priority levels. Its Target is higher than
all the others, once they start to show up. It properly gets a low
NominalCL once other levels show up, which actually makes it compete
harder for borrowing: it has an exceptionally high Target and a rather
low NominalCL.
I have considered the following fix. The idea is that the designed
initialization is not appropriate before all the default objects are
read. So the fix is to have a mode bit in the controller. In the
initial state, those seat demand tracking variables are set to zero.
Once the config-producing controller detects that all the default
objects are pre-existing, it flips the mode bit. In the later mode,
the seat demand tracking variables are initialized as originally
designed.
However, that still gives preferential treatment to the default
PriorityLevelConfiguration objects, over any that may be added later.
So I have made a universal and simpler fix: always initialize those
seat demand tracking variables to zero. Even if a lot of load shows
up quickly, remember that adjustments are frequent (every 10 sec) and
the very next one will fully respond to that load.
Also: revise logging logic, to log at numerically lower V level when
there is a change.
Also: bug fix in float64close.
Also: separate imports in some files
Co-authored-by: Han Kang <hankang@google.com>
This change enables hot reload of the encryption config file when the api server
flag --encryption-provider-config-automatic-reload is set to true. This
allows the user to change the encryption config file without restarting
kube-apiserver. The change is detected by polling the file and by using an
fsnotify watcher. When the file is updated, it is processed to generate a
new set of transformers and close the old ones.
Signed-off-by: Nilekh Chaudhari <1626598+nilekhc@users.noreply.github.com>
This change adds a flag --encryption-provider-config-automatic-reload
which will be used to drive automatic reloading of the encryption
config at runtime. While this flag is set to true, or when KMS v2
plugins are used without KMS v1 plugins, the /healthz endpoints
associated with said plugins are collapsed into a single endpoint at
/healthz/kms-providers - in this state, it is not possible to
configure exclusions for specific KMS providers while including the
remaining ones - ex: using /readyz?exclude=kms-provider-1 to exclude
a particular KMS is not possible. This single healthz check handles
checking all configured KMS providers. When reloading is enabled
but no KMS providers are configured, it is a no-op.
k8s.io/apiserver does not support dynamic addition and removal of
healthz checks at runtime. Reloading will instead have a single
static healthz check and swap the underlying implementation at
runtime when a config change occurs.
Signed-off-by: Monis Khan <mok@microsoft.com>
so that it explicitly describes that group information defined in the
container image will be kept. This also adds an e2e test case for
SupplementalGroups with pre-defined groups in the container
image to make the behavior clearer.
- New API field .spec.schedulingGates
- Validation and drop disabled fields
- Disallow binding a Pod carrying non-nil schedulingGates
- Disallow creating a Pod with non-nil nodeName and non-nil schedulingGates
- Adds a {type:PodScheduled, reason:WaitingForGates} condition if necessary
- New literal SchedulingGated in the STATUS column of `k get pod`
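A minimal sketch of a gated Pod using the new field; the gate name and image
are placeholders:

```
package example

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func gatedPod() *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "gated-pod"},
		Spec: v1.PodSpec{
			// The Pod stays unscheduled (STATUS SchedulingGated) until all
			// gates have been removed by their owning controllers.
			SchedulingGates: []v1.PodSchedulingGate{{Name: "example.com/provisioning"}},
			Containers: []v1.Container{{
				Name:  "app",
				Image: "registry.k8s.io/pause:3.9",
			}},
		},
	}
}
```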
In the PR https://github.com/kubernetes/kubernetes/pull/86139, two more lifecycle hook tests (poststart / prestop)
were added using HTTPS. They are similar to the existing HTTP tests.
However, this causes failures on Windows due to how networking
works there. We previously fixed this in the HTTP tests via f9e4a015e2.
This commit applies the same fix to the lifecycle hook HTTPS tests.
Signed-off-by: Ionut Balutoiu <ibalutoiu@cloudbasesolutions.com>
The cloud-provider and the e2e test were racing on deleting the
cloud resources.
Also, the cloud-provider should not leave orphan resources; those will
be detected by the job and cause it to fail, so we should not have
additional cleanup logic masking these errors.
Removed the unit tests that test the cases when the MixedProtocolLBService feature flag was false - the feature flag is locked to true with GA
Added an integration test to test whether the API server accepts an LB Service with different protocols.
Added an e2e test to test whether a service which is exposed by a multi-protocol LB Service is accessible via both ports.
Removed the conditional validation that compared the new and the old Service definitions during an update - the feature flag is locked to true with GA.
The original intention (adding more information for later analysis)
is probably obsolete because there is no code which does anything
with the extended error.
The code in upgrade_suite.go collected it in an in-memory JUnit report, but
then didn't do anything with that field. The code also wouldn't work for
failures detected by Ginkgo itself, like the upcoming timeout handling. If the
upgrade suite needs the information, it probably should get it from Ginkgo with
a ReportAfterSuite call instead of depending on some fragile interception
mechanism.
Tests scheduler enforcement of the ReadWriteOncePod PVC access mode.
- Creates a pod using a PVC with ReadWriteOncePod
- Creates a second pod using the same PVC
- Observes the second pod fails to schedule because PVC is in-use
- Deletes the first pod
- Observes the second pod successfully schedules
Some of our API types contain fields that get rendered very poorly by
gomega.format.Object because they contain lots of internal information, for
example CreationTimestamp. As a result, dumping full API object typically gets
truncated.
What we want is a representation that is a) multi-line (in contrast to the
stringer implemented by our types) and b) drops empty fields where it
was defined that this is okay.
The normal YAML representation fits that requirement. We just need to teach
gomega how and when to do that. This cannot be done for each type through a
generated GomegaString method (lots of code, additional dependency in public
API on YAML encoder), but it can be done inside tests by adding a formatting
handler (new gomega feature).
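A hedged sketch of registering such a handler with gomega's format package,
rendering API objects as YAML via sigs.k8s.io/yaml:

```
package example

import (
	"github.com/onsi/gomega/format"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/yaml"
)

func init() {
	format.RegisterCustomFormatter(func(value interface{}) (string, bool) {
		obj, ok := value.(runtime.Object)
		if !ok {
			return "", false // not an API object, fall back to gomega's default
		}
		out, err := yaml.Marshal(obj)
		if err != nil {
			return "", false
		}
		// Multi-line output with empty (omitempty) fields dropped.
		return string(out), true
	})
}
```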
Our tooling cannot handle very long failure messages well:
- when unfolding a test in the spyglass UI, it fills the entire screen
- failure correlation for http://go.k8s.io/triage has resource constraints
We cannot enforce that all tests only produce short failure messages and even
if we could, depending on the test failure, including more information may be
useful to understand it.
To achieve both goals (summary for correlation and overview, all details
available when digging deeper), too longer failure messages now get truncated,
with the full message guaranteed to be captured in the test output.
"Too long" is arbitrarily chosen to be similar to the gomega.MaxLength because
that has been a limit for failure message size in the past.
When gomega.format exceeds the default size of 4000, it truncates and prints:
Gomega truncated this representation as it exceeds 'format.MaxLength'.
Consider having the object provide a custom 'GomegaStringer' representation
or adjust the parameters in Gomega's 'format' package.
Learn more here: https://onsi.github.io/gomega/#adjusting-output
These instructions don't help the user of the e2e.test binary unless we provide
a command line flag.
Commit 99e9096034 was only supposed to remove the
FailfWithOffset function, but it also changed the behavior by skipping one
additional stack frame. That makes no sense and is inconsistent with Fail, which
also logs the direct caller.
The Windows Server Core images are quite large (~2GB each), and pulling
it for multiple build jobs / E2E images is inefficient, especially if
have to build for multiple OS versions.
The windows-servercore-cache image is meant to simply cache the Windows files
we need from the Windows Server core images, so we can pull the small cache image
instead of the entire image. It is never meant to be a promotable image,
the version is not meant to be bumped.
The other images (e.g.: agnhost) rely on the version 1.0 images.
In the `should correctly account for terminated pods after restart`, the
test first creates a set of `restartNever` pods, followed by a set of
`restartAlways` pods. Both the `restartNever` and `restartAlways` pods
request an entire CPU. As a result, the `restartAlways` pods will not be
admitted, if the `restartNever` pods did not terminate yet.
Depending on the timing/how fast the pods terminate, the test can sometimes
pass and sometimes fail, which results in flakes. To de-flake the test, it
should wait until the `restartNever` pods enter a terminal `Succeeded`
phase, before creating the `restartAlways` pods.
To do this, generalize the function `waitForPods` to accept a pod
condition (`testutils.PodRunningReadyOrSucceeded`, or
`testutils.PodSucceeded`). Also introduce a new "Succeeded" pod
condition, so the test can explicitly wait until the pods enter the
Succeeded phase.
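A hedged sketch of such a condition function; the real helper lives in
test/utils:

```
package example

import (
	v1 "k8s.io/api/core/v1"
)

// podSucceeded reports whether the pod reached the terminal Succeeded phase.
// It matches the shape of the conditions accepted by the generalized
// waitForPods helper, which keeps polling until the condition holds for
// the expected number of pods.
func podSucceeded(p *v1.Pod) (bool, error) {
	return p.Status.Phase == v1.PodSucceeded, nil
}
```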
Signed-off-by: David Porter <david@porter.me>
Currently, when running node e2e it's not possible to use the ginkgo `--repeat`
flag to run the test suite multiple times. This is useful when debugging tests
and ensuring they are not flaky by re-running them several times. Currently if
using `--repeat` ginkgo flag, the 2nd run of the test will fail due to kubelet
not starting with message like:
```
Failed to start transient service unit: Unit kubelet-20221020T040841.service already exists.
```
This is because during the test startup, kubelet is started as a transient unit
file via `systemd-run`. The unit is started with the `--remain-after-exit` flag
to ensure that the unit will remain even if the kubelet is restarted. The test
suite currently uses the `systemctl kill` command to stop the kubelet. This works fine for
stopping the kubelet, but on the second run, when `systemd-run` is used to start the
systemd unit again, it will fail because the unit already exists. This is because
`systemctl kill` does not delete the systemd unit, it only sends a SIGTERM signal to it.
To fix this, add `unitName` as a field to the `server` struct. When the
kubelet server is constructed, set the unit name. As part of e2e test
termination, in `E2EServices.Stop()`, stop the kubelet systemd unit. By
stopping the kubelet systemd unit, systemd will delete the transient
unit, allowing it to be created and started again in a subsequent e2e run.
Signed-off-by: David Porter <david@porter.me>
The original intention was to address "frustration of end users running the e2e
suite is that they take a significant amount of time and it is difficult to
gauge progress".
But Ginkgo's output is different now than it was in Kubernetes 1.19. If users
want to see progress, then "ginkgo --progress" might provide enough
information.
Printing to os.Stdout doesn't work as intended anyway when output redirection
is enabled (the default for parallel runs) and causes these JSON snippets to
appear as "show stdout" for each failed test in a Prow job, which is
distracting.
Tests should accept a context from Ginkgo and pass it through to all functions
which may block for a longer period of time. In particular all Kubernetes API
calls through client-go should use that context. Then if a timeout occurs,
the test returns immediately because everything that it could block on will
return.
Cleanup code then needs to run in a separate Ginkgo node, typically
DeferCleanup, which ensures that it gets a separate context which has not timed
out yet.
Align the behavior of HTTP-based lifecycle handlers and HTTP-based
probers, converging on the probers implementation. This fixes multiple
deficiencies in the current implementation of lifecycle handlers
surrounding what functionality is available.
The functionality is gated by the features.ConsistentHTTPGetHandlers feature gate.
The device plugin test in https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-serial-containerd
has been flaky for a while now when it runs on the test infrastructure.
Locally running this test resulted in test passing without issues.
Based on the existing logs, it is not clear why the podresources
API endpoint is returning 3 pods rather than the expected
two pods (the device plugin pod and the test pod requesting
devices). For more clarity and debuggability on why an
additional pod seems to be appearing, we expose the output
from the podresources API endpoint.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
* add superuser fallback to authorizer
* change the order of authorizers
* change the order of authorizers
* remove the duplicate superuser authorizer
* add integration test for superuser permissions
e2e test validates the following 3 endpoints
- patchCoreV1NamespacedResourceQuotaStatus
- readCoreV1NamespacedResourceQuotaStatus
- replaceCoreV1ResourceQuotaForAllNamespacesStatus
This addresses a problem caused by
https://github.com/kubernetes/kubernetes/pull/112043: because the AfterEach
which invokes AllNodesReady always runs, including tests that skipped early,
those tests ran into a nil pointer access. This increased the size of log
files. The tests still worked.
Adds two tests for the enforcement of the ReadWriteOncePod
PersistentVolume access mode.
1. Tests that when two Pods are scheduled that reference the same
ReadWriteOncePod PVC, the latter-scheduled Pod will be marked
unschedulable because the PVC is in-use.
2. Tests that when two Pods are scheduled on the same node (setting
Pod.Spec.NodeName to bypass scheduling for the second Pod), the
latter Pod will fail to start because the PVC is already mounted on
the Node.
Included are changes to update the hostpath CSI driver to accept new CSI
access modes. Its sidecar containers are already at supported versions
for ReadWriteOncePod and don't need updating. The GCP PD CSI driver does
not yet support the new CSI access modes, but its sidecar containers are
at supported versions and so the feature will work.
To support ReadWriteOncePod, the following CSI sidecars must be updated
to these versions or greater:
- csi-provisioner:v3.0.0+
- csi-attacher:v3.3.0+
- csi-resizer:v1.3.0+
For more details, see:
https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/2485-read-write-once-pod-pv-access-mode/README.md
The reason for the issue is that the metrics were bumped before the
final job status update. In case the update failed the path was
repeated by the next syncJob leading to double-counting of the metrics.
The solution is to delay recording metrics and broadcasting events until
after the job status update succeeds.
This change updates the API server code to load the encryption
config once at start up instead of multiple times. Previously the
code would set up the storage transformers and the etcd healthz
checks in separate parse steps. This is problematic for KMS v2 key
ID based staleness checks which need to be able to assert that the
API server has a single view into the KMS plugin's current key ID.
Signed-off-by: Monis Khan <mok@microsoft.com>
The change made in https://github.com/kubernetes/kubernetes/pull/112644
resulted in an update to the rejection message. In the memory manager
node e2e test, we still checked against the old expected error message,
giving the impression that the pod succeeded to run even though it failed
as expected, mainly because the check wasn't performed correctly.
In this patch, we update to the correct rejection message to make sure
that the memory manager is no longer failing.
NOTE: This test is supposed to run on multi NUMA systems and if the
underlying node does not have multi NUMA nodes, the test is skipped
which is what happens in upstream test infrastructure as it is mainly
composed of single NUMA nodes. Because of this, this test failure
wasn't evident via testgrid.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>