This drops testfiles.ReadOrDie and updates testfiles.Exists to return an
error, forcing the caller to decide whether to call framework.Fail or do
something else.
It makes for a slightly less friendly API, but it also means the package is
decoupled from framework again, as per the comments at the top of the
file.
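A minimal sketch of the caller-side pattern this implies (the function names and signatures below are assumptions, not the exact testfiles API):
```go
package e2e

import (
	"k8s.io/kubernetes/test/e2e/framework"
	"k8s.io/kubernetes/test/e2e/framework/testfiles"
)

// loadTestFile illustrates the decision now left to the caller: Exists and
// Read (assumed signatures) return errors instead of failing the test
// themselves, so the test decides whether to call framework.Fail.
func loadTestFile(path string) []byte {
	found, err := testfiles.Exists(path) // assumed: (bool, error)
	if err != nil {
		framework.Fail(err.Error())
	}
	if !found {
		framework.Failf("test file %q does not exist", path)
	}
	data, err := testfiles.Read(path) // assumed replacement for ReadOrDie
	if err != nil {
		framework.Fail(err.Error())
	}
	return data
}
```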
Currently, when checking for unscheduled pods, an exception will be raised
if a pod is not scheduled and its status is unknown. This update modifies
the logic to include any pod without a NodeName in the set of unscheduled
pods returned.
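A sketch of the intended classification rule (the helper name below is hypothetical; the real function is GetPodsScheduled):
```go
package e2e

import v1 "k8s.io/api/core/v1"

// splitScheduledPods treats any pod without a NodeName as not scheduled,
// regardless of whether its PodScheduled condition is false or unknown.
func splitScheduledPods(pods []v1.Pod) (scheduled, notScheduled []v1.Pod) {
	for _, pod := range pods {
		if pod.Spec.NodeName == "" {
			notScheduled = append(notScheduled, pod)
			continue
		}
		scheduled = append(scheduled, pod)
	}
	return scheduled, notScheduled
}
```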
Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
If a reserve plugin's Reserve method returns an error, there could be
previously allocated resources from successfully completed reserve
plugins that must be unallocated by the corresponding Unreserve
operation. Since Unreserve operations are idempotent, this patch runs
the Unreserve operation of ALL reserve plugins when a Reserve operation
fails.
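A rough sketch of that behavior (placeholder types; the real plugin interface lives in the scheduler framework and returns a Status rather than an error):
```go
package sketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
)

// reservePlugin stands in for the scheduler framework's reserve plugin type.
type reservePlugin interface {
	Reserve(ctx context.Context, pod *v1.Pod, nodeName string) error
	Unreserve(ctx context.Context, pod *v1.Pod, nodeName string)
}

// runReservePlugins shows the pattern: on the first Reserve failure, run
// Unreserve for ALL reserve plugins, relying on Unreserve being idempotent
// to release whatever the earlier, successful plugins allocated.
func runReservePlugins(ctx context.Context, plugins []reservePlugin, pod *v1.Pod, nodeName string) error {
	for _, p := range plugins {
		if err := p.Reserve(ctx, pod, nodeName); err != nil {
			for _, q := range plugins {
				q.Unreserve(ctx, pod, nodeName)
			}
			return err
		}
	}
	return nil
}
```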
This runs much faster than before. This change removes all of the
async status output because all of the compute time is spent inside
go/packages, with no opportunity to update the status.
Adds testdata code to prove it fails when expected.
When the node scheduling tests were updated to use worker instead of master
nodes, the GetPodsScheduled function, which is tasked with getting all
scheduled and not-yet-scheduled pods, was inadvertently changed to ignore all
pods that have an empty NodeName before checking whether pods had been
scheduled or not. This updates the function to include pods without a
NodeName in the check for unscheduled pods.
Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
As its name indicates, DeprecatedMightBeMasterNode is deprecated.
In e2e metrics, the function was used to discover the master node name in
order to get metrics from the kube-scheduler and kube-controller-manager pods.
This makes e2e metrics get those metrics directly by looking up the pod
names, without calling DeprecatedMightBeMasterNode().
Ready schedulable nodes are being inserted into an uninitialized string
set, causing an "assignment to entry in nil map" panic in the underlying data
structure. This initializes the string set before attempting to insert
nodes.
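A minimal sketch of the fix (variable names illustrative):
```go
package e2e

import "k8s.io/apimachinery/pkg/util/sets"

// readyNodeNames builds the set with sets.NewString(); a zero-value
// `var names sets.String` is a nil map and Insert would panic with
// "assignment to entry in nil map".
func readyNodeNames(nodes []string) sets.String {
	names := sets.NewString()
	for _, n := range nodes {
		names.Insert(n)
	}
	return names
}
```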
Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
There were nits in invokeStaleDummyVMTestWithStoragePolicy(), such as:
- The error message was missing a necessary space
- IsVMPresent() can return an error, but the error was not handled
- IsVMPresent() returns true/false, but the code didn't use ExpectEqual(),
hurting readability
This fixes those things.
The namespace parameter "ns" of getScheduledAndUnscheduledPods() is
always metav1.NamespaceAll.
This removes the parameter from the function for cleanup.
Previously, separate interfaces were defined for Reserve and Unreserve
plugins. However, in nearly all cases, a plugin that allocates a
resource using Reserve will likely want to register itself for Unreserve
as well in order to free the allocated resource at the end of a failed
scheduling/binding cycle. Having separate plugins for Reserve and
Unreserve also adds unnecessary config toil. To that end, this patch
aims to merge the two plugins into a single interface called a
ReservePlugin that requires implementing both the Reserve and Unreserve
methods.
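The merged interface has roughly this shape (placeholder types below; the authoritative definition, using the framework's own CycleState and Status types, lives in the scheduler framework package):
```go
package sketch

import "context"

// Pod, CycleState and Status stand in for the real scheduler framework types.
type (
	Pod        struct{}
	CycleState struct{}
	Status     struct{}
)

// ReservePlugin now requires both methods: Reserve allocates, and Unreserve
// releases the allocation when a later step of the scheduling or binding
// cycle fails. Unreserve implementations must be idempotent.
type ReservePlugin interface {
	Reserve(ctx context.Context, state *CycleState, p *Pod, nodeName string) *Status
	Unreserve(ctx context.Context, state *CycleState, p *Pod, nodeName string)
}
```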
WaitForStableCluster() checks that all pods run on worker nodes, and the
function used to refer to master nodes to skip checking control plane
pods.
GetMasterAndWorkerNodes() was used for getting the master nodes, but the
implementation is not good because it uses DeprecatedMightBeMasterNode().
This makes WaitForStableCluster() refer to worker nodes directly to avoid
using GetMasterAndWorkerNodes().
As part of the work to remove racist language, this name change also improves the
semantics of this variable name, as it was not actually a list of permissible
images but rather a list of images that are required for e2e_node tests and
are to be pre-pulled so that they are available prior to running e2e tests.
Worth noting that this list of images is "union merged" with another list when
setting up e2e_node tests, and as such there is the possibility of overlap.
e2e/framework is a place to keep common functions for e2e tests, and
it is not a place to keep the e2e tests themselves. recreate_node.go is an
e2e test for node.
This moves recreate_node.go to e2e/node.
The node-kubelet-flaky e2e job that runs the
`Node Performance Testing [Serial] [Slow] [Flaky]` e2e tests has been
flaking because of inconsistencies in the CPU manager checkpoint file.
This seems to happen because the checkpoint file is deleted (which is
what needs to happen in order to change the CPU manager policy used
for these e2e tests) right after the e2e test asserts that a pod
does not exist anymore.
However, after a pod is deleted, the CPU manager may still be cleaning
up the resources used by the pod, which may result in the checkpoint file
being created again.
Whenever this happened, the kubelet would panic if we then tried to
change the CPU manager policy to "static" from "none" or
vice versa (this is done 4 times in these tests).
Signed-off-by: alejandrox1 <alarcj137@gmail.com>
The behavior for exec'ing a backgrounded command is not specified
with CRI so modify the test to run the command directly instead
of using exec.
Signed-off-by: Mrunal Patel <mpatel@redhat.com>
deflake current e2e test
"should be able to preserve UDP traffic when server pod cycles for a
NodePort service" and reorganize the code in the e2e framework
Signed-off-by: Antonio Ojea <antonio.ojea.garcia@gmail.com>
These changes allow setting the FQDN as the hostname of pods
that set the new PodSpec field setHostnameAsFQDN to true. The new PodSpec
field was added in a related PR.
This is PART 2 (last) of the changes to enable KEP #1797 and addresses #91036
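A sketch of a pod opting in via the new field (hostname, subdomain and image values are illustrative):
```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// fqdnPod sets the new PodSpec field so the pod's hostname becomes its FQDN
// (hostname.subdomain.namespace.svc.cluster-domain) instead of the short name.
func fqdnPod() *v1.Pod {
	setFQDN := true
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "fqdn-example"},
		Spec: v1.PodSpec{
			Hostname:          "example-host",
			Subdomain:         "example-subdomain",
			SetHostnameAsFQDN: &setFQDN,
			Containers: []v1.Container{{
				Name:  "main",
				Image: "busybox", // placeholder image
			}},
		},
	}
}
```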
Currently, the jessie-dnsutils image fails to build for arm64 arch with the following
error:
GPG error: http://archive.debian.org jessie Release: The following signatures were invalid: KEYEXPIRED 1587841717
We can bypass this issue by adding a --force-yes flag when installing the needed dnsutils.
Adds reviewers to the OWNERS files in the kubernetes/test/images folder.
The reviewers are added automatically, based on their contributions to
an image (>= 20% code churn).
Note that the code churn is taken into account for authors, and not committers.
Adds OWNERS files for: ipc-utils, node-perf, nonroot, regression-issue-74839,
resource-consumer, sample-device-plugin.
Windows does not support partially qualified domain names, which is why the test can fail.
Additionally, because nslookup may return 0 on Windows, even if the given DNS name was not
found, this issue was not observed until recently. We're now checking stderr as well.
The test spawns 10 pods with the same pod name, each of which contains multiple
containers with the same container name. Because of this, the test fails.
This commit addresses this issue.
e2e_node tests trigger OOM events on COS versions > 73-11636-0-0,
possibly because of this change in COS v.73-11636-0-0:
"Made containerd run as a standalone systemd service"
The OOM killer usually kills the cadvisor and e2e_node.test processes,
causing node-kubelet-benchmark failures.
Decreasing the number of pods from 105 to 90 frees enough memory for
the test to succeed.
kubeconform was choking on a typo in the description field, so I fixed
the typo while adding friendlier logging to tell me which file was
invalid
I got curious why tests didn't catch this, and it turns out kubeconform
and the behavior tests use different codepaths to load and validate. So
I merged them together
The way ginkgo handles interrupts is:
- It starts running the AfterSuite hooks in a separate goroutine (this includes cleanupAction hooks)
- Once the AfterSuite hook is done executing it calls
os.Exit(1) on the test suite.
So here is how cleanupFunc(), which runs via defer in a test, can be
interrupted:
- cleanupFunc starts running via defer (or an AfterEach hook), but the first
thing that function does is remove its cleanupHandle via
framework.RemoveCleanupAction.
- The test suite receives an interrupt from the user and the AfterSuite block
starts executing
- remember that while cleanupFunc is running in goroutine#1,
AfterSuite is running concurrently in goroutine#2.
- The AfterSuite hook has a bunch of CleanupActions it needs to run which
were registered via framework.AddCleanupAction(cleanupFunc), but
once cleanupFunc starts executing via defer in the test, it will
remove the cleanupHandle from framework's AfterSuite hooks.
- So if AfterSuite did not have anything to run (because
those actions were removed via framework.RemoveCleanupAction),
then it will simply go to the last framework.AfterEach action and call os.Exit(1).
- So if os.Exit(1) is called before cleanupFunc has a chance to finish in defer, it will not complete.
The google cloud builder job is launched without the required Windows Image Builder nodes
certificates that are needed for authentication when building the Windows container images.
Adds a step in test/images/cloudbuild.yaml that fetches a secret containing the certificates.
This change removes the `coreCount` variable and associated counting
logic from the cluster size autoscaling test. This variable was used
during the 1.9 release era, but the tests that used it were removed
before the 1.10 release. Please see the referenced commits[0][1] for
more information.
[0]
c8b807837a
[1]
fd738945b1
The e2e/lifecycle package is owned by SIG CL, although maybe this
should be moved to e2e/auth at some point.
- copy the OWNERS from /cmd/kubeadm (minus the area/kubeadm label)
- remove the OWNERS file in /bootstrap, letting the parent OWNERS file
manage this sub-package.
ingress: use new serviceBackend split
ingress: remove all v1beta1 restrictions on creation
This change removes the creation and update restrictions enforced by
k8s 1.18 that disallowed resource backends.
Paths are no longer
required to be valid regexes, and a PathType is now user-specified
and no longer defaulted.
Also remove all TODOs in staging/net/v1 types
Signed-off-by: Christopher M. Luciano <cmluciano@us.ibm.com>
Lowering the amount of CPU allocated to this workload will make the
allocated resources similar to the other npb and tf workloads in
these tests.
This will also allow running all three workloads in an n1-standard-12 gcp
instance - which has 16 cpus and 60 GB.
Signed-off-by: alejandrox1 <alarcj137@gmail.com>
For the beta server-side dry-run feature, `kubectl apply` provided the
`--server-dry-run` flag.
As of 1.18, this flag was deprecated and marked to be removed after 1
release.
- Remove the ServerDryRun field and delegate it entirely to the resource.Helper
- Use resource.Helper for deletions (as in `kubectl apply --force`)
instead of using the pruner's method that uses a dynamic client
- Reduce the number of resource.Helpers and the number of times we check for
server-side dry-run in apply
the test "executing a command with run and attach without stdin"
is inherently flaky, there are several discussion but seems that
it requires changing the way the kubectl run and attach works.
The test fails if we are not able to attach before the container prints
"stdin closed", but hasn't exited yet.
Because the race seems difficult to solve, we can wait 5 seconds
before printing to give time to kubectl to attach to the container.
- use the in-cache Pod instead of the real-time Pod (fetched by calling the API server) to mark it as unschedulable
in the internal schedulingQ
- remove the backoff logic, as we no longer call the API server
- the whole logic is changed to a synchronous call
We have little coverage around node addition and removal. Since distinct event handlers interact, it is important to cover this in integration tests.
Signed-off-by: Aldo Culquicondor <acondor@google.com>
ensure that when a pod servicing UDP traffic is deleted, the conntrack entries
are cleaned up and another backend can pick up the traffic with minimal
interruption.
When using NodePort services and long-running connections, stale conntrack
entries left after pod deletion can halt the flow of traffic. Add a test case
to check that conntrack entries are cleaned up.
In case two or more controllers share the informers created through InitTestScheduler,
it's not safe to start the informers until all controllers have set their informer
indexers. Otherwise, some controllers might fail to register their indexers
in time. Thus, it's the responsibility of each consumer to make sure all informers
are started only after all controllers have had time to get initialized.
ginkgo has a weird bug: AfterEach does not get called when the
test suite exits with a certain kind of interrupt (Ctrl-C, for example).
More info - https://github.com/onsi/ginkgo/issues/222
We work around this issue in Kubernetes by adding a special hook into the
AfterSuite call, but AfterSuite cannot be used to perform certain
kinds of cleanup because it can race with the AfterEach hook, and the
framework.AfterEach hook will set framework.ClientSet to nil.
This presents a problem in cleaning up the CSI driver and test pods. This
PR removes cleanup of the driver manifest via CleanupAction because that
is not safe and is racy (for example, f.ClientSet may disappear!) and makes
the AfterSuite hooks run in an ordered fashion.
We should read and verify the data before actually closing the
connection to avoid connection-based races within the test.
Signed-off-by: Sascha Grunert <sgrunert@suse.com>
Removes the fatal error from getIP and moves it into the
retry loop so that the application will not immediately
crash on failure.
Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
This updates the error messages when registering a
node to be more explicit about what error occurred
and how long it will wait to retry.
Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
Currently the guestbook application will fail if it is unable
to resolve the TCP address on the first attempt. If pod networking
is not set up when the application starts, then it will be
unable to resolve, leading to frequent failures. This moves
the address resolution into the retry block so it will try
again if unsuccessful on the first attempt.
Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
The current /exit method is not sufficient to test graceful shutdown
behaviors within Kube that allow services to remain available during
rolling restarts. Add support for `wait=DURATION` and
`timeout=DURATION` to the exit handler and wire that to the Go http
server's graceful termination.
With these methods netexec can be used in a pod to simulate graceful
shutdown by adding a preStop handler that hits the exit endpoint with
a timeout and wait period.
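For example, a preStop hook along these lines could exercise the new parameters (the port, durations, and the v1.Handler naming of that era's API are illustrative):
```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// netexecWithGracefulExit sketches the usage described above: the preStop
// hook hits netexec's /exit endpoint with wait and timeout durations, so the
// HTTP server drains in-flight requests before the container exits.
func netexecWithGracefulExit() v1.Container {
	return v1.Container{
		Name:  "netexec",
		Image: "agnhost", // placeholder for the agnhost test image
		Args:  []string{"netexec", "--http-port=8080"},
		Lifecycle: &v1.Lifecycle{
			PreStop: &v1.Handler{ // lifecycle handler type of this API era
				HTTPGet: &v1.HTTPGetAction{
					Path: "/exit?wait=10s&timeout=30s",
					Port: intstr.FromInt(8080),
				},
			},
		},
	}
}
```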
kubelet sometimes calls NodeStageVolume and NodePublishVolume too
often, which breaks this test and leads to flakiness. The test isn't
about that, so we can relax the checking and it still covers what it
was meant to cover.
collectPodsAndNetworkPolicies() is called to collect diagnostics
after a failure. Previously, if it encountered a failure in getting
the logs it would call Failf(), discarding the rest of the diagnostics
immediately.
Following changes in #87730, Kubelet is directly using hcsshim to gather stats.
However, unlike the `docker stats` API that was used before, hcsshim does not
keep information about exited containers.
When the Kubelet lists containers (`docker_container.go:ListContainers()`),
it sets `All: true`, retrieving non-running containers.
When docker stats is called with such a container id, it'll return a valid JSON
with all values set to 0. The non-running containers are filtered later on in the process.
When hcsshim is called with such a container id, it'll return an error, effectively
stopping the stats retrieval for all containers.
"Volumes GlusterFS should be mountable" is a bit flaky in a downstream CI.
This PR make "should be mountable" test on par with the other GlusterFS
tests (in_tree.go: DeleteVolume())
commit 43c56eb403 introduced a change
where CPUAccounting and TasksAccounting are enabled for
the systemd service.
It causes a regression on RHEL 7.8, where systemd-run doesn't allow
setting TasksAccounting.
Since Delegate= already enables all the controllers, it is superfluous
to specify them.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
In caf0d1d61874a2c8687b7deb773eca30ddaee5b6 we documented a policy to
ensure that conformance tests do not rely on the existence or use of
kubelet APIs directly. So based on that, we should drop conformance for
the two tests here that use the "/logs" endpoint directly.
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
The test "should not change the subpath mount on a container restart if the environment variable changes"
creates a pod with the liveness probe: cat /volume_mount/test.log. The test then
deletes that file, which causes the probe to fail and the container to be restarted,
after which it recreates the file by exec-ing into the pod. However, there is a chance
that the container has not been created yet, or has not started yet.
This commit adds a few retries to the exec command.
this is mainly to ensure integration tests (which all end in _test)
are properly bossed around for their imports
I had to adjust some of the _test files to adhere to existing
reverse_rules specified elsewhere
specifically:
- cmd/kubeadm/.import-restrictions
  - we don't need to explicitly allow k8s.io repos (external or published)
- rm pkg/controller/.import-restrictions
  - pkg/client/unversioned was removed in 59042
- pkg/kubectl/.import-restrictions
  - pkg/printers is no longer used
  - pkg/api was masking all of the pkg/apis prefixes
- rm staging/src/k8s.io/code-generator/cmd/lister-gen/.import-restrictions
  - noop / empty file
- test/e2e/framework/.import-restrictions
  - we don't need to explicitly allow k8s.io repos (external or published)
yaml has comments, so we can explain why we have certain rules or
certain prefixes.
For those files that weren't already commented yaml, I converted them to
yaml and took a best guess at comments based on the PRs that introduced
or updated them.
When a test pattern or storage class uses late binding, the cleanup
code didn't know about the PV that may have been created for the PVC
after setting it up, and thus also didn't wait for PV deletion.
This is problematic for test isolation, because the next test was
allowed to be started before fully cleaning up. Worse, if the driver
gets removed after the test, the volume might never get deleted.
'docker pull' is a time-consuming operation. It makes sense to check
whether the image exists locally before pulling it from a registry.
Check if the image exists by running 'docker inspect', and only pull if
the image doesn't exist.
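A minimal sketch of the check (the --type=image flag is an assumption; the commit only mentions 'docker inspect'):
```go
package sketch

import "os/exec"

// ensureImage pulls an image only when 'docker inspect' reports that it is
// not already present locally, avoiding a time-consuming pull.
func ensureImage(image string) error {
	if err := exec.Command("docker", "inspect", "--type=image", image).Run(); err == nil {
		return nil // image already exists locally
	}
	return exec.Command("docker", "pull", image).Run()
}
```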
The service allocator is used to allocate IP addresses for the
Service IP allocator and NodePorts for the Service NodePort
allocator. It uses a bitmap backed by etcd to store the allocation
and tries to allocate the resources directly from the local memory
instead of from etcd, which can cause issues in environments with
high concurrency.
It may happen, in deployments with multiple apiservers, that the
resource allocation information is out of sync; this is more
noticeable with NodePorts, for example:
1. apiserver A creates a service with NodePort X
2. apiserver B deletes the service
3. apiserver A creates the service again
If the allocation data of apiserver A wasn't refreshed after the
deletion by apiserver B, apiserver A fails the allocation because
the data is out of sync. The Repair loops solve the problem later,
but there are some use cases that require improving the concurrency
in the allocation logic.
We can try to not do the Allocation and Release operations locally,
and instead check whether the local data is up to date with etcd,
and operate over the most recent version of the data.
- add ./hack/tools/go.mod, this makes ./hack/tools a distinct module
- hack/tools/tools.go underscore-imports bazel-related tools; over time we
can add others.
- hack/*.sh scripts will cd to hack/tools and go install tools from there
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
During a parallel test run these tests were observed to cause a
preemption of another test:
```
Apr 19 16:59:06.749: INFO: At 2020-04-19 16:58:52 +0000 UTC - event for pod-init-b6fbd440-dbc2-454a-b31a-ce44266298d1: {default-scheduler } Scheduled: Successfully assigned e2e-init-container-7691/pod-init-b6fbd440-dbc2-454a-b31a-ce44266298d1 to ip-10-0-148-234.us-west-2.compute.internal
Apr 19 16:59:06.750: INFO: At 2020-04-19 16:58:54 +0000 UTC - event for pod-init-b6fbd440-dbc2-454a-b31a-ce44266298d1: {default-scheduler } Preempted: Preempted by e2e-resourcequota-priorityclass-8850/testpod-pclass9 on node ip-10-0-148-234.us-west-2.compute.internal
```
These tests have no need to actually land on a node to validate resource quota and so we can set an impossible scheduling condition. Hopefully we don't have tests that too broadly check impossible scheduling conditions.
The admission cache may take longer to see both ingress classes
than it takes to create the ingress. We must loop until we see
the appropriate error, cleaning up after ourselves as we go.
Don't set a connection deadline for reading, because the read operation will
fail if no data is received after the deadline, and will not keep the
connection in the close_wait status.
Copy csi-hostpath driver manifests from
kubernetes-csi/csi-driver-host-path. It bumps the version of all images to the
release shipped along with Kubernetes 1.18.
As seen in one case (https://github.com/intel/pmem-csi/issues/587), a
pod can reach the "not running" state although its ephemeral volumes
are still being torn down by kubelet and the CSI driver. What happens
then is that the test returns too early and even deleting the
namespace and thus the pod succeeds before the NodeUnpublishVolume
really finishes.
To avoid this, StopPod now waits for the pod to really disappear.
The agnhost image used for testing has a `netexec` path which supports
two new flags, `--tls-cert-file` and `--tls-private-key-file`. If the
former is provided, the HTTP server will be upgraded to HTTPS, using the
certificate (and private key) provided.
By default, there are keys already mounted into the container at
`/localhost.crt` and `/localhost.key`, which contain PEM-encoded TLS
certs with IP SANs for `127.0.0.1` and `[::1]`.
This adds 2 new tests covering EndpointSlices, including new coverage of
the self-referential Endpoints and EndpointSlices that need to be
created by the API Server and the lifecycle of EndpointSlices from
creation to deletion. This also removes the [feature] indicator from the
name to ensure that this test will run more often now that it is enabled
by default.
Adds reviewers to the OWNERS files in the kubernetes/test/images folder.
The reviewers are added automatically, based on their contributions to
an image (>= 20% code churn).
Note that the code churn is taken into account for authors, and not committers.
Adds OWNERS files for: cuda-vector-add, nonewprivs, pets, redis, volume.
Adds reviewers to the OWNERS files in the kubernetes/test/images folder.
The reviewers are added automatically, based on their contributions to
an image (>= 20% code churn).
Note that the code churn is taken into account for authors, and not committers.
Adds OWNERS files for: apparmor-loader, echoserver, jessie-dnsutils, metadata-concealment,
sample-apiserver.
The build times are a bit high for the image builder (~50 minutes), and they will get a bit higher
when Windows support is added to the other test images. This commit changes the
machineType to N1_HIGHCPU_8.
Reenables Windows test image building. Added DOCKER_CERT_BASE_PATH (default value: $HOME),
which will contain the path where the certificates needed for Remote Docker Connection can
be found.
If a REMOTE_DOCKER_URL was not set for a particular OS version, exclude that image from the
manifest list. This fixes an issue where, if REMOTE_DOCKER_URL was not set for Windows Server 1909,
the Windows images were completely excluded from the manifest list, including those for Windows Server 1809
and 1903, which could have been built and pushed.
Sets "test-webserver" as the default CMD for kitten and nautilus. Since they are now based on
agnhost, they should be set to run test-webserver to maintain previous behaviour.
Bumps the agnhost version to 2.13, as 2.12 has already been promoted. 2.13 will contain
Windows support.
Adds Windows support for the kitten and nautilus images, so they can be promoted together
with agnhost (they were not previously promoted).
Adds OWNERS files to: agnhost, busybox, kitten, nautilus.
The timeouts for the two loops inside the test itself are now bounded
by an upper limit on the duration of the entire test instead of
each having its own, rather arbitrary timeout.
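A sketch of the pattern (names are illustrative):
```go
package sketch

import (
	"context"
	"time"
)

// runWithTestDeadline gives both loops a single deadline derived from the
// overall test duration, instead of each loop having its own timeout.
func runWithTestDeadline(parent context.Context, testTimeout time.Duration,
	loopA, loopB func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, testTimeout)
	defer cancel()
	if err := loopA(ctx); err != nil {
		return err
	}
	return loopB(ctx) // bounded by whatever time loopA left over
}
```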
The functionality included in the e2e/manifests is useful for writing
e2e tests and will be a good addition to the test framework as a
sub-package.
Signed-off-by: alejandrox1 <alarcj137@gmail.com>
Before https://github.com/kubernetes/kubernetes/pull/83084, `kubectl
apply --prune` can prune resources in all namespaces specified in
config files. After that PR got merged, only a single namespace is
considered for pruning. It is OK if the namespace is explicitly specified
with the --namespace option, but what the PR does is use the default
namespace (or the one from kubeconfig) if not overridden by a command line flag.
That breaks the existing usage of `kubectl apply --prune` without
--namespace option. If --namespace is not used, there is no error,
and no one notices this issue unless they actually check that pruning
happens. This issue also prevents resources in multiple namespaces in
config file from being pruned.
kubectl 1.16 does not have this bug. Let's see the difference between
kubectl 1.16 and kubectl 1.17. Suppose the following config file:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: foo
  namespace: a
  labels:
    pl: foo
data:
  foo: bar
---
apiVersion: v1
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: bar
  namespace: a
  labels:
    pl: foo
data:
  foo: bar
```
Apply it with `kubectl apply -f file`. Then comment out ConfigMap foo
in this file. kubectl 1.16 prunes ConfigMap foo with the following
command:
$ kubectl-1.16 apply -f file -l pl=foo --prune
configmap/bar configured
configmap/foo pruned
But kubectl 1.17 does not prune ConfigMap foo with the same command:
$ kubectl-1.17 apply -f file -l pl=foo --prune
configmap/bar configured
With this patch, kubectl once again can prune the resource as before.
/cluster/kubeadm.sh is used to find the kubeadm binary.
This file is legacy and is removed.
Remove /test/cmd/kubeadm.sh. This file contains a function that is used
to build kubeadm and invoke "make test". Move the function contents
to hack/make-rules/test-cmd.cmd.
Stop sourcing /test/cmd/kubeadm.sh in /test/cmd/legacy-script.sh.
Also remove the --kubeadm-path invocation as this can be handled
with an env. variable directly.
The "error waiting for expected CSI calls" is redundant because it's
immediately followed by checking that error with:
framework.ExpectNoError(err, "while waiting for all CSI calls")
The mock driver gets instructed to return a ResourceExhausted error
for the first CreateVolume invocation via the storage class
parameters.
How this should be handled depends on the situation: for normal
volumes, we just want external-scheduler to retry. For late binding,
we want to reschedule the pod. It also depends on topology support.
The code became obsolete with the introduction of parseMockLogs
because that will retrieve the log itself. For debugging of a running
test the normal pod output logging is sufficient.
parseMockLogs is called potentially multiple times while waiting for
output. Dumping all CSI calls each time is quite verbose and
repetitive. To verify what the driver has done already, the normal
capturing of the container log can be used instead:
csi-mockplugin-0/mock@127.0.0.1: gRPCCall: {"Method":"/csi.v1.Node/NodePublishVolume","Request"...
As seen in some test
runs (https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/89041),
retrieving output can fail with "the server rejected our request for
an unknown reason (get pods csi-mockplugin-0)".
If this is truly an intermittent error, then the existing retry logic in
the callers can deal with it.
Especially related to "uncertain" global mounts. A large refactoring of CSI
mock tests were necessary:
- to be able to script the driver to return errors as required by the test
- to parse the CSI driver logs to check kubelet called the right CSI calls
The e2e TCP CLOSE_WAIT test has to create a server pod and then, from
a client, it creates a connection but doesn't notify the server
when closing it, so the connection stays in the CLOSE_WAIT status until it
times out.
The current test uses a simple timeout to wait until the server pod
is ready; it's better to use WaitForPodsReady to wait until
the pod is available, to avoid problems in busy environments like
the CI.
It also deletes the pods once the tests finish to avoid leaking
pods.
The original logic was that dumping can stop (for example, due to
losing the connection to the apiserver) and then will start again as
long as the container exists. That it duplicates output on restarts
is better than skipping output that might not have been dumped yet.
But that logic then also dumped the output of terminated containers
multiple times:
- logging is started, dumps all output and stops because the
container has terminated
- next check finds the container again, sees no active logger,
repeats
This wasn't a problem for short-lived logging in a custom
namespace (the way how it is done for CSI drivers in Kubernetes E2E),
but other testsuites (like the one from PMEM-CSI) keep logging running
for the entire test suite duration: there duplicate output became a
problem when adding driver redeployment as part of the suite's run.
To avoid duplicated output for terminated containers, which containers
have been handled is now stored permanently. For terminated containers,
restarting of dumping is prevented. This comes with the risk that if
the previous dumping ended before capturing all output, some output
will get lost.
Marking the start and stop of the log was also useful when streaming
to a single writer and thus gets enabled.
There were several sshPort values in e2e test packages because
we've migrated code from the e2e framework by copying and pasting.
This adds a common SSHPort to the e2essh package to reduce such duplicated
code.
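Roughly (the constant's value and the call-site shape are assumptions):
```go
package sketch

import "net"

// SSHPort mirrors the common constant added to the e2essh package
// (the value "22" is assumed).
const SSHPort = "22"

// sshAddr shows the intended call-site pattern: build the address from the
// shared constant instead of a locally copied sshPort.
func sshAddr(host string) string {
	return net.JoinHostPort(host, SSHPort)
}
```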
Conformance tests must not rely on the kubelet API in order to
pass. In this case, I think it's unnecessary to verify that a
kubelet observes the deletion within gracePeriod seconds. The
remaining checks in this test verify that pod deletion happens,
and that the pod is removed.
Conformance tests must not rely on the kubelet API in order to
pass. SchedulerPredicates tests attempt to use the kubelet API
in their BeforeEach, some of which are tagged as Conformance.
Is there a compelling reason to use the kubelet's view of pods
for a given node instead of the apiserver's view of the pods?
we print yaml, so you can use yaml tools like `yq`:
```
e2e.test --list-conformance-tests | yq r - --collect *.testname
```
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
it turns out that the e2e test was not using the timeout used to
hold the CLOSE_WAIT status, hence the test was flaky depending
on how fast it checked the conntrack table.
This PR replaces the dependency on ssh with a pod that checks the conntrack
entries on the host in a loop, to make the test more robust
and reduce the flakiness due to race conditions and/or ssh issues.
It also fixes a bug when trying to grep the conntrack entry, where
the error was swallowed if a conntrack entry wasn't found.
Integration tests imported e2e test code, and that dependency had two drawbacks:
- Hard to move test/e2e/framework into staging (#74352)
- Need to run integration tests always even if PRs are just changing e2e test code
This enables an import-boss check to block such a dependency.
Sometimes the pod has already been cleaned up by the time the test
tried to grab the logs.
Mar 27 16:19:38.066: INFO: Waiting for client-a-jt4tf to complete.
Mar 27 16:19:38.066: INFO: Waiting up to 5m0s for pod "client-a-jt4tf" in namespace "e2e-network-policy-c-9007" to be "success or failure"
Mar 27 16:19:38.072: INFO: Pod "client-a-jt4tf": Phase="Pending", Reason="", readiness=false. Elapsed: 6.270302ms
Mar 27 16:19:40.078: INFO: Pod "client-a-jt4tf": Phase="Pending", Reason="", readiness=false. Elapsed: 2.01233019s
Mar 27 16:19:42.086: INFO: Pod "client-a-jt4tf": Phase="Succeeded", Reason="", readiness=false. Elapsed: 4.020186873s
STEP: Saw pod success
Mar 27 16:19:42.086: INFO: Pod "client-a-jt4tf" satisfied condition "success or failure"
Mar 27 16:19:42.093: FAIL: Error getting container logs: the server could not find the requested resource (get pods client-a-jt4tf)
Full Stack Trace
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/network.checkNoConnectivity(0xc00104adc0, 0xc0016b82c0, 0xc001666400, 0xc000c32000)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/network/network_policy.go:1457 +0x2a0
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/network.testCannotConnect(0xc00104adc0, 0xc0016b82c0, 0x55587e9, 0x8, 0xc000c32000, 0x50)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/network/network_policy.go:1406 +0x1fc
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/network.glob..func13.2.7()
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/network/network_policy.go:285 +0x883
github.com/openshift/origin/pkg/test/ginkgo.(*TestOptions).Run(0xc001e47830, 0xc001e50b70, 0x1, 0x1, 0x0, 0x0)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/pkg/test/ginkgo/cmd_runtest.go:59 +0x41f
main.newRunTestCommand.func1(0xc00121b900, 0xc001e50b70, 0x1, 0x1, 0x0, 0x0)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:238 +0x15d
github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).execute(0xc00121b900, 0xc001e50b30, 0x1, 0x1, 0xc00121b900, 0xc001e50b30)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:826 +0x460
github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc00121b180, 0x0, 0x60d2d00, 0x9887ec8)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:914 +0x2fb
github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).Execute(...)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:864
main.main.func1(0xc00121b180, 0x0, 0x0)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:59 +0x9c
main.main()
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:60 +0x341
STEP: Cleaning up the pod client-a-jt4tf
STEP: Cleaning up the policy.
RestartControllerManager() is a kube-controller-manager-specific function,
and it is better to separate the function into a subpackage of the e2e
test framework.
In addition, the function made an invalid dependency on e2essh.
So this separates the function into the e2ekubesystem subpackage.
- Move utilities or constants out so that both of them are able
to run independently.
- Rename the legacy test so that it can eventually be deleted when the
perf dash changes are done
When deleting fails, the tests should be considered as failed,
too. Ignoring the error caused a wrong return code in the CSI mock
driver to go unnoticed (see
https://github.com/kubernetes-csi/csi-test/pull/250). The v3.1.0
release of the CSI mock driver fixes that.
The function is called from the e2e/network test only, so this moves
the function into the test, reducing the e2e/framework/util.go code
and removing an invalid dependency on the e2e test framework.
The function is for persistent volumes, and there is no
reason for it to stay in the core test framework. So this moves the
function into the e2epv package to reduce the e2e/framework/util.go
code.
Since 4e7c2f638d the function has been
called from the storage vsphere e2e test only. This moves the function
into the test file for
- Reducing test/e2e/framework/util.go, which is one of the huge files
- Removing an invalid dependency on the e2e test framework
- Removing an unnecessary TODO
This is for removing an invalid dependency from the e2e core framework on the
e2essh subpackage and reducing the test/e2e/framework/util.go code, which is
one of the huge files today.
WaitForPod*() are just wrapper functions for the e2epod package, and they
made an invalid dependency on a sub e2e framework from the core framework.
So this replaces WaitForPodTerminated() with the e2epod function.
WaitForPod*() are just wrapper functions for the e2epod package,
and they made an invalid dependency on a sub e2e framework from the core framework.
So we can use e2epod.WaitTimeoutForPodReadyInNamespace to remove the invalid dependency.
The main purpose of this PR is to remove the core framework's dependency on the pod subpackage.
WaitForPod*() are just wrapper functions for the e2epod package, and they
made an invalid dependency on a sub e2e framework from the core framework.
So this replaces WaitForPodNoLongerRunning() with the e2epod function.
When kubelet is restarted, it will now remove the resources for huge
page sizes that are no longer supported. This is required when:
- a node disables huge pages
- the default huge page size is changed in older versions of linux
(because the node will then only support the newly set default)
- software updates change which sizes are supported (e.g. by changing
boot parameters).
The e2e framework package podlogs is used in e2e/storage/testsuites
only. In addition, we considered that we should have a single e2e framework
package for pod, without podlogs. So this moves podlogs into
e2e/storage/podlogs for the e2e storage tests.
Windows test "[sig-windows] [Feature:Windows] Cpu Resources Container
limits should not be exceeded after waiting 2 minutes" should be run
serially to prevent flakiness.
WaitForPod*() are just wrapper functions for the e2epod package, and they
made an invalid dependency on a sub e2e framework from the core framework.
So this replaces WaitForPodRunning() with the e2epod function.
Adds splitOsArch function to image-util.sh, which makes the script DRY-er.
When building a Windows test image, if REMOTE_DOCKER_URL is not set, skip the rest of the
building process for that image, which will save some time (no need to build binaries).
If a REMOTE_DOCKER_URL was not set for a particular OS version, exclude that image from the
manifest list. This fixes an issue where, if REMOTE_DOCKER_URL was not set for Windows Server 1909,
the Windows images were completely excluded from the manifest list, including those for Windows Server 1809
and 1903, which could have been built and pushed.
Sets "test-webserver" as the default CMD for kitten and nautilus. Since they are now based on
agnhost, they should be set to run test-webserver to maintain previous behaviour.
So multiple instances of kube-apiserver can bind on the same address and
port, to provide seamless upgrades.
Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
This change removes support for basic authn in v1.19 via the
--basic-auth-file flag. This functionality was deprecated in v1.16
in response to ATR-K8S-002: Non-constant time password comparison.
Similar functionality is available via the --token-auth-file flag
for development purposes.
Signed-off-by: Monis Khan <mok@vmware.com>
Some e2e tests depend on the controller-manager to expose metrics
on the path /metrics.
It may happen that when the test runs, the pod is not available or the
URL not ready, causing it to fail.
Previously, the tests were waiting until the pod was running, but we
need to wait until the /metrics URL is ready.
The MetricsGrabber may use the controller-manager pod
to gather metrics; however, it doesn't wait until
the pod is ready to serve, failing the test if this is the
case.
We wait until the controller-manager pod is running
before trying to get metrics from it.
There were framework.ExpectNoError(fmt.Errorf(..)) calls which just
raise a failure without any actual value check; they just raised the
specified error messages. These usages of framework.ExpectNoError()
seemed a little tricky, so this replaces them with corresponding check
functions for readability.
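A hypothetical before/after illustrating the kind of replacement (the pod-count check is made up for the example):
```go
package e2e

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/test/e2e/framework"
)

// Before: wrapping a hand-made error just to fail the test.
func checkPodsOld(pods *v1.PodList, ns string) {
	if len(pods.Items) == 0 {
		framework.ExpectNoError(fmt.Errorf("no pods found in namespace %q", ns))
	}
}

// After: state the expectation or failure directly.
func checkPodsNew(pods *v1.PodList, ns string) {
	framework.ExpectNotEqual(len(pods.Items), 0, "no pods found in namespace "+ns)
}
```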
The configuration file was designed as a yaml file on purpose,
to easily extend the test cases without needing to modify
the testing binary. Also, it's possible to extend the configuration
itself to enrich individual test cases.
The kubelet can race when a pod is deleted and report that a container succeeded
when it instead failed, and thus the pod is reported as succeeded. Create an e2e
test that demonstrates this failure.
After moving Permit() to the scheduling cycle, the test PermitPlugin should
no longer wait inside Permit() for another pod to enter Permit() and become the waiting pod.
In the past this was a way to make the test work regardless of the order in
which pods enter Permit(), but now only one Permit() can be executed at
any given moment, and waiting for another pod to enter Permit() inside
Permit() leads to timeouts.
In this change, the waitAndRejectPermit and waitAndAllowPermit flags make the first
pod to enter Permit() the waiting pod and the second pod to enter Permit()
either the rejecting or the allowing pod.
Mentioned in #88469
Extends agnhost with the capability to validate a mounted token against
the API server's OIDC endpoints.
Co-authored-by: Michael Taufen <mtaufen@google.com>
Close outbound connections when using a cert callback and certificates rotate. This means that we won't get into a situation where we have open TLS connections using expired certs, which would get unauthorized errors at the apiserver
Attempt to retrieve a new certificate if open connections are near expiry, to prevent the case where the cert expires but we haven't yet opened a new TLS connection and so GetClientCertificate hasn't been called.
Move certificate rotation logic to a separate function
Rely on generic transport approach to handle closing TLS client connections in exec plugin; no need to use a custom dialer as this is now the default behaviour of the transport when faced with a cert callback. As a result of handling this case, it is now safe to apply the transport approach even in cases where there is a custom Dialer (this will not affect kubelet connrotation behaviour, because that uses a custom transport, not just a dialer).
Check expiry of the full TLS certificate chain that will be presented, not only the leaf. Only do this check when the certificate actually rotates. Start the certificate as a zero value, not nil, so that we don't see a rotation when there is in fact no client certificate
Drain the timer when we first initialize it, to prevent immediate rotation. Additionally, calling Stop() on the timer isn't necessary.
Don't close connections on the first 'rotation'
Remove RotateCertFromDisk and RotateClientCertFromDisk flags.
Instead simply default to rotating certificates from disk whenever files are exclusively provided.
Add integration test for client certificate rotation
Simplify logic; rotate every 5 mins
Instead of trying to be clever and checking for rotation just before an
expiry, let's match the logic of the new apiserver cert rotation logic
as much as possible. We write a controller that checks for rotation
every 5 mins. We also check on every new connection.
Respond to review
Fix kubelet certificate rotation logic
The kubelet rotation logic seems to be broken because it expects its
cert files to end up as cert data whereas in fact they end up as a
callback. We should just call the tlsConfig GetCertificate callback
as this obtains a current cert even in cases where a static cert is
provided, and check that for validity.
Later on we can refactor all of the kubelet logic so that all it does is
write files to disk, and the cert rotation work does the rest.
Only read certificates once a second at most
Respond to review
1) Don't blat the cert file names
2) Make it more obvious where we have a neverstop
3) Naming
4) Verbosity
Avoid cache busting
Use filenames as cache keys when rotation is enabled, and add the
rotation later in the creation of the transport.
Caller should start the rotating dialer
Add continuous request rotation test
Rebase: use context in List/Watch
Swap goroutine around
Retry GETs on net.IsProbableEOF
Refactor certRotatingDialer
For simplicity, don't affect cert callbacks
To reduce change surface, lets not try to handle the case of a changing
GetCert callback in this PR. Reverting this commit should be sufficient
to handle that case in a later PR.
This PR will focus only on rotating certificate and key files.
Therefore, we don't need to modify the exec auth plugin.
Fix copyright year
Quite a few images are only used a few times in a few tests. Thus,
the images are being centralized into the agnhost image, reducing
the number of images that have to be pulled and used.
This PR replaces the usage of the following images with agnhost:
- mounttest
- mounttest-user
Additionally, it removes the usage of the mounttest-user image and removes
it from kubernetes/test/images. RunAsUser is set instead of having that image.
Most of these could have been refactored automatically but it wouldn't
have been uglier. The unsophisticated tooling left lots of unnecessary
struct -> pointer -> struct transitions.
This is gross but because NewDeleteOptions is used by various parts of
storage that still pass around pointers, the return type can't be
changed without significant refactoring within the apiserver. I think
this would be good to cleanup, but I want to minimize apiserver side
changes as much as possible in the client signature refactor.
The condition was not part of the message and so would not
match:
OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"/var/lib/kubelet/pods/128aea1f-bde3-43d5-8b5f-dd86b9a5ef33/volumes/kubernetes.io~secret/default-token-v55hm\\\" to rootfs \\\"/var/lib/docker/overlay2/813487ba91d534ded546ae34f2a05e7d94c26bd015d356f9b2641522d8f0d6da/merged\\\" at \\\"/var/run/secrets/kubernetes.io/serviceaccount\\\" caused \\\"stat /var/lib/kubelet/pods/128aea1f-bde3-43d5-8b5f-dd86b9a5ef33/volumes/kubernetes.io~secret/default-token-v55hm: no such file or directory\\\"\"": unknown
Updated the check and regex.
Make sure the SR-IOV device plugin is ready, and that
there are enough SR-IOV devices allocatable before
spinning up test pods.
Signed-off-by: vpickard <vpickard@redhat.com>
1. move the integration test of TaintBasedEvictions to test/integration/node
2. move the e2e test of TaintBasedEvictions to test/e2e/node
3. modify the conformance file to adapt the TaintBasedEviction test
The current agnhost version is 2.12; 2.11 was not previously built, as the
VERSION bumps merged one after the other, and the Image Promoter did not get to
build the 2.11 image.
In the current version, due to how make works, when building all the conformance
images (make all-push WHAT=all-conformance), ALL the images are being built first
before being pushed.
This PR will allow images to be built and pushed immediately afterwards, so the first
images that have been successfully built are already pushed and promotable, even if
the task failed on the last image, or it timed out.
A previous PR (#76838) introduced the ability to build and publish
Windows Test Images to kubernetes/test/images/image-util.sh.
Additionally, that PR also configured the Image Promoter to use a
few Windows Remote Docker build nodes to build the Windows Test Images,
however, there is a minor issue: the build container has a different $HOME
folder than expected (is: /builder/home, expected: /root - since it's the
root user), and the Remote Docker credentials are mounted in /root.
Because of that, image-build.sh cannot find the credentials it needs.
This will have to be properly fixed, but for now, we can just skip
the Windows image building part.
Quite a few images are only used a few times in a few tests. Thus,
the images are being centralized into the agnhost image, reducing
the number of images that have to be pulled and used.
This PR replaces the usage of the following images with agnhost:
- dnsutils
dnsmasq is a Linux specific binary. In order for the tests to also
pass on Windows, CoreDNS should be used instead.
- Search/replace Google Infra kube-cross locations for K8s Infra
- Update kube-cross make targets
- Don't attempt to pre-pull image (docker build --pull)
This prevents CI failures when the image under test doesn't exist
yet in the registry.
- 'make all' now builds and pushes the kube-cross image
- Allow 'TAG' to be specified via env var
- Use 'KUBE_CROSS_VERSION' to represent the kube-cross version
- Tag kube-cross images with both a kubernetes version
('git describe') and a kube-cross version
- Add a GCB (Google Cloud Build) config file (cloudbuild.yaml)
Signed-off-by: Stephen Augustus <saugustus@vmware.com>
We don't want to set the name directly because then starting the pod
can fail when the node is temporarily out of resources
(https://github.com/kubernetes/kubernetes/issues/87855).
For CSI driver deployments, we have three options:
- modify the pod spec with custom code, similar
to how the NodeSelection utility code does it
- add variants of SetNodeSelection and SetNodeAffinity which
work with a pod spec instead of a pod
- change their parameter from pod to pod spec and then use
them also when patching a pod spec
The last approach is used here because it seems more general. There
might be other cases in the future where there's only a pod spec that
needs to be modified.
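A sketch of the post-change helper shape (the affinity body is a plausible illustration, not the exact framework implementation):
```go
package sketch

import v1 "k8s.io/api/core/v1"

// SetNodeAffinity now takes a *v1.PodSpec instead of a *v1.Pod, so the same
// helper can patch bare pod specs such as those in CSI driver deployments.
func SetNodeAffinity(podSpec *v1.PodSpec, nodeName string) {
	if podSpec.Affinity == nil {
		podSpec.Affinity = &v1.Affinity{}
	}
	podSpec.Affinity.NodeAffinity = &v1.NodeAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
			NodeSelectorTerms: []v1.NodeSelectorTerm{{
				MatchFields: []v1.NodeSelectorRequirement{{
					Key:      "metadata.name",
					Operator: v1.NodeSelectorOpIn,
					Values:   []string{nodeName},
				}},
			}},
		},
	}
}
```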
A previous PR replaced the usage of Redis in the guestbook app test
with Agnhost. The replacement went well for Linux setups and Containers,
which is why the tests are green, but there is a network particularity on
Windows setups which won't allow the test to pass.
The issue was observed with another test hitting the same issue:
https://github.com/kubernetes/kubernetes/issues/83072
Here's exactly what happens during the test:
- frontend containers are created, having the /guestbook endpoint. Its main
purpose is to forward the call to either agnhost-master (cmd=set), or
agnhost-slave (cmd=get).
- agnhost-master container is created, having the /set endpoint, and the
/register endpoint, through which the agnhost-slave containers would
register to it. Its purpose is to propagate all data received through /set
to its clients.
- agnhost-slave containers are created, having the /set and /get endpoints.
They would register to agnhost-master, and then receive any and all updates
from it, which was then served through the /get endpoint.
For simplicity, all 3 types have the same agnhost subcommand (agnhost guestbook), being
able to satisfy its given purpose. For this, HTTP servers were being used, including
for the /register endpoints. agnhost-master would send its /set updates as /set HTTP
requests. However, because of the issue listed above, agnhost-master did not receive
the client's IP, but rather the container host's IP, resulting in the request being
sent to the wrong destination.
This PR updates the agnhost guestbook subcommand. Now, the agnhost subscriber nodes will
send their own IP to the /register endpoint (/endpoint?host=myip).
In order to promote the volume limits e2e test (from CSI Mock driver)
to Conformance, we can't rely on specific output of optional Condition
fields. Thus, this commit changes the test to only check the presence
of the right condition and verify that the optional fields are not empty.
The existing walk.go and conformance.txt have a few shortcomings
which we'd like to resolve:
- difficult to get the full test name due to test context nesting
- complicated AST logic and understanding necessary due to the
different ways a test can be invoked and written
This changes the AST parsing logic to be much simpler: it simply
looks for the comments at/around a specific line. This file/line
information (and the full test name) is gathered by a custom ginkgo
reporter which dumps the SpecSummary data to a file.
Also, the SpecSummary dump can, itself, be potentially useful for
other post-processing and debugging tasks.
Signed-off-by: John Schnake <jschnake@vmware.com>
The service session affinity allows setting the maximum session
sticky timeout.
This commit adds e2e tests to check that the session is sticky
before the timeout and is not after.
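The Service side of what the test exercises looks roughly like this (the timeout value is illustrative):
```go
package sketch

import v1 "k8s.io/api/core/v1"

// sessionAffinityService configures ClientIP session affinity with a maximum
// sticky timeout; after TimeoutSeconds of idleness the affinity expires.
func sessionAffinityService(name string) *v1.Service {
	timeoutSeconds := int32(60)
	svc := &v1.Service{}
	svc.Name = name
	svc.Spec.SessionAffinity = v1.ServiceAffinityClientIP
	svc.Spec.SessionAffinityConfig = &v1.SessionAffinityConfig{
		ClientIP: &v1.ClientIPConfig{TimeoutSeconds: &timeoutSeconds},
	}
	return svc
}
```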