Currently the guestbook application fails if it is unable
to resolve the TCP address on the first attempt. If pod networking
is not set up when the application starts, resolution fails,
leading to frequent failures. This moves
the address resolution into the retry block so that it is tried
again if the first attempt is unsuccessful.
Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
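A minimal sketch of the idea, not the guestbook code itself: address resolution happens inside the retry loop instead of once before it. The names `resolveWithRetry`, `flakyResolver`, and `errNetworkNotReady`, as well as the address `127.0.0.1:6379`, are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// resolveFunc has the signature of net.ResolveTCPAddr so a stub can be injected.
type resolveFunc func(network, address string) (*net.TCPAddr, error)

// errNetworkNotReady is a hypothetical error that simulates missing pod networking.
var errNetworkNotReady = errors.New("network not ready")

// resolveWithRetry resolves address inside the retry loop, so a failure on the
// first attempt is retried instead of being fatal.
func resolveWithRetry(resolve resolveFunc, address string, attempts int, delay time.Duration) (*net.TCPAddr, error) {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		addr, err := resolve("tcp", address)
		if err == nil {
			return addr, nil
		}
		lastErr = err
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return nil, fmt.Errorf("resolving %q failed after %d attempts: %w", address, attempts, lastErr)
}

// flakyResolver fails the first `failures` calls, then defers to net.ResolveTCPAddr.
func flakyResolver(failures int) resolveFunc {
	calls := 0
	return func(network, address string) (*net.TCPAddr, error) {
		calls++
		if calls <= failures {
			return nil, errNetworkNotReady
		}
		return net.ResolveTCPAddr(network, address)
	}
}

func main() {
	addr, err := resolveWithRetry(flakyResolver(2), "127.0.0.1:6379", 5, 10*time.Millisecond)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("resolved:", addr)
}
```

Passing the resolver as a parameter keeps the retry logic testable without a real network.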
The current /exit method is not sufficient to test graceful shutdown
behaviors within Kube that allow services to remain available during
rolling restarts. Add support for `wait=DURATION` and
`timeout=DURATION` to the exit handler and wire that to the Go http
server's graceful termination.
With these methods netexec can be used in a pod to simulate graceful
shutdown by adding a preStop handler that hits the exit endpoint with
a timeout and wait period.
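The following is a rough sketch of how an exit handler could wire `wait=DURATION` and `timeout=DURATION` to `http.Server.Shutdown`. It is not the netexec implementation: the helper names `parseExitParams` and `exitHandler` and the `/exit` wiring shown in `main` are assumptions for illustration.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"net/url"
	"sync"
	"time"
)

// parseExitParams reads the optional wait and timeout durations from the query.
func parseExitParams(q url.Values) (wait, timeout time.Duration, err error) {
	if v := q.Get("wait"); v != "" {
		if wait, err = time.ParseDuration(v); err != nil {
			return 0, 0, fmt.Errorf("invalid wait: %w", err)
		}
	}
	if v := q.Get("timeout"); v != "" {
		if timeout, err = time.ParseDuration(v); err != nil {
			return 0, 0, fmt.Errorf("invalid timeout: %w", err)
		}
	}
	return wait, timeout, nil
}

// exitHandler keeps serving for `wait`, then shuts the server down gracefully,
// giving in-flight requests at most `timeout` to finish (no limit if zero).
func exitHandler(srv *http.Server, done chan struct{}) http.HandlerFunc {
	var once sync.Once
	return func(w http.ResponseWriter, r *http.Request) {
		wait, timeout, err := parseExitParams(r.URL.Query())
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		once.Do(func() {
			go func() {
				defer close(done)
				time.Sleep(wait)
				ctx := context.Background()
				if timeout > 0 {
					var cancel context.CancelFunc
					ctx, cancel = context.WithTimeout(ctx, timeout)
					defer cancel()
				}
				_ = srv.Shutdown(ctx)
			}()
		})
		fmt.Fprintln(w, "shutting down")
	}
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		fmt.Println("listen:", err)
		return
	}
	mux := http.NewServeMux()
	srv := &http.Server{Handler: mux}
	done := make(chan struct{})
	mux.HandleFunc("/exit", exitHandler(srv, done))
	go srv.Serve(ln)

	// Simulates what a preStop hook would request.
	resp, err := http.Get("http://" + ln.Addr().String() + "/exit?wait=50ms&timeout=1s")
	if err != nil {
		fmt.Println("request:", err)
		return
	}
	resp.Body.Close()
	<-done
	fmt.Println("server stopped")
}
```

The server keeps accepting requests during the wait period, which is what allows a rolling restart to be observed without dropped connections.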
kubelet sometimes calls NodeStageVolume and NodePublishVolume too
often, which breaks this test and leads to flakiness. The test isn't
about that, so we can relax the checking and it still covers what it
was meant to cover.
collectPodsAndNetworkPolicies() is called to collect diagnostics
after a failure. Previously, if it encountered a failure in getting
the logs it would call Failf(), discarding the rest of the diagnostics
immediately.
Following changes in #87730, Kubelet is directly using hcsshim to gather stats.
However, unlike `docker stats` API that was used before, hcsshim does not
keep information about exited containers.
When the Kubelet lists containers (`docker_container.go:ListContainers()`),
it sets `All: true`, retrieving non-running containers.
When docker stats is called with such container id, it'll return a valid JSON
with all values set to 0. The non-running containers are filtered later on in the process.
When the hcsshim is called with such container id, it'll return an error, effectively
stopping the stats retrieval for all containers.
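One possible way to keep a single failing container from stopping retrieval for all the others is sketched below. This is an illustration of the problem described above, not necessarily what the change does; `containerInfo`, `statsFunc`, `collectStats`, and `fakeStats` are hypothetical names, and the real hcsshim API is not used here.

```go
package main

import (
	"errors"
	"fmt"
)

// containerInfo is a hypothetical, simplified view of a listed container.
type containerInfo struct {
	ID      string
	Running bool
}

// statsFunc stands in for the backend call that fetches stats for one container.
type statsFunc func(id string) (cpuNanos uint64, err error)

// collectStats skips non-running containers up front and records per-container
// errors instead of aborting the whole collection.
func collectStats(containers []containerInfo, getStats statsFunc) (map[string]uint64, []error) {
	stats := make(map[string]uint64)
	var errs []error
	for _, c := range containers {
		if !c.Running {
			continue // the backend keeps no information about exited containers
		}
		v, err := getStats(c.ID)
		if err != nil {
			errs = append(errs, fmt.Errorf("container %s: %w", c.ID, err))
			continue
		}
		stats[c.ID] = v
	}
	return stats, errs
}

// fakeStats simulates a backend that fails for the container ID "exited".
func fakeStats(id string) (uint64, error) {
	if id == "exited" {
		return 0, errors.New("container not found")
	}
	return 42, nil
}

func main() {
	containers := []containerInfo{
		{ID: "running", Running: true},
		{ID: "exited", Running: false},
	}
	stats, errs := collectStats(containers, fakeStats)
	fmt.Println(stats, len(errs))
}
```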
"Volumes GlusterFS should be mountable" is a bit flaky in a downstream CI.
This PR makes the "should be mountable" test on par with the other GlusterFS
tests (in_tree.go: DeleteVolume())
commit 43c56eb403 introduced a change
where CPUAccounting, CPUAccounting and TasksAccounting are enabled for
the systemd service.
It causes a regression on RHEL 7.8 where systemd-run doesn't allow
setting TasksAccounting.
Since Delegate= already enables all the controllers, it is superfluous
to specify them.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
In caf0d1d61874a2c8687b7deb773eca30ddaee5b6 we documented a policy to
ensure that conformance tests should not rely on the existence or use of
kubelet apis directly. So based on that we should drop conformance for
the two tests here that use the "/logs" endpoint directly.
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
The test "should not change the subpath mount on a container restart if the environment variable changes"
creates a pod with the liveness probe: cat /volume_mount/test.log. The test then
deletes that file, which causes the probe to fail and the container to be restarted.
After which it recreates the file by exec-ing into the pod, but there is a chance
that the container was not created yet, or it did not start yet.
This commit adds a few retries to the exec command.
this is mainly to ensure integration tests (which all end in _test)
are properly bossed around for their imports
I had to adjust some of the _test files to adhere to existing
reverse_rules specified elsewhere
specifically:
- cmd/kubeadm/.import-restrictions
- we don't need to explicitly allow k8s.io repos (external or published)
- rm pkg/controller/.import-restrictions
- pkg/client/unversioned was removed in 59042
- pkg/kubectl/.import-restrictions
- pkg/printers is no longer used
- pkg/api was masking all of the pkg/apis prefixes
- rm staging/src/k8s.io/code-generator/cmd/lister-gen/.import-restrictions
- noop / empty file
- test/e2e/framework/.import-restrictions
- we don't need to explicitly allow k8s.io repos (external or published)
yaml has comments, so we can explain why we have certain rules or
certain prefixes
for those files that weren't already commented yaml, I converted them to
yaml and took a best guess at comments based on the PRs that introduced
or updated them
When a test pattern or storage class uses late binding, the cleanup
code didn't know about the PV that may have been created for the PVC
after it was set up, and thus also didn't wait for PV deletion.
This is problematic for test isolation because the next test was
allowed to start before cleanup had finished. Worse, if the driver
gets removed after the test, the volume might never get deleted.
'docker pull' is a time consuming operation. It makes sense to check
if image exists locally before pulling it from a registry.
Checked if image exists by running 'docker inspect'. Only pull if
image doesn't exist.
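The check-before-pull logic can be sketched as follows. The original change is most likely in a shell script; this Go version is only an illustration. `ensureImage`, `runner`, and `fakeDocker` are hypothetical names, and the `--type=image` flag is an assumption added to restrict `docker inspect` to images.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// runner executes a command and returns an error if it fails.
type runner func(name string, args ...string) error

// execRunner runs the real command; use it as the runner outside of tests.
func execRunner(name string, args ...string) error {
	return exec.Command(name, args...).Run()
}

// ensureImage pulls image only when `docker inspect` cannot find it locally.
// It reports whether a pull was performed.
func ensureImage(run runner, image string) (pulled bool, err error) {
	if err := run("docker", "inspect", "--type=image", image); err == nil {
		return false, nil // already present, skip the slow pull
	}
	if err := run("docker", "pull", image); err != nil {
		return false, fmt.Errorf("pulling %s: %w", image, err)
	}
	return true, nil
}

// fakeDocker simulates a local image store and records the subcommands invoked.
func fakeDocker(local map[string]bool, calls *[]string) runner {
	return func(name string, args ...string) error {
		sub, image := args[0], args[len(args)-1]
		*calls = append(*calls, sub)
		if sub == "inspect" && !local[image] {
			return errors.New("no such image")
		}
		if sub == "pull" {
			local[image] = true
		}
		return nil
	}
}

func main() {
	var calls []string
	run := fakeDocker(map[string]bool{"busybox:1.29": true}, &calls)
	for _, image := range []string{"busybox:1.29", "nginx:1.14"} {
		pulled, err := ensureImage(run, image)
		fmt.Println(image, "pulled:", pulled, "err:", err)
	}
	_ = execRunner // pass execRunner instead of the fake to talk to a real docker CLI
}
```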
The service allocator is used to allocate IP addresses for the
Service IP allocator and NodePorts for the Service NodePort
allocator. It uses a bitmap backed by etcd to store the allocation
and tries to allocate the resources directly from local memory
instead of from etcd, which can cause issues in environments with
high concurrency.
It may happen, in deployments with multiple apiservers, that the
resource allocation information is out of sync. This is more
noticeable with NodePorts, for example:
1. apiserver A creates a service with NodePort X
2. apiserver B deletes the service
3. apiserver A creates the service again
If the allocation data of apiserver A wasn't refreshed after the
deletion by apiserver B, apiserver A fails the allocation because
the data is out of sync. The Repair loops solve the problem later,
but some use cases require better concurrency
in the allocation logic.
We can try to not do the Allocation and Release operations locally,
and try instead to check if the local data is up to date with etcd,
and operate over the most recent version of the data.
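The scenario above can be reproduced with a small in-memory model. This is a simplified sketch, not the apiserver allocator: `store` stands in for the etcd-backed bitmap, the version counter plays the role of a resource version, and all type and function names are hypothetical. For simplicity the sketch always re-reads the backend instead of first checking whether the local copy is current.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errAllocated = errors.New("port already allocated")

// store is an in-memory stand-in for the etcd-backed allocation data.
type store struct {
	mu      sync.Mutex
	version int
	ports   map[int]bool
}

func newStore() *store { return &store{ports: map[int]bool{}} }

func clone(m map[int]bool) map[int]bool {
	c := make(map[int]bool, len(m))
	for k, v := range m {
		c[k] = v
	}
	return c
}

func (s *store) snapshot() (int, map[int]bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.version, clone(s.ports)
}

// compareAndSwap stores ports only if nobody has written since version `expected`.
func (s *store) compareAndSwap(expected int, ports map[int]bool) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.version != expected {
		return false
	}
	s.ports, s.version = clone(ports), s.version+1
	return true
}

// allocator models one apiserver with a local, possibly stale, cache.
type allocator struct {
	backend *store
	cache   map[int]bool
}

func newAllocator(s *store) *allocator {
	_, cache := s.snapshot()
	return &allocator{backend: s, cache: cache}
}

// update applies mutate to the latest backend data and retries on conflict.
func (a *allocator) update(mutate func(map[int]bool) error) error {
	for {
		version, ports := a.backend.snapshot()
		if err := mutate(ports); err != nil {
			return err
		}
		if a.backend.compareAndSwap(version, ports) {
			a.cache = ports
			return nil
		}
	}
}

func (a *allocator) allocate(port int) error {
	return a.update(func(p map[int]bool) error {
		if p[port] {
			return errAllocated
		}
		p[port] = true
		return nil
	})
}

func (a *allocator) release(port int) error {
	return a.update(func(p map[int]bool) error { delete(p, port); return nil })
}

// allocateFromCache mimics the old behaviour of trusting local memory first.
func (a *allocator) allocateFromCache(port int) error {
	if a.cache[port] {
		return errAllocated
	}
	return a.allocate(port)
}

func main() {
	s := newStore()
	a, b := newAllocator(s), newAllocator(s)
	fmt.Println("A allocates:", a.allocate(30080))
	fmt.Println("B releases:", b.release(30080))
	fmt.Println("A from stale cache:", a.allocateFromCache(30080))
	fmt.Println("A from latest data:", a.allocate(30080))
}
```

Operating on the latest snapshot and retrying on a version conflict is what lets apiserver A succeed in step 3 even though its cached data predates the deletion.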
- add ./hack/tools/go.mod, this makes ./hack/tools a distinct module
- hack/tools/tools.go underscore imports bazel related tools, over time we
can add others.
- hack/*.sh scripts will cd to hack/tools and go install tools from there
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
During a parallel test run these tests were observed to cause a
preemption of another test:
```
Apr 19 16:59:06.749: INFO: At 2020-04-19 16:58:52 +0000 UTC - event for pod-init-b6fbd440-dbc2-454a-b31a-ce44266298d1: {default-scheduler } Scheduled: Successfully assigned e2e-init-container-7691/pod-init-b6fbd440-dbc2-454a-b31a-ce44266298d1 to ip-10-0-148-234.us-west-2.compute.internal
Apr 19 16:59:06.750: INFO: At 2020-04-19 16:58:54 +0000 UTC - event for pod-init-b6fbd440-dbc2-454a-b31a-ce44266298d1: {default-scheduler } Preempted: Preempted by e2e-resourcequota-priorityclass-8850/testpod-pclass9 on node ip-10-0-148-234.us-west-2.compute.internal
```
These tests have no need to actually land on a node to validate resource quota and so we can set an impossible scheduling condition. Hopefully we don't have tests that too broadly check impossible scheduling conditions.
The admission cache may take longer to see both ingress classes
than it takes to create the ingress. We must loop until we see
the appropriate error, cleaning up after ourselves as we go.
Don't set a connection deadline for reading, because the read operation will
fail if no data is received after the deadline, and will not keep the
connection in the close_wait status.
Copy csi-hostpath driver manifests from
kubernetes-csi/csi-driver-host-path. It bumps version of all images to the
release shipped along Kubernetes 1.18.
As seen in one case (https://github.com/intel/pmem-csi/issues/587), a
pod can reach the "not running" state although its ephemeral volumes
are still being torn down by kubelet and the CSI driver. What happens
then is that the test returns too early and even deleting the
namespace and thus the pod succeeds before the NodeVolumeUnpublish
really finishes.
To avoid this, StopPod now waits for the pod to really disappear.
The agnhost image used for testing has a `netexec` path which supports
two new flags, `--tls-cert-file` and `--tls-private-key-file`. If the
former is provided, the HTTP server will be upgraded to HTTPS, using the
certificate (and private key) provided.
By default, there are keys already mounted into the container at
`/localhost.crt` and `/localhost.key`, which contain PEM-encoded TLS
certs with IP SANs for `127.0.0.1` and `[::1]`.
This adds 2 new tests covering EndpointSlices, including new coverage of
the self referential Endpoints and EndpointSlices that need to be
created by the API Server and the lifecycle of EndpointSlices from
creation to deletion. This also removes the [feature] indicator from the
name to ensure that this test will run more often now that it is enabled
by default.
Adds reviewers to the OWNERS files in the kubernetes/test/images folder.
The reviewers are added automatically, based on their contributions on
an image (>= 20% code churn).
Note that the code churn is taken into account for authors, and not committers.
Adds OWNERS files for: cuda-vector-add, nonewprivs, pets, redis, volume.
Adds reviewers to the OWNERS files in the kubernetes/test/images folder.
The reviewers are added automatically, based on their contributions on
an image (>= 20% code churn).
Note that the code churn is taken into account for authors, and not committers.
Adds OWNERS files for: apparmor-loader, echoserver, jessie-dnsutils, metadata-concealment,
sample-apiserver.
The build times are a bit high for the image builder (~50 minutes), and they will be a bit higher
when Windows support is added to the other test images. This commit changes the
machineType to N1_HIGHCPU_8.
Reenables Windows test image building. Added DOCKER_CERT_BASE_PATH (default value: $HOME),
which will contain the path where the certificates needed for Remote Docker Connection can
be found.
If a REMOTE_DOCKER_URL was not set for a particular OS version, exclude that image from the
manifest list. This fixes an issue where, if REMOTE_DOCKER_URL was not set for Windows Server 1909,
the Windows images were completely excluded from the manifest list, including for Windows Server 1809
and 1903 which could have been built and pushed.
Sets "test-webserver" as the default CMD for kitten and nautilus. Since they are now based on
agnhost, they should be set to run test-webserver to maintain previous behaviour.
Bumps the agnhost version to 2.13, as 2.12 has already been promoted. 2.13 will contain
Windows support.
Adds Windows support for the kitten and nautilus images, so they can be promoted together
with agnhost (they were not previously promoted).
Adds OWNERS files to: agnhost, busybox, kitten, nautilus.
The timeouts for the two loops inside the test itself are now bounded
by an upper limit for the duration of the entire test instead of
each having its own, rather arbitrary timeout.
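A small sketch of bounding two polling loops with one shared deadline, assuming a context-based helper. `pollUntil`, `runBothLoops`, and the `tick` constant are hypothetical names and are not taken from the test.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// tick is the base unit used for intervals in this sketch.
const tick = time.Millisecond

// pollUntil calls cond every interval until it returns true or ctx is done.
func pollUntil(ctx context.Context, interval time.Duration, cond func() bool) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if cond() {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// runBothLoops runs two polling loops that share a single overall deadline,
// so the second loop only gets whatever time the first one left over.
func runBothLoops(total, interval time.Duration, first, second func() bool) error {
	ctx, cancel := context.WithTimeout(context.Background(), total)
	defer cancel()
	if err := pollUntil(ctx, interval, first); err != nil {
		return fmt.Errorf("first loop: %w", err)
	}
	if err := pollUntil(ctx, interval, second); err != nil {
		return fmt.Errorf("second loop: %w", err)
	}
	return nil
}

func main() {
	start := time.Now()
	calls := 0
	err := runBothLoops(200*tick, 5*tick,
		func() bool { calls++; return calls >= 3 },
		func() bool { return false }, // never satisfied, so the shared deadline fires
	)
	fmt.Println("error:", err)
	fmt.Println("finished within a second:", time.Since(start) < time.Second)
}
```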
The functionality included in the e2e/manifests is useful for writing
e2e tests and will be a good addition to the test framework as a
sub-package.
Signed-off-by: alejandrox1 <alarcj137@gmail.com>
Before https://github.com/kubernetes/kubernetes/pull/83084, `kubectl
apply --prune` can prune resources in all namespaces specified in
config files. After that PR got merged, only a single namespace is
considered for pruning. It is OK if namespace is explicitly specified
by --namespace option, but what the PR does is use the default
namespace (or from kubeconfig) if not overridden by command line flag.
That breaks the existing usage of `kubectl apply --prune` without
--namespace option. If --namespace is not used, there is no error,
and no one notices this issue unless they actually check that pruning
happens. This issue also prevents resources in multiple namespaces in
config file from being pruned.
kubectl 1.16 does not have this bug. Let's see the difference between
kubectl 1.16 and kubectl 1.17. Suppose the following config file:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: foo
  namespace: a
  labels:
    pl: foo
data:
  foo: bar
---
apiVersion: v1
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: bar
  namespace: a
  labels:
    pl: foo
data:
  foo: bar
```
Apply it with `kubectl apply -f file`. Then comment out ConfigMap foo
in this file. kubectl 1.16 prunes ConfigMap foo with the following
command:
    $ kubectl-1.16 apply -f file -l pl=foo --prune
    configmap/bar configured
    configmap/foo pruned
But kubectl 1.17 does not prune ConfigMap foo with the same command:
    $ kubectl-1.17 apply -f file -l pl=foo --prune
    configmap/bar configured
With this patch, kubectl once again can prune the resource as before.
/cluster/kubeadm.sh is used to find the kubeadm binary.
This file is legacy and is removed.
Remove /test/cmd/kubeadm.sh. This file contains a function that is used
to build kubeadm and invoke "make test". Move the function contents
to hack/make-rules/test-cmd.cmd.
Stop sourcing /test/cmd/kubeadm.sh in /test/cmd/legacy-script.sh.
Also remove the --kubeadm-path invocation as this can be handled
with an env. variable directly.
The "error waiting for expected CSI calls" is redundant because it's
immediately followed by checking that error with:
framework.ExpectNoError(err, "while waiting for all CSI calls")
The mock driver gets instructed to return a ResourceExhausted error
for the first CreateVolume invocation via the storage class
parameters.
How this should be handled depends on the situation: for normal
volumes, we just want external-scheduler to retry. For late binding,
we want to reschedule the pod. It also depends on topology support.
The code became obsolete with the introduction of parseMockLogs
because that will retrieve the log itself. For debugging of a running
test the normal pod output logging is sufficient.
parseMockLogs is called potentially multiple times while waiting for
output. Dumping all CSI calls each time is quite verbose and
repetitive. To verify what the driver has done already, the normal
capturing of the container log can be used instead:
csi-mockplugin-0/mock@127.0.0.1: gRPCCall: {"Method":"/csi.v1.Node/NodePublishVolume","Request"...
As seen in some test
runs (https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/89041),
retrieving output can fail with "the server rejected our request for
an unknown reason (get pods csi-mockplugin-0)".
If this is truly an intermittent error, then the existing retry logic in
the callers can deal with this.
Especially related to "uncertain" global mounts. A large refactoring of CSI
mock tests was necessary:
- to be able to script the driver to return errors as required by the test
- to parse the CSI driver logs to check kubelet called the right CSI calls
The e2e TCP CLOSE_WAIT test has to create a server pod and then, from
a client, create a connection without notifying the server
when closing it, so it stays in the CLOSE_WAIT status until it
times out.
The current test uses a simple timeout to wait until the server pod
is ready. It's better to use WaitForPodsReady to wait until
the pod is available, to avoid problems in busy environments like
the CI.
It also deletes the pods once the tests finish to avoid leaking
pods.
The original logic was that dumping can stop (for example, due to
losing the connection to the apiserver) and then will start again as
long as the container exists. That it duplicates output on restarts
is better than skipping output that might not have been dumped yet.
But that logic then also dumped the output of containers that have
terminated multiple times:
- logging is started, dumps all output and stops because the
container has terminated
- next check finds the container again, sees no active logger,
repeats
This wasn't a problem for short-lived logging in a custom
namespace (the way how it is done for CSI drivers in Kubernetes E2E),
but other testsuites (like the one from PMEM-CSI) keep logging running
for the entire test suite duration: there duplicate output became a
problem when adding driver redeployment as part of the suite's run.
To avoid duplicated output for terminated containers, which containers
have been handled is now stored permanently. For terminated containers,
restarting of dumping is prevented. This comes with the risk that if
the previous dumping ended before capturing all output, some output
will get lost.
Marking the start and stop of the log was also useful when streaming
to a single writer and thus gets enabled.
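The bookkeeping described above can be sketched like this. It is a simplified model rather than the podlogs code: `logTracker`, `containerState`, `shouldStart`, and `finished` are hypothetical names.

```go
package main

import "fmt"

// containerState is a simplified view of a container seen during a check.
type containerState struct {
	name       string
	terminated bool
}

// logTracker remembers which containers have an active dumper and which
// containers have ever been handled.
type logTracker struct {
	active  map[string]bool // a dumper is currently running
	handled map[string]bool // stored permanently once dumping has started
}

func newLogTracker() *logTracker {
	return &logTracker{active: map[string]bool{}, handled: map[string]bool{}}
}

// shouldStart reports whether dumping should (re)start for the container.
// Running containers may be restarted after an interruption, which can
// duplicate output. Terminated containers are dumped at most once.
func (l *logTracker) shouldStart(c containerState) bool {
	if l.active[c.name] {
		return false
	}
	if l.handled[c.name] && c.terminated {
		return false
	}
	l.active[c.name] = true
	l.handled[c.name] = true
	return true
}

// finished is called when a dumper stops, for example because the connection
// to the apiserver was lost or the container terminated.
func (l *logTracker) finished(name string) {
	delete(l.active, name)
}

func main() {
	tracker := newLogTracker()
	c := containerState{name: "mock"}
	fmt.Println("first check:", tracker.shouldStart(c))
	tracker.finished("mock")
	fmt.Println("still running, restart:", tracker.shouldStart(c))
	tracker.finished("mock")
	c.terminated = true
	fmt.Println("terminated, restart:", tracker.shouldStart(c))
}
```

The last check returns false, which is the trade-off named above: no duplicated output for terminated containers, at the risk of losing output that an interrupted dumper had not captured yet.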
There were several sshPort values in e2e test packages because
we've migrated code from e2e framework by copying and pasting.
This adds common SSHPort on e2essh package to reduce such duplicated
code.
Conformance tests must not rely on the kubelet API in order to
pass. In this case, I think it's unnecessary to verify that a
kubelet observes the deletion within gracePeriod seconds. The
remaining checks in this test verify that pod deletion happens,
and that the pod is removed.
Conformance tests must not rely on the kubelet API in order to
pass. SchedulerPredicates tests attempt to use the kubelet API
in their BeforeEach, some of which are tagged as Conformance.
Is there a compelling reason to use the kubelet's view of pods
for a given node instead of the apiserver's view of the pods?
we print yaml, so you can use yaml tools like `yq`:
```
e2e.test --list-conformance-tests | yq r - --collect *.testname
```
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
It turns out that the e2e test was not using the timeout used to
hold the CLOSE_WAIT status, hence the test was flaky depending
on how fast it checked the conntrack table.
This PR replaces the dependency on ssh using a pod to check the conntrack
entries on the host in a loop, to make the test more robust
and reduce the flakiness due to race conditions and/or ssh issues.
It also fixes a bug trying to grep the conntrack entry, where
the error was swallowed if a conntrack entry wasn't found.
Integration tests imported e2e test code and the dependency had two drawbacks:
- Hard to move test/e2e/framework into staging (#74352)
- Need to run integration tests always even if PRs are just changing e2e test code
This enables import-boss check for blocking such dependency.
Sometimes the pod has already been cleaned up by the time the test
tried to grab the logs.
Mar 27 16:19:38.066: INFO: Waiting for client-a-jt4tf to complete.
Mar 27 16:19:38.066: INFO: Waiting up to 5m0s for pod "client-a-jt4tf" in namespace "e2e-network-policy-c-9007" to be "success or failure"
Mar 27 16:19:38.072: INFO: Pod "client-a-jt4tf": Phase="Pending", Reason="", readiness=false. Elapsed: 6.270302ms
Mar 27 16:19:40.078: INFO: Pod "client-a-jt4tf": Phase="Pending", Reason="", readiness=false. Elapsed: 2.01233019s
Mar 27 16:19:42.086: INFO: Pod "client-a-jt4tf": Phase="Succeeded", Reason="", readiness=false. Elapsed: 4.020186873s
STEP: Saw pod success
Mar 27 16:19:42.086: INFO: Pod "client-a-jt4tf" satisfied condition "success or failure"
Mar 27 16:19:42.093: FAIL: Error getting container logs: the server could not find the requested resource (get pods client-a-jt4tf)
Full Stack Trace
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/network.checkNoConnectivity(0xc00104adc0, 0xc0016b82c0, 0xc001666400, 0xc000c32000)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/network/network_policy.go:1457 +0x2a0
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/network.testCannotConnect(0xc00104adc0, 0xc0016b82c0, 0x55587e9, 0x8, 0xc000c32000, 0x50)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/network/network_policy.go:1406 +0x1fc
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/network.glob..func13.2.7()
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/network/network_policy.go:285 +0x883
github.com/openshift/origin/pkg/test/ginkgo.(*TestOptions).Run(0xc001e47830, 0xc001e50b70, 0x1, 0x1, 0x0, 0x0)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/pkg/test/ginkgo/cmd_runtest.go:59 +0x41f
main.newRunTestCommand.func1(0xc00121b900, 0xc001e50b70, 0x1, 0x1, 0x0, 0x0)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:238 +0x15d
github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).execute(0xc00121b900, 0xc001e50b30, 0x1, 0x1, 0xc00121b900, 0xc001e50b30)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:826 +0x460
github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc00121b180, 0x0, 0x60d2d00, 0x9887ec8)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:914 +0x2fb
github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).Execute(...)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:864
main.main.func1(0xc00121b180, 0x0, 0x0)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:59 +0x9c
main.main()
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:60 +0x341
STEP: Cleaning up the pod client-a-jt4tf
STEP: Cleaning up the policy.
RestartControllerManager() is kube-controller specific function
and it is better to separate the function as subpackage of e2e
test framework.
In addition, the function made invalid dependency into e2essh.
So this separates the function into e2ekubesystem subpackage.
- Move utilities or constants out so that both of them should be able
to run independently.
- Rename the legacy test so that it can eventually be deleted when the
perf dash changes are done
When deleting fails, the tests should be considered as failed,
too. Ignoring the error caused a wrong return code in the CSI mock
driver to go unnoticed (see
https://github.com/kubernetes-csi/csi-test/pull/250). The v3.1.0
release of the CSI mock driver fixes that.
The function is called from e2e/network test only, so this moves
the function into the test for reducing e2e/framework/util.go code
and removing invalid dependency on e2e test framework.
The function is for persistent volumes and it doesn't have any
reason why it stays in core test framework. So this moves the
function into e2epv package for reducing e2e/framework/util.go
code.
Since 4e7c2f638d the function has been
called from storage vsphere e2e test only. This moves the function
into the test file for
- Reducing test/e2e/framework/util.go which is one of huge files
- Remove invalid dependency on e2e test framework
- Remove unnecessary TODO
for removing invalid dependency from e2e core framework to e2essh
subpackage and reducing test/e2e/framework/util.go code which is
one of huge files today.
WaitForPod*() are just wrapper functions for e2epod package, and they
made an invalid dependency to sub e2e framework from the core framework.
So this replaces WaitForPodTerminated() with the e2epod function.
and they made an invalid dependency to sub e2e framework from the core framework.
So we can use e2epod.WaitTimeoutForPodReadyInNamespace to remove invalid dependency.
The main purpose of this PR is to remove the core framework package's dependency on the pod subpackage.
WaitForPod*() are just wrapper functions for e2epod package, and they
made an invalid dependency to sub e2e framework from the core framework.
So this replaces WaitForPodNoLongerRunning() with the e2epod function.
When kubelet is restarted, it will now remove the resources for huge
page sizes no longer supported. This is required when:
- node disables huge pages
- changing the default huge page size in older versions of linux
(because it will then only support the newly set default).
- Software updates that change what sizes are supported (eg. by changing
boot parameters).
The e2e framework package podlogs is used in e2e/storage/testsuites
only. In addition we considered we should have a single e2e framework
package for pod without the podlogs. So this moves the podlogs into
e2e/storage/podlogs for the e2e storage tests.
Windows test "[sig-windows] [Feature:Windows] Cpu Resources Container
limits should not be exceeded after waiting 2 minutes" should be run
serially to prevent flakiness.
WaitForPod*() are just wrapper functions for e2epod package, and they
made an invalid dependency to sub e2e framework from the core framework.
So this replaces WaitForPodRunning() with the e2epod function.
Adds splitOsArch function to image-util.sh, which makes the script DRY-er.
When building a Windows test image, if REMOTE_DOCKER_URL is not set, skip the rest of the
building process for that image, which will save some time (no need to build binaries).
If a REMOTE_DOCKER_URL was not set for a particular OS version, exclude that image from the
manifest list. This fixes an issue where, if REMOTE_DOCKER_URL was not set for Windows Server 1909,
the Windows images were completely excluded from the manifest list, including for Windows Server 1809
and 1903 which could have been built and pushed.
Sets "test-webserver" as the default CMD for kitten and nautilus. Since they are now based on
agnhost, they should be set to run test-webserver to maintain previous behaviour.
So multiple instances of kube-apiserver can bind on the same address and
port, to provide seamless upgrades.
Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
This change removes support for basic authn in v1.19 via the
--basic-auth-file flag. This functionality was deprecated in v1.16
in response to ATR-K8S-002: Non-constant time password comparison.
Similar functionality is available via the --token-auth-file flag
for development purposes.
Signed-off-by: Monis Khan <mok@vmware.com>
Some e2e tests depend on the controller-manager to expose metrics
on the path /metrics.
It may happen that when the test runs, the pod is not available or the
URL not ready, causing it to fail.
Previously, the tests were waiting until the pod was running, but we
need to wait until the /metrics URL is ready.
The MetricsGrabber may use the controller-manager pod
to gather metrics, however, it doesn't wait until
it is ready to serve, failing the test if this is the
case.
We wait until the controller-manager pod is running
before trying to get metrics from it.
There were framework.ExpectNoError(fmt.Errorf(..)) calls which just
raise an exception without actual value checks, they just raised the
specified error messages. These usages of framework.ExpectNoError()
seemed a little tricky, so this replaces them with corresponding check
functions for the readability.
The configuration file was designed as a yaml file on purpose,
to make it easy to extend the test cases without a need to modify
the testing binary. Also, it's possible to extend the configuration
itself to enrich individual test cases.
The kubelet can race when a pod is deleted and report that a container succeeded
when it instead failed, and thus the pod is reported as succeeded. Create an e2e
test that demonstrates this failure.
After moving Permit() to the scheduling cycle test PermitPlugin should
no longer wait inside Permit() for another pod to enter Permit() and become waiting pod.
In the past this was a way to make test work regardless of order in
which pods enter Permit(), but now only one Permit() can be executed at
any given moment and waiting for another pod to enter Permit() inside
Permit() leads to timeouts.
In this change waitAndRejectPermit and waitAndAllowPermit flags make first
pod to enter Permit() a waiting pod and second pod to enter Permit()
either rejecting or allowing pod.
Mentioned in #88469
Extends agnhost with the capability to validate a mounted token against
the API server's OIDC endpoints.
Co-authored-by: Michael Taufen <mtaufen@google.com>
Close outbound connections when using a cert callback and certificates rotate. This means that we won't get into a situation where we have open TLS connections using expired certs, which would get unauthorized errors at the apiserver.
Attempt to retrieve a new certificate if open connections near expiry, to prevent the case where the cert expires but we haven't yet opened a new TLS connection and so GetClientCertificate hasn't been called.
Move certificate rotation logic to a separate function
Rely on generic transport approach to handle closing TLS client connections in exec plugin; no need to use a custom dialer as this is now the default behaviour of the transport when faced with a cert callback. As a result of handling this case, it is now safe to apply the transport approach even in cases where there is a custom Dialer (this will not affect kubelet connrotation behaviour, because that uses a custom transport, not just a dialer).
Check expiry of the full TLS certificate chain that will be presented, not only the leaf. Only do this check when the certificate actually rotates. Start the certificate as a zero value, not nil, so that we don't see a rotation when there is in fact no client certificate
Drain the timer when we first initialize it, to prevent immediate rotation. Additionally, calling Stop() on the timer isn't necessary.
Don't close connections on the first 'rotation'
Remove RotateCertFromDisk and RotateClientCertFromDisk flags.
Instead simply default to rotating certificates from disk whenever files are exclusively provided.
Add integration test for client certificate rotation
Simplify logic; rotate every 5 mins
Instead of trying to be clever and checking for rotation just before an
expiry, let's match the logic of the new apiserver cert rotation logic
as much as possible. We write a controller that checks for rotation
every 5 mins. We also check on every new connection.
Respond to review
Fix kubelet certificate rotation logic
The kubelet rotation logic seems to be broken because it expects its
cert files to end up as cert data whereas in fact they end up as a
callback. We should just call the tlsConfig GetCertificate callback
as this obtains a current cert even in cases where a static cert is
provided, and check that for validity.
Later on we can refactor all of the kubelet logic so that all it does is
write files to disk, and the cert rotation work does the rest.
Only read certificates once a second at most
Respond to review
1) Don't blat the cert file names
2) Make it more obvious where we have a neverstop
3) Naming
4) Verbosity
Avoid cache busting
Use filenames as cache keys when rotation is enabled, and add the
rotation later in the creation of the transport.
Caller should start the rotating dialer
Add continuous request rotation test
Rebase: use context in List/Watch
Swap goroutine around
Retry GETs on net.IsProbableEOF
Refactor certRotatingDialer
For simplicity, don't affect cert callbacks
To reduce change surface, let's not try to handle the case of a changing
GetCert callback in this PR. Reverting this commit should be sufficient
to handle that case in a later PR.
This PR will focus only on rotating certificate and key files.
Therefore, we don't need to modify the exec auth plugin.
Fix copyright year
Quite a few images are only used a few times in a few tests. Thus,
the images are being centralized into the agnhost image, reducing
the number of images that have to be pulled and used.
This PR replaces the usage of the following images with agnhost:
- mounttest
- mounttest-user
Additionally, removes the usage of the mounttest-user image and removes
it from kubernetes/test/images. RunAsUser is set instead of having that image.
Most of these could have been refactored automatically but it would
have been uglier. The unsophisticated tooling left lots of unnecessary
struct -> pointer -> struct transitions.
This is gross but because NewDeleteOptions is used by various parts of
storage that still pass around pointers, the return type can't be
changed without significant refactoring within the apiserver. I think
this would be good to clean up, but I want to minimize apiserver side
changes as much as possible in the client signature refactor.
The condition was not part of the message and so would not
match:
OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"/var/lib/kubelet/pods/128aea1f-bde3-43d5-8b5f-dd86b9a5ef33/volumes/kubernetes.io~secret/default-token-v55hm\\\" to rootfs \\\"/var/lib/docker/overlay2/813487ba91d534ded546ae34f2a05e7d94c26bd015d356f9b2641522d8f0d6da/merged\\\" at \\\"/var/run/secrets/kubernetes.io/serviceaccount\\\" caused \\\"stat /var/lib/kubelet/pods/128aea1f-bde3-43d5-8b5f-dd86b9a5ef33/volumes/kubernetes.io~secret/default-token-v55hm: no such file or directory\\\"\"": unknown
Updated the check and regex.
Make sure the SR-IOV device plugin is ready, and that
there are enough SR-IOV devices allocatable before
spinning up test pods.
Signed-off-by: vpickard <vpickard@redhat.com>
1. move the integration test of TaintBasedEvictions to test/integration/node
2. move the e2e test of TaintBasedEvictions to test/e2e/node
3. modify the conformance file to adapt the TaintBasedEvictions test
The current agnhost version is 2.12. Version 2.11 was never built because the
VERSION bumps merged one after the other, and the Image Promoter did not get to
build the 2.11 image.
In the current version, due to how make works, when building all the conformance
images (make all-push WHAT=all-conformance), ALL the images are being built first
before being pushed.
This PR will allow images to be built and pushed immediately afterwards, so the first
images that have been successfully built are already pushed and promotable, even if
the task failed on the last image, or it timed out.
A previous PR (#76838) introduced the ability to build and publish
Windows Test Images to kubernetes/test/images/image-util.sh.
Additionally, that PR also configured the Image Promoter to use a
few Windows Remote Docker build nodes to build the Windows Test Images,
however, there is a minor issue: the build container has a different $HOME
folder than expected (is: /builder/home, expected: /root - since it's the
root user), and the Remote Docker credentials are mounted in /root.
Because of that, image-util.sh cannot find the credentials it needs.
This will have to be properly fixed, but for now, we can just skip
the Windows image building part.
Quite a few images are only used a few times in a few tests. Thus,
the images are being centralized into the agnhost image, reducing
the number of images that have to be pulled and used.
This PR replaces the usage of the following images with agnhost:
- dnsutils
dnsmasq is a Linux specific binary. In order for the tests to also
pass on Windows, CoreDNS should be used instead.
- Search/replace Google Infra kube-cross locations for K8s Infra
- Update kube-cross make targets
- Don't attempt to pre-pull image (docker build --pull)
This prevents CI failures when the image under test doesn't exist
yet in the registry.
- 'make all' now builds and pushes the kube-cross image
- Allow 'TAG' to be specified via env var
- Use 'KUBE_CROSS_VERSION' to represent the kube-cross version
- Tag kube-cross images with both a kubernetes version
('git describe') and a kube-cross version
- Add a GCB (Google Cloud Build) config file (cloudbuild.yaml)
Signed-off-by: Stephen Augustus <saugustus@vmware.com>
We don't want to set the name directly because then starting the pod
can fail when the node is temporarily out of resources
(https://github.com/kubernetes/kubernetes/issues/87855).
For CSI driver deployments, we have three options:
- modify the pod spec with custom code, similar
to how the NodeSelection utility code does it
- add variants of SetNodeSelection and SetNodeAffinity which
work with a pod spec instead of a pod
- change their parameter from pod to pod spec and then use
them also when patching a pod spec
The last approach is used here because it seems more general. There
might be other cases in the future where there's only a pod spec that
needs to be modified.
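The chosen approach can be sketched roughly as follows. The types here are simplified, hypothetical stand-ins for the Kubernetes API types, and `setNodeAffinity` only mirrors the idea of the helper named above, not its real signature. Because the helper takes a pod spec, it works both for a full pod and for a bare spec such as the template in a driver deployment.

```go
package main

import "fmt"

// nodeAffinity is a simplified stand-in for a required node affinity term
// that matches nodes by name. It is not the real API type.
type nodeAffinity struct {
	RequiredNodeNames []string
}

// podSpec and pod are simplified stand-ins for the corresponding API types.
type podSpec struct {
	NodeName string // left empty so the scheduler still places the pod
	Affinity *nodeAffinity
}

type pod struct {
	Name string
	Spec podSpec
}

// setNodeAffinity restricts the spec to nodeName through affinity instead of
// setting spec.NodeName directly. It accepts a pod spec rather than a pod,
// so callers that only hold a spec can use it as well.
func setNodeAffinity(spec *podSpec, nodeName string) {
	if spec.Affinity == nil {
		spec.Affinity = &nodeAffinity{}
	}
	spec.Affinity.RequiredNodeNames = append(spec.Affinity.RequiredNodeNames, nodeName)
}

func main() {
	// Case 1: a regular pod; pass a pointer to its spec.
	p := pod{Name: "test-pod"}
	setNodeAffinity(&p.Spec, "node-a")

	// Case 2: only a pod spec is available, e.g. a template being patched.
	template := podSpec{}
	setNodeAffinity(&template, "node-a")

	fmt.Println(p.Spec.NodeName == "", p.Spec.Affinity.RequiredNodeNames)
	fmt.Println(template.NodeName == "", template.Affinity.RequiredNodeNames)
}
```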
A previous PR replaced the usage of Redis in the guestbook app test
with Agnhost. The replacement went well for Linux setups and Containers,
which is why the tests are green, but there is a network particularity on
Windows setups which won't allow the test to pass.
The same issue was observed with another test:
https://github.com/kubernetes/kubernetes/issues/83072
Here's exactly what happens during the test:
- frontend containers are created, having the /guestbook endpoint. Their main
purpose is to forward the call to either agnhost-master (cmd=set), or
agnhost-slave (cmd=get).
- agnhost-master container is created, having the /set endpoint, and the
/register endpoint, through which the agnhost-slave containers would
register to it. Its purpose is to propagate all data received through /set
to its clients.
- agnhost-slave containers are created, having the /set and /get endpoints.
They would register to agnhost-master, and then receive any and all updates
from it, which are then served through the /get endpoint.
For simplicity, all 3 types use the same agnhost subcommand (agnhost guestbook), which
can fulfill each of these roles. For this, HTTP servers were being used, including
for the /register endpoints. agnhost-master would send its /set updates as /set HTTP
requests. However, because of the issue listed above, agnhost-master did not receive
the client's IP, but rather the container host's IP, resulting in the request being
sent to the wrong destination.
This PR updates the agnhost guestbook subcommand. Now, the agnhost subscriber nodes will
send their own IP to the /register endpoint (/endpoint?host=myip).
In order to promote the volume limits e2e test (from CSI Mock driver)
to Conformance, we can't rely on specific output of optional Condition
fields. Thus, this commit changes the test to only check the presence
of the right condition and verify that the optional fields are not empty.
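The relaxed check could look roughly like this. The `condition` type is a simplified stand-in for a Kubernetes condition, `checkCondition` is a hypothetical helper, and treating `Reason` and `Message` as the optional fields is an assumption; the values in `main` are illustrative only.

```go
package main

import "fmt"

// condition is a simplified stand-in for a Kubernetes status condition.
type condition struct {
	Type    string
	Status  string
	Reason  string // optional in the API; exact text is not asserted
	Message string // optional in the API; exact text is not asserted
}

// checkCondition succeeds if a condition of the wanted type is present and
// its optional fields are non-empty. It never compares their exact content.
func checkCondition(conds []condition, wantType string) error {
	for _, c := range conds {
		if c.Type != wantType {
			continue
		}
		if c.Reason == "" {
			return fmt.Errorf("condition %q has an empty reason", wantType)
		}
		if c.Message == "" {
			return fmt.Errorf("condition %q has an empty message", wantType)
		}
		return nil
	}
	return fmt.Errorf("condition %q not found", wantType)
}

func main() {
	conds := []condition{{
		Type:    "PodScheduled",
		Status:  "False",
		Reason:  "Unschedulable",
		Message: "illustrative message; its wording is never compared",
	}}
	fmt.Println(checkCondition(conds, "PodScheduled"))
	fmt.Println(checkCondition(conds, "Ready"))
}
```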