- Use the preemptor pod's Status.NominatedNodeName to signal success of the preemption behavior (see the sketch after this list)
- Optimize the test to eliminate unnecessary Pod creation
- Increase timeout from 1 minute to 2 minutes
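A minimal sketch of how the NominatedNodeName signal can be polled, assuming a client-go clientset (cs), the test namespace, and the preemptor pod name; names and timeouts here are illustrative, not the exact test code:

    import (
        "context"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
    )

    // waitForNomination polls until the scheduler records a nominated node on the
    // preemptor pod, which signals that preemption was triggered on its behalf.
    func waitForNomination(cs kubernetes.Interface, ns, preemptorName string) error {
        return wait.PollImmediate(time.Second, 2*time.Minute, func() (bool, error) {
            pod, err := cs.CoreV1().Pods(ns).Get(context.TODO(), preemptorName, metav1.GetOptions{})
            if err != nil {
                return false, err
            }
            return len(pod.Status.NominatedNodeName) > 0, nil
        })
    }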
GetPodLogs always fails when the tests fail, because the tests
specify the wrong container names when getting logs.
When creating a client Pod, the test specifies "<podName>-container" as
the container name and "<podName>-" as the Pod's GenerateName. For instance,
podName "client-a" results in "client-a-container" as the container
name and "client-a-vx5sv" as the actual Pod name, but the test always uses the
actual Pod name to construct the container name when getting logs, e.g.
"client-a-vx5sv-container".
This patch fixes it by specifying the same static container name when
creating the Pod and when getting logs.
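A minimal sketch of the idea, using a fixed container-name constant; the helper, image, and log call below are illustrative, not the exact test code:

    import (
        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // clientContainerName is static, so log retrieval does not have to guess a
    // name derived from the generated pod name.
    const clientContainerName = "client-container"

    func newClientPod(podName, image string) *v1.Pod {
        return &v1.Pod{
            ObjectMeta: metav1.ObjectMeta{GenerateName: podName + "-"},
            Spec: v1.PodSpec{
                Containers: []v1.Container{{
                    Name:  clientContainerName, // independent of the generated pod name
                    Image: image,
                }},
                RestartPolicy: v1.RestartPolicyNever,
            },
        }
    }

    // When fetching logs, pass the same constant, e.g.:
    //   logs, err := framework.GetPodLogs(cs, ns, createdPod.Name, clientContainerName)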
ServiceStartTimeout is defined both in the e2e core framework and in
endpoints/ports.go. The core framework's constant is used in many places,
but the one in endpoints/ports.go is not used anywhere else.
So this removes the one in endpoints/ports.go as a cleanup.
The manifest list is stateful, which means that the same list will get amended
with each successive image published. That is unintended and can lead to the
wrong image being pulled from the manifest list.
This resets the manifest list before amending new images into it.
The test was flaky because it required the job to succeed 3 times with a
pseudorandom 50% failure chance within 15 minutes, while there is an
exponential back-off delay (10s, 20s, 40s …) capped at 6 minutes before
recreating failed pods. As 7 consecutive failures (1/128 chance) could
take 20+ minutes, exceeding the timeout, the test failed intermittently
with "timed out waiting for the condition".
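A small sketch of the arithmetic, assuming the back-off sequence described above (doubling from 10s, capped at 6 minutes); it just accumulates the delays a Job would sit through after consecutive failures:

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        const maxDelay = 6 * time.Minute
        delay, total := 10*time.Second, time.Duration(0)
        for failures := 1; failures <= 7; failures++ {
            total += delay
            fmt.Printf("after %d consecutive failures: %v spent in back-off\n", failures, total)
            delay *= 2
            if delay > maxDelay {
                delay = maxDelay
            }
        }
        // 7 failures accumulate 16m30s of back-off alone, already past the
        // 15-minute timeout before counting pod run time and scheduling latency.
    }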
This PR forces the Pods of a Job to be scheduled to a single node and
uses a hostPath volume instead of an emptyDir to persist data across new
Pods.
The e2e test "Kubectl Port forwarding With a server listening .."
sometimes fails due to a difference between the expected data and the
received data. To receive the data, the test calls CloseWrite(), but
it did not have the corresponding error handling.
This adds that handling to help investigate the issue further.
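A minimal sketch of the added handling, assuming the test holds a *net.TCPConn (or any connection exposing CloseWrite); the logging helper is illustrative:

    import (
        "net"

        "k8s.io/kubernetes/test/e2e/framework"
    )

    // closeWriteSide half-closes the connection so the server sees EOF and sends
    // its response, and surfaces any error instead of silently dropping it.
    func closeWriteSide(conn *net.TCPConn) {
        if err := conn.CloseWrite(); err != nil {
            framework.Logf("error closing the write side of the connection: %v", err)
        }
    }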
Not all errors surface synchronously from Instances.Insert(...).Do(), so
it is important to inspect the returned operation object to see why the
insert failed. One example is exceeding the resource quota.
E.g.:
could not create instance test-cos-beta-80-12739-29-0: [&{Code:QUOTA_EXCEEDED Location: Message:Quota 'CPUS' exceeded. Limit: 24.0 in region europe-west6. ForceSendFields:[] NullFields:[]}
This fixes the issue where tests would otherwise fail "silently" when the
instance insert fails.
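A rough sketch of the kind of check this adds, assuming the google.golang.org/api/compute/v1 client; in practice the operation usually needs to be polled to completion before its Error field is populated, and the helper name here is illustrative:

    import (
        "fmt"

        compute "google.golang.org/api/compute/v1"
    )

    // checkOperation surfaces per-operation errors (such as QUOTA_EXCEEDED) that
    // do not show up as a Go error from Instances.Insert(...).Do().
    func checkOperation(op *compute.Operation, instanceName string) error {
        if op.Error != nil && len(op.Error.Errors) > 0 {
            return fmt.Errorf("could not create instance %s: %v", instanceName, op.Error.Errors)
        }
        return nil
    }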
This is currently the top flake against PRs, so I'm tagging it
as [Flaky]. Flaky tests can't be conformance tests, so I'm
removing it from [Conformance] as well until this is resolved.
It turns out that the e2e test was not using the timeout used to
hold the CLOSE_WAIT status, so the test was flaky depending
on how fast it checked the conntrack table.
This PR replaces the dependency on SSH with a pod that checks the conntrack
entries on the host in a loop, to make the test more robust
and reduce the flakiness due to race conditions and/or SSH issues.
It also fixes a bug in grepping for the conntrack entry, where
the error was swallowed if a conntrack entry was not found.
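A rough sketch of the polling approach, assuming a hypothetical execInPod helper that runs a command inside the host-network checker pod (the real change uses the e2e framework's exec utilities); command and timings are illustrative:

    import (
        "fmt"
        "strings"
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
    )

    // execInPodFunc is a hypothetical helper that runs cmd inside the checker pod.
    type execInPodFunc func(cmd string) (string, error)

    // waitForCloseWaitEntry polls the conntrack table from inside a pod instead of
    // over SSH, and does not swallow the error when the entry is missing: it keeps
    // polling until the entry appears or the timeout expires.
    func waitForCloseWaitEntry(execInPod execInPodFunc, dstPort int) error {
        return wait.PollImmediate(2*time.Second, time.Minute, func() (bool, error) {
            out, err := execInPod(fmt.Sprintf("conntrack -L -p tcp --dport %d 2>/dev/null || true", dstPort))
            if err != nil {
                return false, err
            }
            return strings.Contains(out, "CLOSE_WAIT"), nil
        })
    }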
It seems that the Image Promoter is running containers without the -t flag, which causes the error:
    the input device is not a TTY
Removing the -it from the docker command in kubernetes/test/images/image-util.sh solves this.
Issue: https://github.com/kubernetes/kubernetes/issues/86884
This PR fixes an issue where the err variable gets shadowed. Because of
this, a nil pointer error might occur.
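For illustration only, the shape of the bug (not the actual code): a ":=" inside an inner block declares new variables that shadow the outer ones, so the outer error check never fires and a nil value is dereferenced later.

    package main

    import "fmt"

    type thing struct{ name string }

    func get(ok bool) (*thing, error) {
        if !ok {
            return nil, fmt.Errorf("not found")
        }
        return &thing{name: "demo"}, nil
    }

    func main() {
        var t *thing
        var err error
        if true {
            t, err := get(false) // BUG: ":=" declares new t and err, shadowing the outer ones
            _, _ = t, err        // the failure stays in this block and is then lost
        }
        if err != nil { // outer err is still nil, so the error is never reported
            fmt.Println("error:", err)
            return
        }
        fmt.Println(t.name) // panics: outer t was never assigned
    }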
Change-Id: Ib7da918418a7c8148a6ca598db12b3744eb3b7c8
Prior to the Image Centralization part 4 (https://github.com/kubernetes/kubernetes/pull/81170),
a PR was merged that enables the Image Promoter to run on the k/k test images.
The Image Promoter currently only builds the Conformance-related images, but the
Image Centralization part 4 centralized some of those images into agnhost, so they
need to be removed from the conformance_images list.
Additionally, https://github.com/kubernetes/kubernetes/pull/81226 proposes removing the
mounttest-user image and using RunAsUser in tests instead.
The image used by the Image Promoter (gcr.io/k8s-testimages/gcb-docker-gcloud:v20190906-745fed4)
is based on busybox, and thus, the sed binary is actually busybox. image-util.sh calls
kube::util::ensure-gnu-sed several times, which ensures that a GNU sed binary exists
(it checks by grepping for GNU in its --help output). Obviously, it won't match the busybox sed
binary. But the sed usage in image-util.sh is fairly simple, and the busybox sed is sufficient.
Bumps image versions for: jessie-dnsutils, nonewprivs, resource-consumer, sample-apiserver. These
images are included in the conformance_images that are being built by the Image Promoter, so
we're bumping them just to make sure we're not breaking anything and causing all the CIs to fail.
We're going to bump the image versions used in tests in a subsequent PR. The image version was not
bumped for: agnhost, kitten, nautilus, as they were already bumped by the Image Centralization part 4
PR.
Remove TODOs for verification of "allowed to post CSR" and
"allowed to auto approve CSR" for the bootstraptoken group,
as these privilege bindings are not managed by the kubeadm
source code.
Signed-off-by: Howard Zhang <howard.zhang@arm.com>
We are currently facing a flaky TestPreemption due to insufficient available
node resources. TestPreemption consists of multiple test cases, and the
resources are shared among them.
At this time, we cannot see which test cases ran before the flake
happens. So it is better to know that, to distinguish whether the cleanup of
pods completed or not.
The log of a flaky test says
"Pod did not start running: timed out waiting for the condition"
but it is hard to know the actual status of the pod.
So this adds a debugging message to capture that.
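A minimal sketch of the kind of debugging output this adds on timeout, assuming access to the test's clientset; wording and helper name are illustrative:

    import (
        "context"
        "testing"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // logPodStatus dumps the pod's phase and conditions when a wait times out, so
    // the report carries more than "timed out waiting for the condition".
    func logPodStatus(t *testing.T, cs kubernetes.Interface, ns, podName string) {
        pod, err := cs.CoreV1().Pods(ns).Get(context.TODO(), podName, metav1.GetOptions{})
        if err != nil {
            t.Logf("could not fetch pod %s/%s for debugging: %v", ns, podName, err)
            return
        }
        t.Logf("pod %s/%s did not start running: phase=%s conditions=%+v",
            ns, podName, pod.Status.Phase, pod.Status.Conditions)
    }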
DeleteSyncInNamespace() was used only in one e2e node test and in DeleteSync().
In addition, that part of the e2e node test can be replaced with
DeleteSync(). CreateSyncInNamespace() is in the same situation and can be
replaced with CreateSync(). So this replaces these functions and
removes them as a cleanup.
In order for the E2E test images to be automatically built and published
to the staging registry (from which they will be promoted to the regular
E2E test registry), the cloudbuild.yaml file has been added.
The file was added in conformance with [1].
Adds the ability to build all test images:
    make -C test/images WHAT=all-images
[1] https://github.com/kubernetes/test-infra/blob/master/config/jobs/image-pushing/README.md
kubeadm creates the bootstrap token with the extra group
system:bootstrappers:kubeadm:default-node-token, which should be
usable for authentication and signing.
Signed-off-by: Howard Zhang <howard.zhang@arm.com>
While this is a looser check than the original one, I do not think
we can quite expect the NodePublish and NodeUnpublish call counts to match.
The NodePublishVolume call count may not be the same as the NodeUnpublishVolume
call count because the reconciler may have a mount operation queued up
while the previous one is finishing. So it is not unusual to have more than one
NodePublishVolume call for the same pod+volume combination; similarly,
unmount may also run more than once.
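A sketch of what the looser assertion could look like, assuming hypothetical publishCalls/unpublishCalls counters gathered from the mock driver; this is illustrative, not the exact test code:

    import "testing"

    // checkPublishCounts applies the looser expectation: at least one publish and
    // at least one unpublish, rather than strictly equal counts, since the
    // reconciler can legitimately repeat either call for the same pod+volume.
    func checkPublishCounts(t *testing.T, publishCalls, unpublishCalls int) {
        if publishCalls < 1 {
            t.Errorf("expected at least one NodePublishVolume call, got %d", publishCalls)
        }
        if unpublishCalls < 1 {
            t.Errorf("expected at least one NodeUnpublishVolume call, got %d", unpublishCalls)
        }
    }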
We've had a fun case of `Sample API Server using the current
Aggregator` failing due to DNS returning a response for localhost that
is not 127.0.0.1 and does not exist on the node, causing etcd to try to
bind to a non-existent address and consequently fail. Trying to spare
others this fun :)
This reverts commit 4d3d364d2b.
This commit promoted a test to conformance and also changed the test at
the same time. It should have been split up into two PRs to verify
whether the change had the intended effect. The test is now failing
consistently, so reverting.
Currently hollow nodes communicate with kubemark master using public
master IP, which results in each call going through cloud NAT. Cloud NAT
limitations become a performance bottleneck (see kubernetes/perf-tests/issues/874).
To mitigate this, in this change, a second kubeconfig called "internal"
is created. It uses private master IP and is used to set up hollow nodes.
Note that we still need the original kubemark kubeconfig (using public master IP)
to be able to communicate with the master from outside the cluster (when
setting it up or running tests).
Testing:
- set up a kubemark cluster and verified apiserver logs to confirm that calls
  from hollow nodes did not go through NAT
Errors from staticcheck:
test/images/agnhost/dns/common.go:79:6: func runCommand is unused (U1000)
test/images/agnhost/inclusterclient/main.go:75:16: using time.Tick leaks the underlying ticker, consider using it only in endless functions, tests and the main package, and use time.NewTicker here (SA1015)
test/images/agnhost/net/nat/closewait.go:41:5: var leakedConnection is unused (U1000)
test/images/agnhost/netexec/netexec.go:157:17: printf-style function with dynamic format string and no further arguments should use print-style function instead (SA1006)
test/images/agnhost/netexec/netexec.go:250:18: printf-style function with dynamic format string and no further arguments should use print-style function instead (SA1006)
test/images/agnhost/netexec/netexec.go:317:18: printf-style function with dynamic format string and no further arguments should use print-style function instead (SA1006)
test/images/agnhost/netexec/netexec.go:331:19: printf-style function with dynamic format string and no further arguments should use print-style function instead (SA1006)
test/images/agnhost/netexec/netexec.go:345:19: printf-style function with dynamic format string and no further arguments should use print-style function instead (SA1006)
test/images/agnhost/netexec/netexec.go:357:19: printf-style function with dynamic format string and no further arguments should use print-style function instead (SA1006)
test/images/agnhost/netexec/netexec.go:370:19: printf-style function with dynamic format string and no further arguments should use print-style function instead (SA1006)
test/images/agnhost/netexec/netexec.go:380:9: this value of err is never used (SA4006)
test/images/agnhost/netexec/netexec.go:381:17: printf-style function with dynamic format string and no further arguments should use print-style function instead (SA1006)
test/images/agnhost/netexec/netexec.go:386:17: printf-style function with dynamic format string and no further arguments should use print-style function instead (SA1006)
test/images/agnhost/pause/pause.go:43:23: syscall.SIGKILL cannot be trapped (did you mean syscall.SIGTERM?) (SA1016)
test/images/agnhost/serve-hostname/serve_hostname.go:125:15: the channel used with signal.Notify should be buffered (SA1017)
test/images/agnhost/webhook/patch_test.go:131:13: this value of err is never used (SA4006)
test/images/pets/peer-finder/peer-finder.go:155:6: this value of newPeers is never used (SA4006)
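For a couple of the findings above, the idiomatic fixes look roughly like this (a sketch, not the exact patch): SA1015 wants time.NewTicker instead of time.Tick, and SA1017 wants a buffered channel for signal.Notify:

    import (
        "fmt"
        "os"
        "os/signal"
        "syscall"
        "time"
    )

    func example() {
        // SA1015: time.Tick leaks its ticker; use NewTicker and stop it when done.
        ticker := time.NewTicker(30 * time.Second)
        defer ticker.Stop()
        go func() {
            for range ticker.C {
                fmt.Println("tick")
            }
        }()

        // SA1017: signal.Notify must not block on the channel, so buffer it.
        sigs := make(chan os.Signal, 1)
        signal.Notify(sigs, syscall.SIGTERM) // SIGKILL cannot be trapped (SA1016)
        <-sigs
    }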
The storage e2e test suite uses the given CSI driver names in pod names. For
pod names that also get enriched with a prefix and suffix, it is very easy
to exceed the 63-character limit that pod names are subject to, thereby
causing tests to fail.
This change fixes the described problem by omitting the driver name from
the pod name suffix.
It also allows us to drop VolumeResource.VolType.
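For context, the 63-character limit matches the DNS-1123 label length; a quick sketch of checking a candidate name with the apimachinery validation helpers (helper name is illustrative):

    import (
        "fmt"

        "k8s.io/apimachinery/pkg/util/validation"
    )

    // validatePodName reports why a generated name (prefix + driver name + suffix)
    // would be rejected, e.g. when it exceeds 63 characters.
    func validatePodName(name string) error {
        if errs := validation.IsDNS1123Label(name); len(errs) > 0 {
            return fmt.Errorf("invalid pod name %q: %v", name, errs)
        }
        return nil
    }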