We can cache the powershell-helper image's results into a scratch Linux image using
docker buildx. This will allow us to spend less time pulling the data we need from the
powershell-helper image when we need it.
Additionally, docker buildx might have some issues with cross-registry images,
and this caching will allow us to circumvent them.
While adding annotations to the namespace, using the Update API may result in
conflicts such as "the object has been modified; please apply your changes to
the latest version and try again". Use the Patch API to avoid this.
Signed-off-by: Chaitanya Bandi <kbandi@cs.stonybrook.edu>
While labeling the namespace, using the Update API may result in conflicts such
as "the object has been modified; please apply your changes to the latest
version and try again". Use the Patch API to avoid this.
Signed-off-by: Chaitanya Bandi <kbandi@cs.stonybrook.edu>
The Topology Manager e2e tests want to run on a real multi-NUMA system
and want to consume real devices supported by device plugins; SRIOV
devices happen to be the most commonly available such devices.
The tests need to wait for resource availability before actually
running, or they will fail with a false negative that is also
relatively hard to debug.
An optimization was added in commit 56106439cf to minimize the restarts,
speed up the execution and make a nasty, yet not fully understood, flake
with the SRIOV device plugin much less likely.
Unfortunately the pod-scope tests were mistakenly left out of that change.
This patch fixes that.
CI lanes did NOT fail (and will not fail) because the CI machines are
neither multi-NUMA nor expose SRIOV devices, so the relevant portion of
the test is simply skipped, avoiding the issue.
However, the issue resurfaces when running the test suite on bare metal;
this is how we noticed it.
Signed-off-by: Francesco Romani <fromani@redhat.com>
The tests:
"Pod liveness probe, container exec timeout, restart"
"Pod readiness probe, container exec timeout, not ready"
cannot be run against a kubelet older than 1.20.
Tag them with [MinimumKubeletVersion:1.20].
Adds and implements the ResetFieldsProvider interface in order to ensure that
the fieldmanager no longer owns fields that get reset before the object
is persisted.
Co-authored-by: Kevin Wiesmueller <kwiesmul@redhat.com>
Co-authored-by: Kevin Delgado <kevindelgado@google.com>
Reverting af3e118b1f and
2242d0ffc4 as these tests fail when
ExecProbeTimeout feature gate is turned on.
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
It can happen that there are multiple slices, some of them
outdated, so we relax the test to check that at least one slice
has the corresponding fields mirrored.
Add feature gate to disable the GetAllocatableResources API.
The feature gate is in alpha stage, disabled by default.
Add e2e test to demonstrate the behaviour with feature gate disabled.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Document why teardownSRIOVPod has to wait for all the containers
to be gone before returning, and why this is important.
Additionally, change the code to wait for all the containers to be gone,
not just the first. This is both a little cleaner and a little safer,
even though it seems the current code caused no issues so far.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Speed up the cleanup after test cases by deleting pods in separate
goroutines.
The post-test cleanup stage must be done carefully, since pods require
exclusive allocation; the tests must take all the steps to properly
clean up and avoid polluting the environment, but this has a negative
effect on test duration (they take longer).
Hence, we add safe speedups like performing pod deletions in parallel.
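A minimal sketch of the parallel deletion (helper and logging names are
illustrative):

    import (
        "context"
        "sync"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/klog/v2"
    )

    // deletePodsInParallel issues all pod deletions in separate goroutines
    // and waits for every request to return before the cleanup proceeds.
    func deletePodsInParallel(ctx context.Context, c kubernetes.Interface, ns string, names []string) {
        var wg sync.WaitGroup
        for _, name := range names {
            wg.Add(1)
            go func(name string) {
                defer wg.Done()
                // Cleanup is best-effort: errors are logged, not fatal.
                if err := c.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
                    klog.Warningf("failed to delete pod %s/%s: %v", ns, name, err)
                }
            }(name)
        }
        wg.Wait()
    }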
Signed-off-by: Francesco Romani <fromani@redhat.com>
Add e2e tests for the new GetAllocatableResources API.
The tests are added in the `podresources_test` suite
created previously in this series.
Signed-off-by: Francesco Romani <fromani@redhat.com>
When listening on UDP, the reply is sent using a source address which is
the address of the gateway interface. This means that when listening on
any address, the reply can be sent out with a source IP which is different
from the request's target IP. This confuses NAT, and "connectionful" UDP
services do not work.
Here, we force the endpoint to listen on the hostIP and on the podIPs,
to cover both dual-stack and legacy clusters.
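A minimal Go sketch of the difference (illustrative, not the actual endpoint
code): a UDP socket bound to one specific address always sources its replies
from that address, while a socket bound to the wildcard address may not.

    import "net"

    // listenUDPOn binds a UDP listener to a specific IP instead of 0.0.0.0
    // or ::, so replies are guaranteed to be sourced from that IP and
    // conntrack can match them to the original request.
    func listenUDPOn(ip string, port int) (*net.UDPConn, error) {
        return net.ListenUDP("udp", &net.UDPAddr{IP: net.ParseIP(ip), Port: port})
    }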
Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
As discussed during the alpha review, the ReadOnly field is not really
needed because volume mounts can also be read-only. It's a historical
oddity that can be avoided for generic ephemeral volumes as part
of the promotion to beta.
* namespace by name default labelling
Co-authored-by: Jordan Liggitt <jordan@liggitt.net>
Co-authored-by: Abhishek Raut <rauta@vmware.com>
* Make some logic improvement into default namespace label
* Fix unit tests
* minor change to trigger the CI
* Correct some tests and validation behaviors
* Add Canonicalize normalization and improve validation
* Remove label validation that should be dealt by strategy
* Update defaults_test.go
add fuzzer
ns spec
* remove the finalizer thingy
* Fix integration test
* Add namespace canonicalize unit test
* Improve validation code and code comments
* move validation of labels to validateupdate
* spacex will save us all
* add comment to testget
* readability of canonicalize
* Added namespace finalize and status update validation
* comment about ungenerated names
* correcting a missing line on storage_test
* Update the namespace validation unit test
* Add more missing unit test changes
* Let's just blast the value. Also documenting the workflow here
* Remove unnecessary validations
Co-authored-by: Jordan Liggitt <jordan@liggitt.net>
Co-authored-by: Abhishek Raut <rauta@vmware.com>
Co-authored-by: Ricardo Pchevuzinske Katz <ricardo.katz@gmail.com>
Add support to the endpoint slice mirroring controller to mirror
annotations, in addition to labels, but don't mirror the endpoint
triggertime annotation.
Also, fix a bug in the endpointslice mirroring controller that
wasn't updating the mirrored slice with the new labels when
only the endpoint labels were modified.
Defaults and validation are such that the field has to be set when
the feature is enabled, just as for the other boolean fields. This
was missing in some tests, which was okay as long as they ran
with the feature disabled. Once it gets enabled, validation will
flag the missing field as an error.
Other tests didn't run at all.
Updates comment on building dependencies step in the local node test
runner to reflect the binaries that are actually produced.
Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
* Removes discovery v1alpha1 API
* Replaces per Endpoint Topology with a read-only DeprecatedTopology
in GA API
* Adds per Endpoint Zone field in GA API
This is to consume the changes for binding the udp listeners of netexec
to specific addresses.
Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
The current implementation listens on any address for tcp, udp and sctp. There
are some cases where it makes sense to listen on specific addresses
(especially udp, see https://github.com/kubernetes/kubernetes/issues/95565).
This is because UDP is connectionless, and in order for conntrack to
work, the application must ensure that the src of the reply is the same
as the dest of the request. The easiest way to do that is to bind
explicitly to an ip.
Here we pass an optional parameter that contains a comma separated list
of addresses.
Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
- Test was failing due to using `sleep infinity` inside the busybox
container, which was going into a crash loop. `sleep infinity` isn't
supported by the sleep version in busybox, so replace it with a
`while true` sleep loop.
- Replace usage of dbus message emitting from gdbus to dbus-send. The
test was failing on ubuntu which doesn't have gdbus installed.
dbus-send is installed on COS and Ubuntu, so use it instead.
- Replace the check of pod phase with the test util function `PodRunningReady`,
which checks both the phase and the pod ready condition.
- Add some more verbose logging to ease future debugging.
We've dropped the content-type field since it is effectively unbounded
(we had a sec-vuln about this before, actually). We retain all other
fields, despite their unboundedness, because we can now
explicitly set bounds on label values.
Change-Id: Icc483fc6a17ea6382928f4448643cda6f3e21adb
The `apparmor_parser` binary is not really required for a system to run
AppArmor from a Kubernetes perspective. How to apply the profile is more
in the responsibility of lower level runtimes like CRI-O and containerd,
which may do the binary check on their own.
This synchronizes the current libcontainer implementation with the
vendored Kubernetes source code and allows distributions to use
AppArmor, even when they do not have the parser available in
`/sbin/apparmor_parser`.
Signed-off-by: Sascha Grunert <mail@saschagrunert.de>
The server service monitors the kubelet service and restarts it
once the service is down. To avoid double-restarting the kubelet,
we will stop the kubelet service and wait until the kubelet is
restarted and the node is ready.
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
e2e test validates the following service status endpoints
- patchCoreV1NamespacedServiceStatus
- readCoreV1NamespacedServiceStatus
- replaceCoreV1NamespacedServiceStatus
Also includes untested service endpoint
- patchCoreV1NamespacedService
The same SHA cannot be pushed twice to the staging registry. Because some images were
mirrored, their SHAs remained unchanged. This addresses this issue.
Sharing the same connection for multiple streams should have worked,
but ran into unexpected timeouts:
I0227 08:07:49.754263 80029 portproxy.go:109] container "mock" in pod csi-mock-volumes-4037-2061/csi-mockplugin-0 is running
E0227 08:07:49.779359 80029 portproxy.go:178] prepare forwarding csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: dialer failed: unable to upgrade connection: pod not found ("csi-mockplugin-0_csi-mock-volumes-4037-2061")
I0227 08:07:50.782705 80029 portproxy.go:109] container "mock" in pod csi-mock-volumes-4037-2061/csi-mockplugin-0 is running
I0227 08:07:50.809326 80029 portproxy.go:125] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: starting connection polling
I0227 08:07:50.909544 80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #0, 0 open
I0227 08:07:50.912436 80029 portproxy.go:155] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: created a new connection #0
I0227 08:07:50.912503 80029 portproxy.go:286] forward listener for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: got a new connection #0
I0227 08:07:50.913161 80029 portproxy.go:322] forward connection #0 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: remote side closed the stream
E0227 08:07:50.913324 80029 portproxy.go:242] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: an error occurred connecting to the remote port: error forwarding port 9000 to pod 66662ea1ab30b4193dac0102c49be840971d337c802cc0c8bbc074214522bd13, uid : failed to execute portforward in network namespace "/var/run/netns/cni-c15e4e36-dad9-8316-c301-33af9dad5717": failed to dial 9000: dial tcp4 127.0.0.1:9000: connect: connection refused
I0227 08:07:50.913371 80029 portproxy.go:340] forward connection #0 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: closing our side
W0227 08:07:50.913487 80029 server.go:669] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
I0227 08:07:51.009519 80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #1, 0 open
I0227 08:07:51.011912 80029 portproxy.go:155] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: created a new connection #1
I0227 08:07:51.011973 80029 portproxy.go:286] forward listener for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: got a new connection #1
I0227 08:07:51.013677 80029 portproxy.go:322] forward connection #1 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: remote side closed the stream
I0227 08:07:51.013720 80029 portproxy.go:340] forward connection #1 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: closing our side
W0227 08:07:51.013794 80029 server.go:669] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
E0227 08:07:51.017026 80029 portproxy.go:242] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: an error occurred connecting to the remote port: error forwarding port 9000 to pod 66662ea1ab30b4193dac0102c49be840971d337c802cc0c8bbc074214522bd13, uid : failed to execute portforward in network namespace "/var/run/netns/cni-c15e4e36-dad9-8316-c301-33af9dad5717": failed to dial 9000: dial tcp4 127.0.0.1:9000: connect: connection refused
I0227 08:07:51.109515 80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #2, 0 open
I0227 08:07:51.111479 80029 portproxy.go:155] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: created a new connection #2
I0227 08:07:51.111519 80029 portproxy.go:286] forward listener for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: got a new connection #2
I0227 08:07:51.209519 80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #3, 1 open
I0227 08:07:51.766305 80029 csi.go:377] gRPC call: {"Method":"/csi.v1.Identity/Probe","Request":{},"Response":{"ready":{"value":true}},"Error":"","FullError":null}
I0227 08:07:51.768304 80029 csi.go:377] gRPC call: {"Method":"/csi.v1.Identity/GetPluginInfo","Request":{},"Response":{"name":"csi-mock-csi-mock-volumes-4037","vendor_version":"0.3.0","manifest":{"url":"https://k8s.io/kubernetes/test/e2e/storage/drivers/csi-test/mock"}},"Error":"","FullError":null}
I0227 08:07:51.770494 80029 csi.go:377] gRPC call: {"Method":"/csi.v1.Identity/GetPluginCapabilities","Request":{},"Response":{"capabilities":[{"Type":{"Service":{"type":1}}},{"Type":{"VolumeExpansion":{"type":1}}},{"Type":{"Service":{"type":2}}}]},"Error":"","FullError":null}
I0227 08:07:51.772899 80029 csi.go:377] gRPC call: {"Method":"/csi.v1.Controller/ControllerGetCapabilities","Request":{},"Response":{"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":3}}},{"Type":{"Rpc":{"type":10}}},{"Type":{"Rpc":{"type":4}}},{"Type":{"Rpc":{"type":6}}},{"Type":{"Rpc":{"type":5}}},{"Type":{"Rpc":{"type":8}}},{"Type":{"Rpc":{"type":7}}},{"Type":{"Rpc":{"type":12}}},{"Type":{"Rpc":{"type":11}}},{"Type":{"Rpc":{"type":9}}}]},"Error":"","FullError":null}
I0227 08:08:21.209901 80029 portproxy.go:151] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: no connection: error creating error stream: Timeout occurred
I0227 08:08:21.209980 80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #3, 1 open
I0227 08:08:51.211522 80029 portproxy.go:151] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: no connection: error creating data stream: Timeout occurred
I0227 08:08:51.211566 80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #3, 1 open
I0227 08:08:51.213451 80029 portproxy.go:155] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: created a new connection #3
I0227 08:08:51.213498 80029 portproxy.go:286] forward listener for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: got a new connection #3
I0227 08:08:51.309540 80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #4, 2 open
I0227 08:08:52.215358 80029 portproxy.go:322] forward connection #3 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: remote side closed the stream
I0227 08:08:52.215475 80029 portproxy.go:340] forward connection #3 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: closing our side
I0227 08:09:21.310003 80029 portproxy.go:151] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: no connection: error creating error stream: Timeout occurred
I0227 08:09:21.310086 80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #4, 1 open
I0227 08:09:51.311854 80029 portproxy.go:151] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: no connection: error creating data stream: Timeout occurred
I0227 08:09:51.311908 80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #4, 1 open
I0227 08:09:51.314415 80029 portproxy.go:155] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: created a new connection #4
I0227 08:09:51.314497 80029 portproxy.go:286] forward listener for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: got a new connection #4
I0227 08:09:51.409527 80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #5, 2 open
I0227 08:09:52.326203 80029 portproxy.go:322] forward connection #4 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: remote side closed the stream
I0227 08:09:52.326277 80029 portproxy.go:340] forward connection #4 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: closing our side
I0227 08:10:21.409892 80029 portproxy.go:151] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: no connection: error creating error stream: Timeout occurred
I0227 08:10:21.409954 80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #5, 1 open
I0227 08:10:51.411455 80029 portproxy.go:151] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: no connection: error creating data stream: Timeout occurred
I0227 08:10:51.411557 80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #5, 1 open
I0227 08:10:51.413229 80029 portproxy.go:155] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: created a new connection #5
I0227 08:10:51.413274 80029 portproxy.go:286] forward listener for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: got a new connection #5
I0227 08:10:51.509508 80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #6, 2 open
I0227 08:10:52.414862 80029 portproxy.go:322] forward connection #5 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: remote side closed the stream
I0227 08:10:52.414931 80029 portproxy.go:340] forward connection #5 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: closing our side
I0227 08:11:21.509879 80029 portproxy.go:151] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: no connection: error creating error stream: Timeout occurred
I0227 08:11:21.509934 80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #6, 1 open
I0227 08:11:51.511519 80029 portproxy.go:151] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: no connection: error creating data stream: Timeout occurred
I0227 08:11:51.511568 80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #6, 1 open
I0227 08:11:51.513519 80029 portproxy.go:155] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: created a new connection #6
I0227 08:11:51.513571 80029 portproxy.go:286] forward listener for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: got a new connection #6
I0227 08:11:51.609504 80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #7, 2 open
I0227 08:11:52.517799 80029 portproxy.go:322] forward connection #6 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: remote side closed the stream
I0227 08:11:52.517918 80029 portproxy.go:340] forward connection #6 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: closing our side
I0227 08:12:21.609856 80029 portproxy.go:151] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: no connection: error creating error stream: Timeout occurred
I0227 08:12:21.609909 80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #7, 1 open
I0227 08:12:51.611494 80029 portproxy.go:151] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: no connection: error creating data stream: Timeout occurred
I0227 08:12:51.611555 80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #7, 1 open
I0227 08:12:51.613289 80029 portproxy.go:155] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: created a new connection #7
I0227 08:12:51.613343 80029 portproxy.go:286] forward listener for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: got a new connection #7
I0227 08:12:51.709535 80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #8, 2 open
I0227 08:12:52.615858 80029 portproxy.go:322] forward connection #7 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: remote side closed the stream
I0227 08:12:52.615989 80029 portproxy.go:340] forward connection #7 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: closing our side
W0227 08:12:52.616116 80029 server.go:669] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
I0227 08:13:21.709934 80029 portproxy.go:151] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: no connection: error creating error stream: Timeout occurred
I0227 08:13:21.709997 80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #8, 1 open
Feb 27 08:13:30.916: FAIL: Failed to register CSIDriver csi-mock-csi-mock-volumes-4037
Unexpected error:
<*errors.errorString | 0xc002666220>: {
s: "error waiting for CSI driver csi-mock-csi-mock-volumes-4037 registration on node kind-worker2: timed out waiting for the condition",
}
error waiting for CSI driver csi-mock-csi-mock-volumes-4037 registration on node kind-worker2: timed out waiting for the condition
occurred
Instead of trying to use the client-go portforward package as-is, it is
simpler to copy some code from it and then use the HTTP stream
directly. That way we don't need to go through a local listening
socket, and error handling and logging become simpler.
This replaces embedding of JavaScript code into the mock driver that
runs inside the cluster with Go callbacks which run inside the
e2e.test suite itself. In contrast to the JavaScript hooks, they have
direct access to all parameters and can fabricate arbitrary responses,
not just error codes.
Because the callbacks run in the same process as the test itself, it
is possible to set up two-way communication via shared variables or
channels. This opens the door for writing better tests. Some of the
existing tests that poll mock driver output could be simplified, but
that can be addressed later.
For now, only tests using hooks use embedding. How gRPC calls are
retrieved is abstracted behind the CSIMockTestDriver interface, so
tests don't need to be modified when switching between embedding
and remote mock driver.
The function must modify the content of the "creds" pointer, not the
pointer.
Found via hack/verify-staticcheck.sh after importing the code into
Kubernetes. It is uncertain whether this bug had any consequences.
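A minimal sketch of this class of bug (names are illustrative, not the
actual code):

    type Credentials struct{ Token string }

    func setCreds(creds *Credentials, updated Credentials) {
        // Buggy: creds = &updated reassigns only the local copy of the
        // pointer; the caller never sees the new value.
        // Correct: modify the value the pointer points to.
        *creds = updated
    }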
Caught by verify-typecheck.sh after importing the code into
Kubernetes:
ERROR(linux/arm): /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/storage/drivers/csi-test/mock/service/controller.go:404:20: math.MaxUint32 (untyped int constant 4294967295) overflows int
ERROR(linux/arm): /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/storage/drivers/csi-test/mock/service/controller.go:795:20: math.MaxUint32 (untyped int constant 4294967295) overflows int
ERROR(linux/386): /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/storage/drivers/csi-test/mock/service/controller.go:404:20: math.MaxUint32 (untyped int constant 4294967295) overflows int
ERROR(linux/386): /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/storage/drivers/csi-test/mock/service/controller.go:795:20: math.MaxUint32 (untyped int constant 4294967295) overflows int
ERROR(windows/386): /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/storage/drivers/csi-test/mock/service/controller.go:404:20: math.MaxUint32 (untyped int constant 4294967295) overflows int
ERROR(windows/386): /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/storage/drivers/csi-test/mock/service/controller.go:795:20: math.MaxUint32 (untyped int constant 4294967295) overflows int
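One way to avoid the overflow, shown as an illustrative sketch (the actual
fix in controller.go may differ): keep the constant in a type that is wide
enough on all targets instead of letting it default to int.

    import "math"

    // math.MaxUint32 is an untyped constant; assigning it to int overflows
    // on 32-bit platforms, where int is only 32 bits wide.
    var maxCount int64 = math.MaxUint32 // fits on every platform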
Instead of producing our own error message, we can show the original
value and the error from strconv.
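An illustrative sketch of the pattern (field and function names are
hypothetical):

    import (
        "fmt"
        "strconv"
    )

    // parsePort wraps the strconv error, preserving both the offending
    // value and the parser's own diagnosis instead of a hand-written message.
    func parsePort(s string) (uint32, error) {
        v, err := strconv.ParseUint(s, 10, 32)
        if err != nil {
            return 0, fmt.Errorf("invalid value %q: %v", s, err)
        }
        return uint32(v), nil
    }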
This is a verbatim copy of the corresponding files in csi-test v4.0.2.
They'll be modified in future commits to make the code usable when
embedded in e2e.test. Some of those changes may be worthwhile
backporting to csi-test, but this is uncertain at this time.
We don't need much concurrency, and having too many worker threads has
one disadvantage (besides resource usage): when the sidecar loses the
connection to the CSI driver, it calls klog.Fatal, which prints all
goroutines. This can lead to a lot of output.
If MaxSurge is set, the controller will attempt to double up nodes
up to the allowed limit with a new pod, and then when the most recent
(by hash) pod is ready, trigger deletion on the old pod. If the old
pod goes unready before the new pod is ready, the old pod is immediately
deleted. If an old pod goes unready before a new pod is placed on that
node, a new pod is immediately added for that node even past the MaxSurge
limit.
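A sketch of how a DaemonSet might opt in, assuming the API shape added by
this change (the concrete values are examples):

    import (
        appsv1 "k8s.io/api/apps/v1"
        "k8s.io/apimachinery/pkg/util/intstr"
    )

    func surgeStrategy() appsv1.DaemonSetUpdateStrategy {
        maxSurge := intstr.FromString("10%")
        maxUnavailable := intstr.FromInt(0) // typically 0 when surging
        return appsv1.DaemonSetUpdateStrategy{
            Type: appsv1.RollingUpdateDaemonSetStrategyType,
            RollingUpdate: &appsv1.RollingUpdateDaemonSet{
                // Allow extra surge pods per node; the old pod is deleted
                // only once the newer one is ready.
                MaxSurge:       &maxSurge,
                MaxUnavailable: &maxUnavailable,
            },
        }
    }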
The backoff clock is used consistently throughout the daemonset controller
as an injectable clock for the purposes of testing.
The test should run for test drivers which support dynamic
provisioning, but was skipped because of the volume type check:
External Storage [Driver: hostpath.csi.k8s.io]
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/storage/external/external.go:175
[Testpattern: Generic Ephemeral-volume (default fs) [Feature:GenericEphemeralVolume] (late-binding)] ephemeral
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/storage/framework/testsuite.go:50
should support multiple inline ephemeral volumes [BeforeEach]
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/storage/testsuites/ephemeral.go:211
Driver "hostpath.csi.k8s.io" does not support volume type "GenericEphemeralVolume" - skipping
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/storage/external/external.go:255
nginx expects to find its conf and logs folders locally, and fails if it cannot find them.
cd-ing into the nginx folder solves this issue. This is a similar approach to the
echoserver image, which also uses nginx.
Looks like there is a bit of an issue in Blunderbuss (the Prow plugin),
where it prefers to pick reviewers from a parent OWNERS file,
instead of using an approver from the current OWNERS file as
an additional reviewer.
Because log capturing can end due to an error and because a pod has
terminated, it is uncertain whether all log output has been
captured. So far, the code leaned more towards restarting log
capturing based on the rationale that duplicate logs are better than
no logs.
But this is confusing and potentially makes logs much larger, so now
an additional heuristic is used to avoid log capturing when logging
was started already and the pod itself is marked for deletion. That
occurs before the individual containers shut down and get marked as
terminated.
The hack/make-rules/test-cmd.sh script fails with the following errors:
Error: unknown command "convert" for "kubectl"
1. This PR fixes the errors by replacing or removing the use of
the "kubectl convert" option, because it has already been removed.
2. Fix the trailing shellcheck failure as well.
In ./test/cmd/generic-resources.sh line 366:
kube::test::get_object_assert deployment "{{range.items}}{{$image_field0}}:{{end}}" "${IMAGE_NGINX}:${IMAGE_NGINX}:"
The user expectation when calling this method is that the pod should
be ready for the test; however, it only checks that the pod is running,
causing timing issues in busy environments.
For example, if the pod is not ready, kube-proxy or other service
implementations will not forward traffic to it.
We've observed this test causing e2e runs to time out; the ginkgo
behaviour here is unhelpful, in that we don't see any output on a
timeout.
Set a (generous) time limit so we start to get output, and enforce some
limit on the response time.
As a part of cleaning up inactive members (those with no activity within
the past 18 months) from OWNERS files, this commit moves fabxc from an
approver to an emeritus_approver.
As a part of cleaning up inactive members (those with no activity within
the past 18 months) from OWNERS files, this commit removes gmarek as a
reviewer.
A 32-bit php was included in the images, instead of the 64-bit one. The base image
is nanoserver-based, which does not support 32-bit apps. Because of this, httpd
fails to start.
Additionally, we've previously removed the busybox-helper dependency, but it was
left in the httpd images. This removes the dependency from the httpd images.
Test now validates patchAppsV1NamespacedStatefulSetScale endpoint.
Update conformance metadata to include v1.21
Co-authored-by: Stephen Heywood <stephen@ii.coop>
CONTENT_TYPE in this case is `kube-api-content-type=application/vnd.kubernetes.protobuf` and it can be removed since
we don’t see a need for setting it differently in the tests.
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com>
With the memory manager static policy:
- start multiple guaranteed pods and verify that the pods started successfully
- start a workload pod on each NUMA node to load the memory, then start a
pod that requests more memory than any single NUMA node has; the pod should
fail to start with an admission error, because no single NUMA node has
enough memory to start the pod and each NUMA node is already used for a
single NUMA node allocation
The test requires at least two NUMA nodes.
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
Provides basic e2e tests to verify that pods succeed
to start with the MemoryManager enabled.
Verifies both MemoryManager policies, and when the node has
multiple NUMA nodes it will verify the memory pinning.
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
Due to the dockerhub rate limiting, we had to find an alternative solution. We've mirrored the dockerhub
images into our own.
Additionally, our own busybox, httpd, and nginx images also have Windows support.
Current gcRegistry configuration is responsible for both etcd and pause images. We should
use the upstream pause image in testing.
The etcd image doesn't have Windows support yet, so we need to have a separate configuration
for it.
We've added Windows support to the resource-consumer image and 1.8 tag is already promoted.
We need to bump the VERSION, so we can promote the new image.
The PHP release page tends to get updated and changed every time there's a
new release, removing the old ones. Because of this, the PHP link in the
httpd and httpd-new images may become invalid.
Updating the links to the archives solves this issue.
These are the latest stable releases. We should test with those.
The newer external-provisioner no longer needs (and doesn't support)
the --provisioner parameter.
This adds a call to createBalancedPods during the ubernetes_lite scheduling e2es,
which are prone to improper score balancing due to unbalanced utilization.
Instead of allowing the cloud provider to guess at the zones that
should be applied for a cluster under test, allow the explicit list
of zones to consider to be passed as a new test context flag -gce-zones.
Only the GCE test cloud provider recognizes this value because only
the GCE test cloud provider makes assumptions about zones for verifying
values, and the default assumptions for GKE do not always match non-GKE
providers.
A number of e2e tests are useful to run after the system has been
disrupted or is in the progress of being disrupted, but the current
suite and test logic blocks progress waiting for all nodes to be
healthy.
By passing -1 to the --minStartupPods or --allowed-not-ready-nodes flags,
the caller can bypass the wait logic before and after test suites that
would prevent running e2e during disruption. This allows use of parts
of the e2e suite during cluster duress to verify that controllers or
components still function.
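For example, an invocation with --minStartupPods=-1 --allowed-not-ready-nodes=-1
skips both the pre- and post-suite node health checks.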
Adds the httpd and nginx images that are used in tests.
Two different versions of nginx have to be built, and thus they have
different folders. An ALIAS file was added to nginx-new in order to
keep the same image name.
This creates a test similar to "should function for service endpoints using hostNetwork"
for dual stack tests, using the secondary clusterip / nodeip.
Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
This yaml contains GA endpoints currently ineligible for conformance,
along with the reason for their ineligibility and a link with
context to that reason. Its intention is to give a transparent and
community-supported way to track these endpoints.
The current implementation does not check for errors, so any failure in
DialFromNode won't surface.
Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
This is part of the goal for scheduling to remove dependencies on internal
packages for the scheduling framework. It also provides these functions in an
external location for other components and projects to import.
A few details about the image builder postsubmit jobs changed, so the README had to
be updated as well.
Added a few extra bits of information regarding the Windows images.
- resetting `binfmt_misc` is not needed when the build platform is non-amd64 and the
target arch is the same as the build platform
- non-amd64 platforms don't support cross-building well, and there is no
`qemu-user-static` binary able to do that, so skip the cross-build on non-amd64
platforms.
Signed-off-by: Dave Chen <dave.chen@arm.com>
- as soon as a request is received by the apiserver, determine the
timeout of the request and set a new request context with the deadline.
- the timeout filter that times out non-long-running requests should
use the request context as opposed to a fixed 60s wait today.
- the admission and storage layers use the same request context with the
deadline specified.
In the following cases, we use the default timeout enforced by the apiserver:
- if the user has specified a timeout of 0s, this implies no timeout on the user's part.
- if the user has specified a timeout that exceeds the maximum deadline allowed by the apiserver.
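A minimal sketch of the per-request deadline wiring in an HTTP filter
(illustrative; the real apiserver filter is more involved):

    import (
        "context"
        "net/http"
        "time"
    )

    func withRequestDeadline(handler http.Handler, defaultTimeout time.Duration) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
            timeout := defaultTimeout
            // A 0s user timeout, or one above the apiserver maximum, falls
            // back to the default timeout.
            if v := req.URL.Query().Get("timeout"); v != "" {
                if d, err := time.ParseDuration(v); err == nil && d > 0 && d < defaultTimeout {
                    timeout = d
                }
            }
            ctx, cancel := context.WithTimeout(req.Context(), timeout)
            defer cancel()
            // Downstream filters, admission, and the storage layer all
            // observe this deadline through the request context.
            handler.ServeHTTP(w, req.WithContext(ctx))
        })
    }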
Both images are now sane multi-architecture images and should fix the
kube-proxy container image in the same way.
Signed-off-by: Sascha Grunert <mail@saschagrunert.de>
The test "validates that there is no conflict between pods with same
hostPort but different hostIP and protocol" was testing the scheduler
capability to schedule pods on the same node with hostPorts, however,
it wasn´t validating that the HostPorts was working, causing false
positives, because the pods were scheduled, but the HostPort exposed
wasn´t working.
In order to test the HostPort functionality, we have to use HostNetwork
pods, that are incompatible with Windows platforms. Also, since this
is touching both network and scheduling, there is no clear the ownership,
but sig-network is happy to adopt it.
We also add a new test for scheduling only under "scheduling", so Windows
folks can use it to test the scheduled in that platform.
We cannot have any RUN commands in the Windows stage when using docker buildx,
which is why we were using the busybox-helper image. The purpose of the image
was to contain a few things that we would obtain by running a few commands:
- symlinks for the busybox binary
- run vcredist_x64.exe which would also give us the vcruntime140.dll which is
necessary for dig or httpd.
There are alternatives to the commands above that can be achieved in a Linux stage
as well:
- we can create the symlinks in a Linux stage with ln -s. Copying them over to
Windows will allow them to work just as well as if they were being copied over
from a Windows image. The 'Files\' prefix issue to the symlink target still persists.
- we can download the vcruntime140.dll directly, allowing us to skip the vcredist_x64.exe
installation.
In the CPU manager and topology manager e2e tests, it is possible that one
of the steps in a test fails and does not clean up the CPU manager
state file. Move the deletion of the state file to `AfterEach` to guarantee
that the state file is always removed from the node.
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
- fix the issue when the test runs on a node with a single CPU
- fix the issue when the CPU topology has only one core per socket; it can
be easily reproduced by configuring a VM with multiple NUMA nodes where
each socket has only one core
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
The sig-network e2e tests related to services have more than 3k lines.
Some of those e2e tests are related to loadbalancers, which are
cloud-provider specific and have special requirements.
We split up the services file and keep the loadbalancer e2e tests
in their own file and with their own tag, so they are easier to skip
for people who don't run e2e tests on cloud providers.
The default value for the progress is ``auto``, which will eat the output of RUN commands. This makes it a bit hard to debug when issues occur. Changing that option to ``plain`` will ensure that the output is properly kept.
Currently, the image is not working properly because of the apparmor_parser giving this error:
Error relocating /sbin/apparmor_parser: secure_getenv: symbol not found
Updating musl to 1.1.20 or newer will fix this problem.
The metadata-concealment image does not have any BASEIMAGE file, which means
that the image will be built from scratch. In this case, there are a few
fixes that need to be made in the image-build.sh script.
The test does not clean up all the pods it created.
Memory balancing pods are only deleted once the test namespace is,
thus leaving the pods running or in a terminating state when a new test is run.
In case the next test is "[sig-scheduling] SchedulerPredicates [Serial] validates resource limits of pods that are allowed to run",
the test can fail.
Both of these are explicit arguments and are more elegantly logged
in a test framework by logging the arguments to the test.
The namespaces to be deleted are already logged inside
WaitForNamespacesDeleted
The DNS autoscaler test was not correctly counting tainted but
schedulable nodes, meaning that the target count was not correct for
clusters with multiple control-plane nodes (which are often tainted
but schedulable).
The cluster-proportional-autoscaler ignores non-schedulable nodes, but
does not consider taints.
Some of these images didn't have any job run for them. Some of these
images previously failed due to an issue that has been addressed since.
Making a change in their image directory will spawn a postsubmit job
that will build that image.
Changes default cluster DNS domain to empty string to align with the
default kubelet configuration value.
Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
In downstream contexts, it's extremely useful to be able to combine
all the "testable" images in Kubernetes into a single repo so that
a user could mirror these offline in one chunk, and audit the set of
images for changes. For instance, within OpenShift we would like to
have a single place we can place all the images used by all the tests
with a single authentication scheme. While some images are not "real"
and can't be mirrored (for instance, the images that point to an
auth protected registry), that is not the majority.
This code makes it possible to specify an environment variable
KUBE_TEST_REPO that maps the static strings of the registry to a
single repository by placing the uniqueness in a tag. For instance:
KUBE_TEST_REPO=quay.io/openshift/community-e2e-images
would translate `k8s.gcr.io/prometheus-to-sd:v0.5.0` to `quay.io/openshift/community-e2e-images:e2e-30-k8s-gcr-io-prometheus-to-sd-v0-5-0-6JI59Yih4oaj3oQOjRfhyQ`.
The tag is a safe form of the name, plus the index (the constant within
manifest.go), plus a hash of the full input. The length of the tag is
constrained to the minimum of hash + index + the safe name.
The public method is changed to return two maps - index to original
name and index to test repo name. These maps would be the same if
the env var is not set.
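An illustrative sketch of the tag construction described above (the exact
sanitization and hashing in manifest.go may differ):

    import (
        "crypto/sha256"
        "encoding/base64"
        "fmt"
        "regexp"
    )

    // toTestRepoTag maps a fully qualified image name to a tag in the single
    // repository: a safe form of the name, the registry index, and a hash of
    // the full input.
    func toTestRepoTag(index int, image string) string {
        safe := regexp.MustCompile(`[^a-zA-Z0-9]+`).ReplaceAllString(image, "-")
        sum := sha256.Sum256([]byte(image))
        hash := base64.RawURLEncoding.EncodeToString(sum[:])[:22]
        return fmt.Sprintf("e2e-%d-%s-%s", index, safe, hash)
    }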
For debugging purposes, it should be useful to run ``docker version`` and ``docker buildx version`` in order to more easily check and verify issues encountered with the Image Builder.
The test "A node shouldn't be able to create another node" could create
a node during its run, but it doesn't delete it in this case.
This commit addresses this issue.
The "should have correct firewall rules for e2e cluster" test is GCE
specific, and likely specific to the kube-up configuration.
However, the second half of the test is a generic behaviour based test
that verifies that ports are not reachable.
We can split this into two tests, with an eye to running the generic
test in more places.
The ESIPP tests are using a function to poll an HTTP endpoint.
This function failed the framework if the request to the HTTP endpoint
timed out, causing a panic that ginkgo couldn't recover from.
Also, this function was used inside a pollImmediate loop, so it should
return the error instead of failing.
This reverts commit 0ef7f27fc1.
The info is not enough to debug the problems: there are simply no
conntrack entries, and there is no clue as to why.
Another problem is that it dumps the conntrack entries from all
nodes, which takes more than 40 mins in a scale test job with 5000 nodes.
Signed-off-by: pacoxu <paco.xu@daocloud.io>
When Spec.AllocateLoadBalancerNodePorts is "false", a NodePort shall
not be included when computing quota for type:LoadBalancer.
Co-authored-by: uablrek
We are planning to test and support the 20H2 release of Windows, thus,
we need to build test images for it as well. The busybox image already
has a BASEIMAGE entry for it, but we also need to add it to the image-util.sh's
windows_os_versions, so the OS Version can be properly included in the
manifest list.
e2e test validates the following 3 extra endpoints
- readApiregistrationV1APIServiceStatus
- patchApiregistrationV1APIService
- listApiregistrationV1APIService
Dockerhub will introduce rate limiting in November, and a lot of E2E tests
rely on the busybox image. This could potentially become an issue,
causing jobs to fail.
Ideally, we'd have the busybox image mirrored on gcr.io, but that could take
some time. Until then, we can just have the Image Builder mirror the image
for us in the staging registry and use that for tests until this issue is
solved. The busybox image should NOT be promoted out of staging.
During the sig-testing meeting, it was decided that we should do the same
for the other images that are hosted on dockerhub.
Two different versions of httpd and nginx have to be built, and thus they have
different folders. An ALIAS file was added to httpd-new and nginx-new in order
to keep the same image name.
The test creates a pod with a hostPort to expose an SCTP port, then
it checks if the iptables rules were installed correctly on the host.
The iptables rules MUST be checked on the same host where the pod
is running :)
Deprecated metrics are removed; we suggest using the Histogram
metrics obtained from the scheduler extension points.
Signed-off-by: Dave Chen <dave.chen@arm.com>
Co-authored-by: wawa0210 <xiaozhang0210@hotmail.com>
The current logic to check whether a PVC is fully bound is:
1. The PVC's volume name is not empty
2. The annotation "pv.kubernetes.io/bind-completed" is properly set
The test case only sets the annotation and leaves the volume
name to be set by a `FakePVController`.
This causes a problem when running some test cases, like the scheduler's
perf test: when scheduling a pod with a volume, the first try will
always hit the "unbound immediate PersistentVolumeClaims" exception.
As a result, the metric data "schedule_attempts_total" or "scheduling_algorithm_duration_seconds"
will not be accurate enough.
Signed-off-by: Dave Chen <dave.chen@arm.com>
Extract TestSuite, TestDriver, TestPattern, TestConfig
and VolumeResource, SnapshotVolumeResource from the testsuite
package and put them into a new package called api.
The ultimate goal here is to make the testsuites package as clean
as possible, with only test suites in the package.
Clean up all the f.BeforeEach() calls before creating the new framework,
moving all the test skips into the new SkipUnsupportedTests() to make the
structure easier, and provide the standard way of RegisterTests().
Add an InitCustomXXXTestSuite(patterns []patterns) function for each
TestSuite to enable custom test suite definitions.
The PodTopologySpread plugin will only count an existing pod when that
pod's labels match `constraint.Selector`, which means all pods
could be scheduled to one topology zone when the constraint does not
have any selector defined.
Signed-off-by: Dave Chen <dave.chen@arm.com>
WaitForPodSuccessInNamespace[Slow] is replaced by WaitForPodSuccessInNamespaceTimeout(),
so that custom timeouts are used instead of the hardcoded ones.
Before creating and bootstrapping a docker buildx instance, we need to call
register.sh with the -p yes flag. Without this, the docker buildx will only
support linux/amd64 and linux/386 platforms, meaning that it will fail when
trying to build images for other architecture types.
Additionally, the builder has to have qemu and its qemu-* binaries installed
in order to properly build the images. The recently created image
gcr.io/k8s-testimages/gcb-docker-gcloud:v20201130-750d12f has those requirements met.
This test is not working for Windows yet because the commands issued in the
pod are not available on Windows.
Change-Id: Ia0b03afd6dfe0bbb1ab00dc821775450a7e8ce54
Many README files and other docs contained a link to an appspot
tracking app that is no longer active. Following the links leads to an
error about Go 1.9 no longer being supported. Go 1.9 support was dropped
in appspot in 2019 and disabled June 2020.
This also resulted in a broken image link displaying when viewing these
files on GitHub. Since the app is no longer functioning, and since it
causes a potentially (but granted, minor) confusing error to display,
this just removes those links as I don't believe they are needed
anymore.
Signed-off-by: Sean McGinnis <sean.mcginnis@gmail.com>
Some storage tests have commands not available on Windows. Mark them as
LinuxOnly now. Will check later to see whether equivalent Windows
commands are available.
Change-Id: I41b5668c855b2754a2e332cff4e90ebf2981aca0
Removes comment from daemons function that previously indicated that a
check was being run to make sure docker daemon was running.
Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
The e2e test, included as part of Conformance,
"validates that there is no conflict between
pods with same hostPort but different hostIP and protocol"
was only testing that the pods were scheduled without conflict
but was never testing the functionality.
The test should check that pods with containers forwarding the same
hostPort can be scheduled without conflict, and that those exposed
HostPort are forwarding the ports to the corresponding pods.
The predicate tests were using loopback addresses for the
hostPort test; however, those have different semantics depending
on the IP family, i.e. you cannot bind to ::1 and ::2 simultaneously.
In addition, IP forwarding from localhost to localhost does not work in
IPv6, since it doesn't have the kernel route_localnet hack.
Instead of hardcoding fedora:latest, use one of our e2e images as
source inside the created pods. This will allow users who test with
this data outside of integration environments to reference a real
image and avoid spurious errors.
Relaxes matching of pod_memory_working_set_bytes metric so that we won't
error due to presence of other pods.
Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
The e2e topology manager tests want to test the resource alignment using
devices, and the easiest devices to use are SRIOV devices at this
moment.
The resource alignment test cases are run for each supported policy,
in a loop.
The tests manage the SRIOV device plugin; up until now, the plugin
was set up and torn down in each loop.
There is no real need for that. Each loop must reconfigure (thus
restart) the kubelet, but the device plugin can be set up and torn down
just once for all the policies.
The kubelet can reconnect just fine to a running device plugin.
This way, we greatly reduce the interactions and the complexity of the
test environment, making it easier to understand and more robust, and
we trim some minutes from the execution time.
However, this patch also hides (not solves) a test flake we observed
in some environments. The issue is hard to reproduce and not well
understood, but it seems to be caused by doing the SRIOV device plugin
setup/teardown in each policy testing loop.
Investigation so far suggests that the kubelet sometimes has stale
state after the sriovdp teardown/setup cycle, leading to flakes and
false negatives.
We tried to address this in https://github.com/kubernetes/kubernetes/pull/95611
with no conclusive results yet.
This patch was posted because overall we believe its gains
exceed the drawbacks (hiding the aforementioned flake) and
because understanding the potential interaction issues between the
sriovdp and the kubelet deserves a separate test.
Signed-off-by: Francesco Romani <fromani@redhat.com>
A suite of e2e tests was created for the Topology Manager
so as to test the pod scope alignment feature.
Co-authored-by: Pawel Rapacz <p.rapacz@partner.samsung.com>
Co-authored-by: Krzysztof Wiatrzyk <k.wiatrzyk@samsung.com>
Signed-off-by: Cezary Zukowski <c.zukowski@samsung.com>
Since we added tests to check connectivity against pods with
hostNetwork: true, there is the possibility that those pods
fail to run because the port is already in use on the host.
The current tests were using ports 8080, 8081 and 8082, which are
commonly used on hosts by other applications.
If the service is not ready after a certain time, and we are using
pods with hostNetwork: true, we assume that there is a conflict
and skip the test.
Dual-stack services can have two ClusterIPs; we already have tests that
exercise the connectivity from different scenarios to the first
ClusterIP of the service.
This PR adds new functionality to the e2e network utils to enable
dual-stack services, and replicates the same tests using the
secondary ClusterIP, so we cover the connectivity to both cluster IPs.
Add a separate in-tree gcepd driver for Windows clusters, because Windows
does not support certain features of the Linux driver.
Change-Id: I2fca86b3f32f17db7703c46a36944d9ee51f355f
We hardcode the index number in the KubeProxy/Conntrack e2es and
CollectAddresses returns 4 mixed IP Family addresses in a dualstack
cluster. This change ensures that the serverNodeInfo.nodeIP has only
valid addresses for the expected IPFamily per test case.
Signed-off-by: Christopher M. Luciano <cmluciano@us.ibm.com>
To make sure that the storage version filter can block certain requests until
the storage version updates are completed, and that the apiserver works
properly after the storage version updates are done.
Currently, e2e tests run under test-e2e-node have a cluster-domain
equal to "". This change makes test-e2e-node consistent with other
e2e tests. For example, in hack/ginkgo-e2e.sh, cluster-domain
defaults to cluster.local and it can be changed by defining KUBE_DNS_DOMAIN.
* Rename const for topology.../zone
* Rename const for topology.../region
* Rename const for failure-domain.../zone
* Rename const for failure-domain.../region
* Restore old names for compat
This reverts commit 0ed8fd6dc9.
It turns out that ExternalIPs are not allowed to be reachable from
pods until the IP is present in the node.
However, due to a kube-proxy limitation, it was working in environments
that used CNIs without bridges for the pods.
Currently, the Image Builder job is failing as it cannot build images
for other architecture types. This happens because the Image Builder image
does not have any of the expected qemu-* binaries in /usr/bin/ needed in order to
run qemu-binfmt-conf.sh with the -p yes flag, so that flag is removed.
docker buildx requires DOCKER_CLI_EXPERIMENTAL=enabled to be set
in order to be used.
This environment variable is not getting plumbed through from the
test/images/cloudbuild.yaml file, causing the docker buildx commands
to fail.
The default cloudbuild has HOME=/builder/home, and docker buildx is in /root/.docker/cli-plugins/docker-buildx.
We need to set HOME to /root explicitly since we're using docker buildx.
Exec is a utility function, but if you call it and are already planning to either suppress or print the exec command beforehand, its ability to log can be redundant or a hindrance to test readability.
- re-enable e2e_node services
- call GenerateSecureToken for e2e_node Conformance test-suite
- add log messages indicating location in process
- move log messages to some more accurate locations
Since the insecure port of apiserver has been disabled in e2e node tests,
we could create a service account in the test for node problem detector
and then bind the cluster role `system:node-problem-detector` to this
service account.
Signed-off-by: knight42 <anonymousknight96@gmail.com>
Provides a response that includes a body and a method. This response
will enable a client (e2e test) to confirm that a proxy did not alter
the http method.
Service has had a problem since forever:
- User creates a service type=LoadBalancer
- We silently allocate them a NodePort
- User changes type to ClusterIP
- We fail the operation because they did not clear NodePort
They never asked for or used the NodePort!
Dual-stack introduced some dependent fields that get auto-wiped on
updates. This carries it further.
If you squint, you can see Service as a big, messy discriminated union,
with type as the discriminator. Ignoring fields for non-selected
union-modes seems right.
This introduces the potential for an apply loop. Specifically, we will
accept YAML that we did not previously accept. Apply could see the
field in local YAML and not in the server and repeatedly try to patch it
in. But since that YAML is currently an error, it seems like a very low
risk. Almost nobody actually specifies their own NodePort values.
To mitigate this somewhat, we only auto-wipe on updates. The same YAML
would fail to create. This is a little inconsistent. We could
auto-wipe on create, too, at the risk of more potential impact.
To do this properly, we need to know the old and new values, which means
we can not do it in defaulting or conversion. So we do it in strategy.
This change also adds unit tests and updates e2e tests to rely on and
verify this behavior.
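A minimal sketch of the strategy-level wipe described above, using illustrative names rather than the actual kube-apiserver code:

```go
package sketch

import api "k8s.io/kubernetes/pkg/apis/core"

// dropNodePortsOnUpdate is a hypothetical helper: when an update moves the
// Service away from a NodePort-using type, clear the NodePorts instead of
// rejecting the update. The same YAML would still fail to create.
func dropNodePortsOnUpdate(oldSvc, newSvc *api.Service) {
	needsNodePorts := func(s *api.Service) bool {
		return s.Spec.Type == api.ServiceTypeNodePort ||
			s.Spec.Type == api.ServiceTypeLoadBalancer
	}
	// Only auto-wipe on update, and only when the type actually changed
	// away from one that uses NodePorts.
	if needsNodePorts(newSvc) || !needsNodePorts(oldSvc) {
		return
	}
	for i := range newSvc.Spec.Ports {
		newSvc.Spec.Ports[i].NodePort = 0
	}
}
```

Doing this in strategy, rather than defaulting or conversion, is what gives the helper access to both the old and new objects.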
add integration tests to verify the behaviour of the endpoints
and endpointslices controller with dual stack services.
Since services can be single or dual stack, endpoints should be
generated for each IP family in the endpoint slice controller.
The legacy endpoint controller will only generate endpoints
for the first IP family configured in the service.
integration fix
we are missing tests that check the connectivity against services
that have backend pods with hostNetwork: true.
Because the tests run in parallel, it is possible that the pods used as
backends try to bind to the same port, and since all of them use the
host network, the scheduler will fail to create them due to port conflicts,
so we run them serially.
We have to skip networking tests with udp and endpoints using
hostNetwork, because they have a known issue.
NetworkingTest is used to test different network scenarios.
Since new capabilities and scenarios are added, like SCTP or HostNetwork
for pods, we need a way to configure it with minimum disruption and code
changes.
The idiomatic Go way to achieve this is using functional options.
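For illustration, a minimal sketch of the functional-options pattern in this context; the type and option names are made up for the example, not the actual NetworkingTest API:

```go
package sketch

// NetworkingTestConfig and the options below are illustrative only.
type NetworkingTestConfig struct {
	Protocol    string
	HostNetwork bool
}

// Option mutates the config; new scenarios only add new options.
type Option func(*NetworkingTestConfig)

func UseHostNetwork() Option {
	return func(c *NetworkingTestConfig) { c.HostNetwork = true }
}

func WithProtocol(proto string) Option {
	return func(c *NetworkingTestConfig) { c.Protocol = proto }
}

func NewNetworkingTestConfig(opts ...Option) *NetworkingTestConfig {
	c := &NetworkingTestConfig{Protocol: "tcp"} // defaults stay in one place
	for _, opt := range opts {
		opt(c)
	}
	return c
}
```

A caller needing an SCTP scenario with host networking would then write NewNetworkingTestConfig(WithProtocol("sctp"), UseHostNetwork()) without touching existing call sites.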
the e2e test container used for the "Networking Granular Checks: Services"
tests only needs to listen on one port to perform its network checks.
This port is unrelated to the other ports used in the test, so we may
use a different number to avoid possible conflicts.
* api: structure change
* api: defaulting, conversion, and validation
* [FIX] validation: auto remove second ip/family when service changes to SingleStack
* [FIX] api: defaulting, conversion, and validation
* api-server: clusterIPs alloc, printers, storage and strategy
* [FIX] clusterIPs default on read
* alloc: auto remove second ip/family when service changes to SingleStack
* api-server: repair loop handling for clusterIPs
* api-server: force kubernetes default service into single stack
* api-server: tie dualstack feature flag with endpoint feature flag
* controller-manager: feature flag, endpoint, and endpointSlice controllers handling multi family service
* [FIX] controller-manager: feature flag, endpoint, and endpointSlice controllers handling multi family service
* kube-proxy: feature-flag, utils, proxier, and meta proxier
* [FIX] kubeproxy: call both proxier at the same time
* kubenet: remove forced pod IP sorting
* kubectl: modify describe to include ClusterIPs, IPFamilies, and IPFamilyPolicy
* e2e: fix tests that depends on IPFamily field AND add dual stack tests
* e2e: fix expected error message for ClusterIP immutability
* add integration tests for dualstack
the third phase of dual stack is a very complex change in the API;
basically, it introduces Dual Stack services. Main changes are:
- It pluralizes the Service IPFamily field to IPFamilies,
and removes the singular field.
- It introduces a new field IPFamilyPolicyType that can take
3 values to express the "dual-stack(mad)ness" of the cluster:
SingleStack, PreferDualStack and RequireDualStack
- It pluralizes ClusterIP to ClusterIPs.
The goal is to add coverage to the services API operations,
taking into account the 6 different modes a cluster can have:
- single stack: IPv4 or IPv6 (as of today)
- dual stack: IPv4 only, IPv6 only, IPv4 - IPv6, IPv6 - IPv4
* [FIX] add integration tests for dualstack
* generated data
* generated files
Co-authored-by: Antonio Ojea <aojea@redhat.com>
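As an illustration of the pluralized fields, a sketch against the core/v1 Go types (field shapes may differ slightly between releases):

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// dualStackService requires both families, with IPv4 preferred first.
func dualStackService() *v1.Service {
	policy := v1.IPFamilyPolicyRequireDualStack
	return &v1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "demo"},
		Spec: v1.ServiceSpec{
			Selector:       map[string]string{"app": "demo"},
			IPFamilyPolicy: &policy,
			IPFamilies:     []v1.IPFamily{v1.IPv4Protocol, v1.IPv6Protocol},
			Ports:          []v1.ServicePort{{Port: 80}},
		},
	}
}
```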
The kube_proxy SIGDescribe previously only had Network in the title,
which made it more difficult to select just the test cases in the
kube_proxy file and would end up running anything with Network in the
text of the SIGDescribe e2e tests.
Signed-off-by: Christopher M. Luciano <cmluciano@us.ibm.com>
Simulating a cluster with 500 nodes in 3 zones, deploying 3, 12 and 27 Pods belonging to the same service.
Change-Id: I16425594012ea7bd24b888acedb12958360bff97
A spreading test is more meaningful with a greater number of Pods. However, we cannot always expect perfect spreading. We accept a skew of 2 for 5*z Pods, where z is the number of zones.
Change-Id: Iab0de06a95974fbfec604f003b550f15db618ebd
Due to a rebase glitch the fmt.Sprintf() was lost.
This patch restores it, improving the logs' readability.
Signed-off-by: Francesco Romani <fromani@redhat.com>
CRUD operations are the extent of conformance testing that we can add
for NetworkPolicy tests since we require a 3rd party plugin like CNI
for enforcement.
Signed-off-by: Christopher M. Luciano <cmluciano@us.ibm.com>
Adds Windows support for most of the images.
Adds a README explaining the image building process, including the
Windows Container image building process.
A previous commit added a few agnhost related functions that create agnhost
pods / containers for general purposes.
Refactors tests to use those functions.
The CreateSync method includes waiting for the pod to become running
and returns a fresh new pod instance.
In addition, errors are asserted in the method.
Therefore, there is no need for the callers to repeat these operations.
Some, like the error assertions, would never be reached anyway, since
the method itself fails the test from within if they occur.
Signed-off-by: Edward Haas <edwardh@redhat.com>
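A sketch of the intended call pattern, assuming the framework's PodClient as it existed around this change:

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/test/e2e/framework"
)

// createTestPod shows the simplified caller: CreateSync creates the pod,
// waits for it to be Running, and asserts errors internally, so no extra
// waiting or assertions are needed here.
func createTestPod(f *framework.Framework, spec *v1.Pod) {
	pod := f.PodClient().CreateSync(spec)
	framework.Logf("pod %s is running on node %s", pod.Name, pod.Spec.NodeName)
}
```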
I've observed this test occasionally failing due to 403 errors. I think there's something racing within apiserver w/ respect to RBAC and that if this test were more patient, then it would not flake this way.
- Due to performance issues, service controller updates are slow
in large clusters, causing failing tests. Tag can be removed once
performance issues are resolved
Currently, some of the E2E test images have Windows support, and one of the goals is for most of
them to have Windows support. For that, the Image Builder is currently building those Windows
container images using a few Windows Server nodes (for 1809, 1903, 1909) with Remote Docker
enabled which are hosted on an azure subscription dedicated for CNCF.
With this, the Windows nodes dependency is removed entirely, as the images can be also built with
docker buildx. One additional benefit to this is that adding new supported Windows OS versions
to the E2E test images manifest lists becomes a lot easier (we wouldn't have to create a new Windows
Server node that matches that new OS version, assign DNS name, update certificates, etc.), and it
also becomes easier for other people to build their own E2E windows test images.
However, some dependencies are still required to run on a Windows machine. To solve this, we can
just pull helper images: e2eteam/powershell-helper:6.2.7 and e2eteam/busybox-helper:1.29.0. Their
Dockerfiles and a Makefile for them have been included in this commit. If any change is required to
them, then a new image will be built and tagged under a different version, but they are pretty
straight-forward and shouldn't require changes.
However, there is a small concern when it comes to the build time: Windows servercore images are
very large (for example, mcr.microsoft.com/windows/servercore:ltsc2019 is 4.99GB uncompressed, and
about ~2 GB compressed - those images are already cached on the Windows Server builder nodes, so
this isn't an issue there), and we currently support 1809, 1903, and 1909 (soon to add 2004).
This can lead to build times that are too long.
We have changed the base image to nanoserver (uncompressed size: 250MB), but some images still
require some DLLs or some other dependencies that can be fetched from a servercore image.
A separate job has been defined that would build a scratch windows-servercore-cache image monthly,
and then we can just get those dependencies from this cache, which will be very small.
This would be preferred, as the Windows images update periodically, and those dependencies
could be updated as well.
We need to make sure we tear down the sriov device plugin pod
should the tests fail, to avoid leaking pods in the test environment.
Signed-off-by: Francesco Romani <fromani@redhat.com>
The NetworkPolicy tests work by trying to connect to a service by its
name, which means that for the tests that involved creating egress
policies, it had to always create an extra rule allowing egress for
DNS, but this assumed that DNS was running on UDP port 53. If it was
running somewhere else (eg if you changed the CoreDNS pods to use port
5353 to avoid needing to give them the NET_BIND_SERVICE capability)
then the NetworkPolicy tests would fail.
Fix this by making the tests connect to their services by IP rather
than by name, and removing all the DNS special-case rules. There are
other tests that ensure that Service DNS works.
The integration test for pods produces a warning caused by using deprecated
default cluster IPs.
$ make test-integration WHAT=./test/integration/pods GOFLAGS="-v"
W1007 17:25:28.217410 100721 services.go:37] No CIDR for service cluster IPs specified. Default value which was 10.0.0.0/24 is deprecated and will be removed in future releases. Please specify it using --service-cluster-ip-range on kube-apiserver.
This warning appears 36 times after running all tests. This patch removes all
the warnings.
Signed-off-by: Masashi Honma <masashi.honma@gmail.com>
PATCH verb is used when creating a namespace using server-side apply,
while POST verb is used when creating a namespace using client-side
apply.
The difference in path between the two ways to create a namespace led to
an inconsistency when calling webhooks. When server-side apply is used,
the request sent to webhooks has the field "namespace" populated with
the name of namespace being created. On the other hand, when using
client-side apply the "namespace" field is omitted.
This commit aims to make the behaviour consistent and populates the
"namespace" field when creating a namespace using POST verb (i.e.
client-side apply).
On HA API server hiccups we saw a prow job with error:
fail [k8s.io/kubernetes@v1.19.0/test/e2e/e2e.go:284]: Sep 30 17:06:08.313: Error waiting for all pods to be running and ready: 0 / 0 pods in namespace "kube-system" are NOT in RUNNING and READY state in 10m0s
POD NODE PHASE GRACE CONDITIONS
The failure should include the last error from API server, if it's
available:
fail [k8s.io/kubernetes@v1.19.0/test/e2e/e2e.go:284]: Oct 1 11:29:45.220: Error waiting for all pods to be running and ready: 0 / 0 pods in namespace "kube-system" are NOT in RUNNING and READY state in 10m0s
Last error: Get "https://localhost:6443/api/v1/namespaces/kube-system/replicationcontrollers": dial tcp [::1]:6443: connect: connection refused
POD NODE PHASE GRACE CONDITIONS
This fixes a problem when using the framework helper
e2epod.CreateExecPodOrFail() more than once in the same test.
The problem is that this function was creating pods with both the
pod.Name and pod.GenerateName set.
The second pod failed to be created with a 500 with Reason ServerTimeout
indicating a unique name could not be found in the time allotted,
xref https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.19/
With the changes introduced by the agnhost refactoring PR ( #87266 ), this test was left searching for an invalid container name. Updated to the proper name.
fixed syntax, wrote a test
fixed a test
.
1
Update staging/src/k8s.io/apimachinery/pkg/util/intstr/intstr_test.go
Co-Authored-By: Joel Speed <Joel.speed@hotmail.co.uk>
added test
.
fix
fix test
fixed a test
gofmt
lint
fix
function name
validation fix
.
godocs added
.
A previous commit created a few agnhost related functions that create agnhost
pods / containers for general purposes.
Refactors tests to use those functions.
A previous commit added a few agnhost related functions that create agnhost
pods / containers for general purposes.
Refactors tests to use those functions.
Disable subpath test "should be able to unmount after the subpath
directory is deleted" for windows because the test will fail when
deleting a dir while another container still uses it.
having framework.DumpDebugInfo after the Failf was
a noop, and we were losing those potentially helpful
logs when we needed them the most (on a failure)
Signed-off-by: Jamo Luhrsen <jluhrsen@redhat.com>
By passing "oflag=nocache" and "iflag=direct", caching should be
disabled while writing/reading with "dd" to a block device. The
TestConcurrentAccessToSingleVolume() test is known to fail with certain
storage backends (like Ceph RBD) when caching is enabled.
The default BusyBox image used for testing does not support the required
options for "dd". So instead of running with BusyBox, run the test with
a Debian image.
For testing certain features, the BusyBox image does not provide all the
tools that are needed. Notably 'dd' from BusyBox does not support
direct-io that is required for skipping caches while doing writes and
reads on a Block-mode PVC attached to different nodes.
'agnhost' image uses hardcoded 'cluster.local' value for DNS domain.
It leads to failures in a bunch of HPA tests when the test cluster is
configured to use a custom DNS domain and there is no alias for the
default 'cluster.local' one.
So, fix it by reusing its own function for reading DNS domain suffixes.
Signed-off-by: Valerii Ponomarov <kiparis.kh@gmail.com>
In some cases the EndpointSlice controller will create more
EndpointSlices than necessary resulting in some duplication. This is
valid and tests should only fail here if less EndpointSlices than
expected are added.
Testing with the default FS (ext4) is IMO enough; ext2/ext3 do not add much
value. They're handled by the same kernel module anyway.
Leave ext2/ext3 only in GCE PD which is tested in kubernetes/kubernetes CI
jobs regularly to catch regressions.
the test can execute whether or not hosts have ssh
relevant case:
"should be able to up and down services"
"should implement service.kubernetes.io/service-proxy-name"
"should implement service.kubernetes.io/headless"
A previous commit created a few agnhost related functions that create agnhost
pods / containers for general purposes.
Refactors tests to use those functions.
This reverts commit 61490bba46, reversing
changes made to 9ecab1b4b2.
Some methods from the networking e2e tools are dialing from one
container to another container, and failing the test if there was no
connectivity. This PR modified the methods to return an error instead of
failing the test.
However, these methods were used by other tests in the framework, and
they are not checking if the method returns an error, expecting that
the method fail the test. With this change, any connectivity problem
will go unnoticed on the tests that are not asserting the error, so we
need to revert to previous state.
Using Windows nanoserver container images as a base instead of the current
Windows servercore image will reduce the image size by about 10x.
However, the nanoserver image lacks several things we need:
- netapi32.dll
- powershell
- certain powershell commands
- chocolatey cannot be used
When building the nanoserver images, we are going to use a Windows servercore helper,
in which we are going to install the necessary dependencies, and then copy them over
to our nanoserver image, including necessary DLLs.
Other notable changes include:
- switch from wget to curl (wget was a powershell alias).
- implement in code getting the DNS suffix list and DNS server list.
- reimplement getting file permissions for mounttest.
The provided DialContext wraps existing clients' DialContext in an attempt to
preserve any existing timeout configuration. In some cases, we may replace
infinite timeouts with golang defaults.
- scaleio: tcp connect/keepalive values changed from 0/15 to 30/30
- storageos: no change
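A rough sketch of the wrapping described above (illustrative names, not the exact driver code):

```go
package sketch

import (
	"context"
	"net"
	"time"
)

type dialFunc func(ctx context.Context, network, addr string) (net.Conn, error)

// wrapDialContext preserves any existing dialer; when none is configured,
// it substitutes golang default-style timeouts for an infinite one.
func wrapDialContext(inner dialFunc) dialFunc {
	if inner == nil {
		d := &net.Dialer{Timeout: 30 * time.Second, KeepAlive: 30 * time.Second}
		inner = d.DialContext
	}
	return func(ctx context.Context, network, addr string) (net.Conn, error) {
		// Delegate, so timeouts carried by ctx still apply unchanged.
		return inner(ctx, network, addr)
	}
}
```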
When trying to build the s390x image, it would fail when running the apk
command with the following error:
ERROR: Unable to open root: Bad address
ERROR: Failed to open apk database: Bad address
This can be fixed by updating the third_party/multiarch/qemu-user-static/register/register.sh
and third_party/multiarch/qemu-user-static/register/qemu-binfmt-conf.sh scripts
and their usage to a newer version [1].
Additionally, the packages nginx-mod-http-lua and nginx-mod-http-lua-upstream
cannot be found in the regular http://dl-cdn.alpinelinux.org/alpine/v3.9/main/s390x/
repository, but we can use an older one [2].
[1] https://github.com/qemu/qemu/blob/master/scripts/qemu-binfmt-conf.sh
[2] http://dl-cdn.alpinelinux.org/alpine/v3.8/main
All images used by e2e tests must use templates in order to allow
relocation. In addition this is hitting Dockerhub which will be
getting throttled soon.
microk8s runs the kubelet service as `snap.microk8s.daemon-kubelet.service`, instead of `kubelet.service`,
so e2e should use `systemctl list-units *kubelet* --state=running` to find the kubelet service of microk8s.
And same for go_test_conditional_pure.
Instead of aliasing. Aliases are annoying in a number of ways. This is
specifically bugging me now because they make the action graph harder to
analyze programmatically. By using aliases here, we would need to handle
potentially aliased go_binary targets and dereference to the effective
target.
The comment references an issue with `pure = select(...)` which appears
to be resolved considering this now builds.
When updating ephemeral containers, convert Pod to EphemeralContainers
in storage validation. This resolves a bug where admission webhook
validation fails for ephemeral container updates because the webhook
client cannot perform the conversion.
Also enable the EphemeralContainers feature gate for the admission
control integration test, which would have caught this bug.
Even with foreground deletion, removal of the PVs that may have been
created for a pod with generic ephemeral volumes happens
asynchronously, in the worst case after the test has completed and the
driver for the volume got removed.
Perhaps this can be fixed in Kubernetes itself, but for now we need to
deal with it as part of the test.
the endpoints API handler was using the Canonicalize() method to
reorder the endpoints; however, due to differences with the
endpoint controller's RepackSubsets(), the controller was considering
the endpoints different even though they were not, generating unnecessary
updates every resync period.
In 1.19, we made it so that upgrading from client-side `kubectl apply`
to server-side `kubectl apply --server-side` does not conflict.
This means that `kubectl diff --server-side` should work the same:
server-side apply diffing a resource managed originally with client-side apply
should not result in conflicts.
Since the SCTP module verification tests were added, their result may be affected by
running the SCTPConnectivity tests. For this reason, they are now marked as disruptive.
Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
Creates a few agnhost related functions that create agnhost
pods / containers for general purposes.
The following commits will refactor tests to use those functions.
Make pre provisioned snapshots using CSI driver by
1. Take a dynamic snapshot with retain policy
2. Delete the dynamic snapshot and content
3. Create a preprovisioned snapshot with snapshotHandle
This commit adds a preprovisioned test pattern; all snapshots made using
the create snapshot resource become pre-provisioned snapshots. All existing
test cases now run again with pre-provisioned snapshots.
After PR https://github.com/kubernetes/kubernetes/pull/92555, there are a number of gce pd default fs tests skipped. Here the test pattern has SnapshotType set because some provisioning tests use snapshots. But for drivers such as the in-tree gce pd driver, the tests will be skipped because of the logic in skipUnsupportedTest: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/storage/testsuites/base.go#L154
Since multiple drivers might test with the same pattern, I think we need to keep SnapshotType here.
This PR removes that part of the logic in skipUnsupportedTest. This should be OK because all snapshot tests check whether a driver has snapshot capability or not.
Currently if a group is specified for an impersonated user,
'system:authenticated' is not added to the 'Groups' list inside the
request context.
This causes priority and fairness match to fail. The catch-all flow
schema needs the user to be in the 'system:authenticated' or in the
'system:unauthenticated' group. An impersonated user with a specified
group is in neither.
As a general rule, if an impersonated user has passed authorization
checks, we should consider him authenticated.
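A minimal sketch of that rule, using the apiserver's user-info constants (the helper name is illustrative):

```go
package sketch

import "k8s.io/apiserver/pkg/authentication/user"

// withAuthenticatedGroup ensures an impersonated user who passed
// authorization is treated as authenticated, so the catch-all flow
// schema can match.
func withAuthenticatedGroup(groups []string) []string {
	for _, g := range groups {
		if g == user.AllAuthenticated || g == user.AllUnauthenticated {
			return groups
		}
	}
	return append(groups, user.AllAuthenticated)
}
```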
The test "should fail substituting values in a volume subpath with absolute path" creates a pod
with a variable expansion path which is set as an absolute path. However, "/tmp" is not an absolute
path on Windows; it has to be prefixed with the drive letter (C:\tmp). But C:\tmp does not typically
exist on Windows nodes, so we use C:\Users instead.
This test was trying to create an Endpoints resource that the Endpoints
controller would also attempt to create. This could result in a failure
if the Endpoints controller created the resource before the test did.
/metrics/resource/v1alpha1 was deprecated and moved to
/metrics/resource
Renames to remove v1alpha1 from function names and matcher variables.
Pod deletion was taking multiple minutes, so set GracePeriodSeconds to 0.
Commented restart loop during test pod startup.
Move ResourceMetricsAPI out of Orphans by giving it a NodeFeature tag.
API removed in 7b7c73b#88568
Test created 6051664#73946
As discussed in https://github.com/kubernetes/kubernetes/pull/93658,
relying on a watch to deliver all events is not acceptable, not even
for a test, because it can and did (at least for OpenShift testing) lead
to test flakes.
The solution in #93658 was to replace log flooding with a clear test
failure, but that didn't solve the flakiness.
A better solution is to use a RetryWatcher which "under normal
circumstances" (https://github.com/kubernetes/kubernetes/pull/93777#discussion_r467932080)
should always deliver all changes.
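A sketch of swapping a plain watch for a RetryWatcher (assuming a typical client-go setup; the watched resource and names are illustrative):

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	watchtools "k8s.io/client-go/tools/watch"
)

// watchAllChanges restarts the underlying watch from the last seen
// resourceVersion, so under normal circumstances no change is dropped.
func watchAllChanges(ctx context.Context, c kubernetes.Interface, ns, rv string) error {
	lw := &cache.ListWatch{
		WatchFunc: func(opts metav1.ListOptions) (watch.Interface, error) {
			return c.CoreV1().ConfigMaps(ns).Watch(ctx, opts)
		},
	}
	rw, err := watchtools.NewRetryWatcher(rv, lw)
	if err != nil {
		return err
	}
	defer rw.Stop()
	for ev := range rw.ResultChan() {
		_ = ev // handle each delivered event here
	}
	return nil
}
```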
This test might flake when run on a multi-zone cluster (similar to
persistent volumes, see
https://github.com/kubernetes/kubernetes/issues/75776). We don't do
that at the moment, but it's better to fix this anyway.
Since 1.19, endpoint slices are enabled by default, so all the e2e
tests should consider them.
The e2e networking tests for services use the jig object for
all the tests, but it was not taking endpoint slices into account.
This makes the waitForAvailableEndpoint() method consider endpoint slices.
To reduce costs and to increase speed. Formatting a 5GiB volume took too long
in a badly overloaded CI cluster.
All in-tree volume plugins support at least 1GiB volumes.
Although rare, the EndpointSlice controller can create duplicate
EndpointSlices. This is considered a valid state and tests that find
this state should not fail.
In our current mock CSI driver e2e test, we are not waiting
for the CSI driver to register successfully before performing tests,
including provisioning a PVC. This can lead to a timeout when the
csi driver takes longer to register the socket.
This change adds the waiting part so that the system will
wait for up to 10 minutes for the driver to be ready. This
normally won't take this long. However, in a resource-constrained
environment it can take longer than expected.
https://github.com/kubernetes/kubernetes/issues/93358
As a part of cleaning up inactive members (who haven't been active since
beginning of 2019) from OWNERS files, this commit moves abrarshivani to
emeritus_approvers section.
- Test that client-side apply users don't encounter a conflict with
server-side apply for objects that previously didn't track managedFields
- Test that we stop tracking managed fields with `managedFields: []`
- Test that we stop tracking managed fields when the feature is disabled
As of now, the kubelet passes the security context to the container runtime even
if the security context has invalid options for a particular OS. As a result,
the pod fails to come up on the node. This error is particularly pronounced on
Windows nodes, where the kubelet allows Linux-specific options like SELinux,
RunAsUser etc., whereas in the [documentation](https://kubernetes.io/docs/setup/production-environment/windows/intro-windows-in-kubernetes/#v1-container)
we clearly state they are not supported. This PR ensures that the kubelet strips
the security contexts of the pod if they don't make sense on the Windows OS.
This PR fixes a few things for the e2e storage suite to run on a Windows
cluster:
1. Increase timeouts due to longer pod startup time on Windows
2. Only set SELinuxOptions or fsGroup if the OS is not Windows
3. Add VolumeSnapshot delete policy for Windows
Typecheck is still hitting memory limits semi-regularly on periodic CI
jobs. This bumps the default parallelism down to 3 from 4 to make it
slightly less memory intensive.
Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
Some of the tests are negative test cases which are supposed to ensure that those
invalid usecases are handled properly.
However, some of the tests are false positives, they can pass due to various reasons.
One such example is: "should fail substituting values in a volume subpath with absolute path".
This test can pass if:
- the Pod cannot start due to various reasons (e.g.: the container image cannot be pulled or does
not exist).
- the Pod ran to completion, even though the container was not supposed to start in the first place.
- Revert "Fix integration test flake on TestFilter and TestPostFilter"
This reverts commit 94fc18c2dc.
- Relax checking logic on expected Filter/PostFilter counters.
- Move "ForgetPod" after "RunReservePluginsUnreserve", so that the cache would hold the pod to
avoid it's being retried simutaneously until Unreserve is completed.
- Move "assume" ahead of "RunReservePluginsReserve". This is based on the fact that "ForgetPod" is
the last step of failure path, so "assume" should be reversly treated as the first step. The
current failure path is like this:
assume -> reserve -> unreserve -> forgetPod -> recordingFailure
- Make subtests of TestReservePluginUnreserve stateless
When these tests failed it was unclear that the reason for the failure
could have been more EndpointSlices than expected. It was also unclear
what EndpointSlices were actually found when that occurred. This fixes
both of those issues.
There's currently no way to know whether an error is for SCTP or
UDP, for example:
Jul 24 09:55:54.469: INFO: netserver-0[e2e-nettest-3476].container[webserver].log
2020/07/24 09:53:52 Started UDP server
2020/07/24 09:53:52 Error occurred. error:protocol not supported
In this case the "Error occurred. error:protocol not supported" is
actually for the SCTP socket. Make that more apparent.
This adjusts tests that were waiting for Pods to be ready to instead
just wait for them to have IPs assigned to them. This relies on the
associated publishNotReadyAddresses field on Services. Additionally this
increases the length of time we'll wait for EndpointSlices to be garbage
collected from 12s to 30s. Finally, this adds additional logging to
ExpectNoError calls so it's easier to understand where and why a test
failed.
During e2e tests it is possible that we restart the kubelet a
number of times in a short time frame. When that happens, systemd
can fail the service restart with the `Failed with result 'start-limit-hit'.`
error.
To avoid this situation the code will reset the kubelet service start failures
on each call to the kubelet restart command.
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
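A sketch of the idea, shelling out the way the e2e node helpers do; the unit name and error handling are simplified and may differ from the actual code:

```go
package sketch

import (
	"fmt"
	"os/exec"
)

// restartKubelet clears accumulated start failures first, so rapid
// successive restarts don't trip systemd's start-limit-hit.
func restartKubelet() error {
	if out, err := exec.Command("systemctl", "reset-failed", "kubelet").CombinedOutput(); err != nil {
		return fmt.Errorf("reset-failed: %v: %s", err, out)
	}
	if out, err := exec.Command("systemctl", "restart", "kubelet").CombinedOutput(); err != nil {
		return fmt.Errorf("restart: %v: %s", err, out)
	}
	return nil
}
```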
sig-storage tests that delete pods need to wait for owned resources to
also be cleaned up before returning in the case that resources such as
ephemeral inline volumes are being used. This was previously implemented
by modifying the pod delete call of the e2e framework, which negatively
impacted other tests. This was reverted and now the logic has been moved
to StopPodAndDependents, which is local to the sig-storage tests.
Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
Adds check for index out of bounds error instead of panic when passing
container to kubectl exec.
Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
Updates sig-scheduling e2e Nvidia GPU tests to install drivers using
local manifest by default. Currently the DaemonSet is fetched from the
GoogleCloudPlatform/container-enginer-accelerators repo by default.
Using a local manifest allows for manually specifying the
cos-gpu-installer image rather than always using latest. A remote
manifest can still be fetched by setting the
NVIDIA_DRIVER_INSTALLER_DAEMONSET env var.
Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
MasterUpgrade() is called only from cloud/gcp/cluster_upgrade.go.
And the function depends on GCP, so it is nice to move this function
from e2e framework.
The IPAM and scheduler performance tests are currently causing
integration-master job to fail because of timeouts. They were not
previously running as part of integration-master, so we can disable them
without loss of test coverage. They should be re-enabled as part of fix
for #93112.
Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
In #91342 attempting to evict a Pod with a DeletionTimestamp caused
checking of PDBs to be ignored due to the fact that a Pod scheduled for
deletion should not be factored into a disruption budget. However, PDB
eviction tests currently will sometimes select a Pod already scheduled
for deletion, expecting that attempting to evict it will conflict with
the PDB. This updates those tests to make sure a Pod with deletion
timestamp is not selected for eviction when it is intended to violate a
PDB.
Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
bprashanth hasn't been active since the release of v1.11. Removing them
from test/images/pets/OWNERS would leave mkumatag as the sole approver.
But mkumatag is also an approver for test/images/OWNERS so this commit
removes the test/images/pets/OWNERS completely.
Note: we should try to find more OWNERS for test/images/pets instead,
but this cleanup is a short term solution to avoid the bot suggesting
inactive members for reviews and approval.
This removes DeprecatedGetMasterAndWorkerNodes() usage from vsphere e2e
test as deprecated function cleanup.
Then all callers of DeprecatedMightBeMasterNode() have been removed.
So this removes DeprecatedMightBeMasterNode() itself also.
The multi-arch container images used in tests live in quay.io which
doesn't support nesting. By making the /volume/ images repo configurable,
we are able to override them despite our current limitation.
Signed-off-by: Rafael Fonseca <r4f4rfs@gmail.com>
This extends the existing "ephemeral volume" tests to also cover
generic ephemeral inline volumes. They get instantiated for all
drivers (CSI and others) which support persistent volume provisioning,
for several different filesystems.
Configuring the number of inline volumes via a flag with a computed
name had been identified as problematic before and now gets removed
because re-using the tests as a stress test with a higher number of
volumes should be added and configured separately.
Windows test for subPath is failing due to an issue related to
removeUnusedContainers calls. After the image was changed to agnhost, it
automatically has args set by default. However, there are places that use
container commands instead of args, which causes issues.
This is the first step to fix this issue. Next plan to replace
busybox used in Linux with Agnhost which can work for both linux and
windows.
I also mark two subPath tests as LinuxOnly. I think they are not ready
for windows yet. Before, they were passing for the wrong reason: the tests
check for a failed container status, but the container fails for reasons
other than what we expected.
This is useful in case that the pod owns some resources, because then
waiting for the pod ensures that those resources also were removed.
This should not matter at the moment because pods typically are not
owners of any other object, but that will change with the introduction
of generic ephemeral inline
volumes (https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1698-generic-ephemeral-volumes).
- Allow client-side to server-side apply upgrade.
Ensure that a user can change management of an object from client-side apply to
server-side apply without conflicts.
- Allow server-side apply to client-side downgrade.
For an object managed with client-side apply, a user may upgrade to
managing the object with server-side apply, then decide to downgrade.
We can support this downgrade by keeping the last-applied-configuration
annotation for client-side apply updated with server-side apply.
The test steps are as follows:
1. Write some data
2. Take a snapshot
3. Write more data
4. Create a new volume from snapshot
5. Validate data is the old data
1. Use ginkgo before each to do common setup
2. Use volume resource to create SC, PV, PVC and handle cleanup
3. Add SnapshotResource to handle creating and cleanup of VS, VSC, VSClass
4. Add test pattern for deletion policy: Delete vs Retain
5. Use test pattern to determine test behaviour
6. Add test pattern for preprovisioned snapshot (not implemented)
These changes are made to consolidate common setup steps and stop resource
leaks by waiting for objects to be deleted.
By creating CSIStorageCapacity objects in advance, we get the
FailedScheduling pod event if (and only if!) the test is expected to
fail because of insufficient or missing capacity. We can use that as
indicator that waiting for pod start can be stopped early. However,
because we might not get to see the event under load, we still need
the timeout.
Setting testParameters.scName had no effect because
StorageClassTest.StorageClassName isn't used anywhere. Instead, the
storage class name is generated dynamically.
DeprecatedMightBeMasterNode() has been marked as deprecated and we need to
find alternative way for callers of the function.
In NewResourceUsageGatherer(), the function was called to determine whether
the specified pods were running on master nodes, so that the gatherer gathers
those pods' resource usage.
This adds nodeHasControlPlanePods() to distinguish whether the specified pods
are running on nodes which are operating control plane pods (kube-scheduler
and kube-controller-manager), and replaces callers of DeprecatedMightBeMasterNode()
with this new function as a better way.
The kubelet would attempt to create a new sandbox for a pod whose
RestartPolicy is OnFailure even after all container succeeded. It caused
unnecessary CRI and CNI calls, confusing logs and conflicts between the
routine that creates the new sandbox and the routine that kills the Pod.
This patch checks the containers to start and stops creating sandbox if
no container is supposed to start.
If a bearer token is present in a request, the exec credential plugin should accept that as the chosen method of authentication. Judging by an [earlier comment in exec.go](c18bc7e9f7/staging/src/k8s.io/client-go/plugin/pkg/client/auth/exec/exec.go (L217)), this was already intended. This would however not work since UpdateTransportConfig would set the GetCert callback which would then get called by the transport, triggering the exec plugin action even with a token present in the request. See linked issue for further details.
See #87369 for further details.
Signed-off-by: Anders Eknert <anders.eknert@bisnode.com>
When using the entire test name as file name, the name became too
long (> 256 characters, which wasn't supported by all file systems)
and the artifact directory got cluttered.
The original reason (a limitation in Gubernator) no longer applies
because Spyglass is used now for log viewing.
Using NodeWrapper in the integration tests gives more flexibility when
creating nodes. For instance, tests can create nodes with labels or
with a specific sets of resources.
Also, NodeWrapper initialises a node with a capacity of 32 pods, which
can be overridden by the caller. This makes sure that a node is usable
as soon as it is created.
Now the test covers 6 different api calls
- verify create with a get
- verify patch with a list (all namespaces)
- verify delete with a list (single namespace)
The thing is, for this test at least, I'm pretty sure there's nothing
we need to wait on. Instead of waiting for a deleted event, we will
relist configmaps and expect 0, to confirm the deletion took effect
This drops testfiles.ReadOrDie and updates testfiles.Exists to return an
error, forcing the caller to decide whether to call framework.Fail or do
something else.
It makes for a slightly less friendly API, but also means the package is
decoupled from framework again, as per the comments at the top of the
file
Currently when checking for unscheduled pods an exception will be raised
if a pod is not scheduled and the status is unknown. This update modifies
the logic to include any pod without a NodeName in the not scheduled
pods returned.
Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
If a reserve plugin's Reserve method returns an error, there could be
previously allocated resources from successfully completed reserve
plugins that must be unallocated by the corresponding Unreserve
operation. Since Unreserve operations are idempotent, this patch runs
the Unreserve operation of ALL reserve plugins when a Reserve operation
fails.
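In sketch form, with minimal local stand-ins for the scheduler framework types (illustrative, not the real framework code):

```go
package sketch

import "context"

// Minimal stand-ins for the scheduler framework types (illustrative only).
type Status struct{ err error }

func (s *Status) IsSuccess() bool { return s == nil || s.err == nil }

type ReservePlugin interface {
	Reserve(ctx context.Context) *Status
	Unreserve(ctx context.Context)
}

// runReservePlugins shows the fix: when any Reserve fails, Unreserve is
// run for ALL reserve plugins, which is safe because Unreserve is
// idempotent.
func runReservePlugins(ctx context.Context, plugins []ReservePlugin) *Status {
	for _, pl := range plugins {
		if sts := pl.Reserve(ctx); !sts.IsSuccess() {
			for _, p := range plugins {
				p.Unreserve(ctx)
			}
			return sts
		}
	}
	return nil
}
```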
This runs much faster than before. This change removes all of the
async status output because all of the compute time is spent inside
go/packages, with no opportunity to update the status.
Adds testdata code to prove it fails when expected.
When node scheduling tests were updated to use worker instead of master
nodes the GetPodsScheduled function, which is tasked with getting all
scheduled and not scheduled pods inadvertently was changed to ignore all
pods that have an empty NodeName before checking whether pods had been
scheduled or not. This updates the function to include pods without a
NodeName in the check for unscheduled pods.
Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
As its name suggests, DeprecatedMightBeMasterNode is deprecated.
In e2e metrics, the function was used for knowing the master node name to
get metrics from kube-scheduler and kube-controller-manager pods.
This makes e2e metrics get these metrics directly by getting those pod
names without calling DeprecatedMightBeMasterNode().
Ready schedulable nodes are being inserted into an uninitialized string
set, causing an assignment to an entry in a nil map in the underlying data
structure. This initializes the string set before attempting to insert
nodes.
Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
There were nits in invokeStaleDummyVMTestWithStoragePolicy(), such as:
- The error message didn't contain a necessary space
- IsVMPresent() can return an error, but the error handling was missing
- IsVMPresent() returns true/false, but the code didn't use ExpectEqual(),
  hurting readability
This fixes those things.
The namespace parameter "ns" of getScheduledAndUnscheduledPods() is
always metav1.NamespaceAll.
This removes the parameter from the function for cleanup.
Previously, separate interfaces were defined for Reserve and Unreserve
plugins. However, in nearly all cases, a plugin that allocates a
resource using Reserve will likely want to register itself for Unreserve
as well in order to free the allocated resource at the end of a failed
scheduling/binding cycle. Having separate plugins for Reserve and
Unreserve also adds unnecessary config toil. To that end, this patch
aims to merge the two plugins into a single interface called a
ReservePlugin that requires implementing both the Reserve and Unreserve
methods.
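The merged shape looks roughly like this; CycleState and Status are local stand-ins for the scheduler framework's types, and the real signatures may differ between releases:

```go
package sketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
)

// Stand-ins for the framework's CycleState and Status (illustrative only).
type CycleState struct{}
type Status struct{}

// ReservePlugin merges the former Reserve and Unreserve interfaces, so a
// plugin that allocates in Reserve frees in Unreserve with no extra config.
type ReservePlugin interface {
	// Reserve allocates resources for the pod on the given node.
	Reserve(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) *Status
	// Unreserve frees them again; it must be idempotent.
	Unreserve(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string)
}
```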
WaitForStableCluster() checks that all pods run on worker nodes, and the
function used to refer to master nodes to skip checking control plane
pods.
GetMasterAndWorkerNodes() was used for getting master nodes, but the
implementation is not good because it uses DeprecatedMightBeMasterNode().
This makes WaitForStableCluster() refer to worker nodes directly to avoid
using GetMasterAndWorkerNodes().
Part of work to remove racist language, this name change also improves on the
semantics of this variable name as it was not actually a list of permissible
images but rather a list of images that are required for e2e_node tests that
are to be pre-pulled so that they are available prior to running e2e tests.
Worth noting that this list of images is "union merged" with another list when
setting up e2e_node tests, and as such there is the possibility for overlap.
e2e/framework is a place to keep common functions for e2e tests, and
it is not a place to keep e2e tests themselves. recreate_node.go is an e2e
test for node.
This moves recreate_node.go to e2e/node.
The node-kubelet-flaky e2e job that runs the
`Node Performance Testing [Serial] [Slow] [Flaky]` e2e tests has been
flaking because of inconsistencies in the cpu manager checkpoint file.
This seems to be caused because the checkpoint file is deleted (which is
what needs to happen in order to change the CPU manager policy which is
used for these e2e tests) right after the e2e tests asserts that a pod
does not exist anymore.
However, after a pod is deleted, the CPU manager may still be cleaning
up the resources used by the pod which may result in the checkpoint file
being created.
Whenever this happened, the kubelet would panic if we then try to
subsequently change the CPU manager policy to "static" from "none" or
vice versa (this is done 4 times in these tests).
Signed-off-by: alejandrox1 <alarcj137@gmail.com>
The behavior for exec'ing a backgrounded command is not specified
with CRI so modify the test to run the command directly instead
of using exec.
Signed-off-by: Mrunal Patel <mpatel@redhat.com>
deflake current e2e test
"should be able to preserve UDP traffic when server pod cycles for a
NodePort service" and reorganize the code in the e2e framework
Signed-off-by: Antonio Ojea <antonio.ojea.garcia@gmail.com>
These changes allow setting the FQDN as the hostname of pods
that set the new PodSpec field setHostnameAsFQDN to true. The new
PodSpec field was added in a related PR.
This is PART2 (last) of the changes to enable KEP #1797 and addresses #91036
Currently, the jessie-dnsutils image fails to build for arm64 arch with the following
error:
GPG error: http://archive.debian.org jessie Release: The following signatures were invalid: KEYEXPIRED 1587841717
We can bypass this issue by adding a --force-yes flag when installing the needed dnsutils.
Adds reviewers to the OWNERS files in the kubernetes/test/images folder.
The reviewers are added automatically, based on their contributions on
an image (>= 20% code churn).
Note that the code churn is taken into account for authors, and not committers.
Adds OWNERS files for: ipc-utils, node-perf, nonroot, regression-issue-74839,
resource-consumer, sample-device-plugin.
Windows does not support partially qualified domain names, which is why the test can fail.
Additionally, because nslookup may return 0 on Windows, even if the given DNS name was not
found, this issue was not observed until recently. We're now checking stderr as well.
The test spawns 10 pods with the same pod name, which contains multiple
containers with the same container name. Because of this, the test fails.
This commit addresses this issue.
e2e_node tests trigger OOM events on COS versions > 73-11636-0-0
possibly because of this change in the COS v.73-11636-0-0:
Made containerd run as a standalone systemd service
OOM killer usually kills cadvisor and e2e_node.test processes
causing node-kubelet-benchmark failures.
Decreasing amount of pods from 105 to 90 frees enough memory for
the test to succeed.
kubeconform was choking on a typo in the description field, so I fixed
the typo while adding friendlier logging to tell me which file was
invalid
I got curious why tests didn't catch this, and it turns out kubeconform
and the behavior tests use different codepaths to load and validate. So
I merged them together
The way gingko handles interrupts is:
- It starts running AfterSuite hooks in a separate goroutine (this includes cleanupAction hooks)
- Once AfterSuite hook is done executing it calls
os.Exit(1) on test suite.
So here is how cleanupFunc(), which runs via defer in a test, can be
interrupted (see the sketch after this list):
- cleanupFunc starts running via defer (or AfterEach hook) but first
thing that function does is to remove cleanupHandle from
framework.RemoveCleanupAction.
- Test suite receives interrupt from user and AfterSuite block
starts executing
- remember that while cleanupFunc is running in goroutine#1,
AfterSuite is running concurrently in goroutine#2.
- The AfterSuite hook has a bunch of CleanupActions it needs to run, which
were registered via framework.AddCleanupAction(cleanupFunc) but
once cleanupFunc starts executing via defer in the test, it will
remove the cleanupHandle from framework's aftersuite hooks.
- So if AfterSuite did not have anything to run (because
those actions were removed via framework.RemoveCleanupAction),
then it will simply go to the last framework.AfterEach action and call os.Exit(1)
- So if os.Exit(1) is called before cleanupFunc has a chance to finish in defer, it will not complete.
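The racy pattern looks roughly like this, using the framework's cleanup-hook API of that era (the surrounding function is illustrative):

```go
package sketch

import "k8s.io/kubernetes/test/e2e/framework"

// installDriver returns a cleanupFunc that is also registered with the
// suite-level hooks, per the pattern described above.
func installDriver(f *framework.Framework) func() {
	cleanupFunc := func() {
		// delete driver manifests via f.ClientSet ...
	}
	cleanupHandle := framework.AddCleanupAction(cleanupFunc)
	return func() {
		// Removing the handle first leaves AfterSuite with nothing to run,
		// so an interrupt can reach os.Exit(1) before cleanupFunc finishes.
		framework.RemoveCleanupAction(cleanupHandle)
		cleanupFunc()
	}
}
```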
The google cloud builder job is launched without the required Windows Image Builder nodes
certificates that are needed for authentication when building the Windows container images.
Adds a step in test/images/cloudbuild.yaml that fetches a secret containing the certificates.
This change removes the `coreCount` variable and associated counting
logic from the cluster size autoscaling test. This variable was used
during the 1.9 release era, but the tests that used it were removed
before the 1.10 release. Please see the referenced commits[0][1] for
more information.
[0]
c8b807837a
[1]
fd738945b1
The e2e/lifecycle package is owned by SIG CL, although maybe this
should be moved to e2e/auth at some point.
- copy the OWNERS from /cmd/kubeadm (minus the area/kubeadm label)
- remove the OWNERS file in /bootstrap letting the parent OWNERS file
manage this sub-package.
ingress: use new serviceBackend split
ingress: remove all v1beta1 restrictions on creation
This change removes creation and update restrictions enforced by
k8s 1.18 for not allowing resource backends.
Paths are no longer
required to be valid regex and a PathType is now user-specified
and no longer defaulted.
Also remove all TODOs in staging/net/v1 types
Signed-off-by: Christopher M. Luciano <cmluciano@us.ibm.com>
Lowering the amount of cpu allocated to this workload will set the
resources allocated to be similar to the other npb and tf workloads in
these tests.
This will also allow running all three workloads in an n1-standard-12 gcp
instance - which has 16 cpus and 60 GB.
Signed-off-by: alejandrox1 <alarcj137@gmail.com>
For the beta server-side dry-run feature, `kubectl apply` provided the
`--server-dry-run` flag.
As of 1.18, this flag was deprecated and marked to be removed after 1
release.
- Remove the ServerDryRun field and delegate it entirely to the resource.Helper
- Use resource.Helper for deletions (as in `kubectl apply --force`)
instead of using the pruner's method that uses a dynamic client
- Reduce the resource.Helpers and times we check for server-side dry-run
in apply
the test "executing a command with run and attach without stdin"
is inherently flaky, there are several discussion but seems that
it requires changing the way the kubectl run and attach works.
The test fails if we are not able to attach before the container prints
"stdin closed", but hasn't exited yet.
Because the race seems difficult to solve, we can wait 5 seconds
before printing to give time to kubectl to attach to the container.
- use in-cache Pod instead of real-time Pod (by calling API server) to mark it as unschedulable
in internal schedulingQ
- remove the backoff logic as now we don't call API server
- the whole logic is changed to a synchronous call
We have little coverage around node addition and removal. Since distinct event handlers interact, it is important to cover this in integration tests.
Signed-off-by: Aldo Culquicondor <acondor@google.com>
ensure that when a pod servicing UDP traffic is deleted the conntrack entries
are cleaned up and another backend can pick up the traffic with minimal
interruption
When using NodePort services and long-running connections, stale conntrack
entries left after pod deletion can halt the flow of traffic. Add a test case
to check that conntrack entries are cleaned up.
In case two or more controllers share the informers created through InitTestScheduler,
it's not safe to start the informers until all controllers have set their informer
indexers. Otherwise, some controllers might fail to register their indexers
in time. Thus, it's the responsibility of each consumer to make sure all informers
are started after all controllers have had time to get initialized.
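A sketch of the ordering requirement with client-go informers; the indexer itself is illustrative:

```go
package sketch

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// startInformers registers every controller's indexers first; Start must
// come last, because indexers cannot be added to a started informer.
func startInformers(client kubernetes.Interface, stopCh <-chan struct{}) error {
	factory := informers.NewSharedInformerFactory(client, 0)
	podInformer := factory.Core().V1().Pods().Informer()
	err := podInformer.AddIndexers(cache.Indexers{
		"nodeName": func(obj interface{}) ([]string, error) {
			pod, ok := obj.(*v1.Pod)
			if !ok {
				return nil, fmt.Errorf("unexpected object %T", obj)
			}
			return []string{pod.Spec.NodeName}, nil
		},
	})
	if err != nil {
		return err
	}
	factory.Start(stopCh) // only after all indexers are registered
	return nil
}
```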
ginkgo has a weird bug: AfterEach does not get called when the
testsuite exits with a certain kind of interrupt (Ctrl-C, for example).
More info - https://github.com/onsi/ginkgo/issues/222
We work around this issue in Kubernetes by adding a special hook into the
AfterSuite call, but AfterSuite can not be used to perform certain
kinds of cleanup because it can race with the AfterEach hook, and the
framework.AfterEach hook will set framework.ClientSet to nil.
This presents a problem in cleaning up the CSI driver and test pods. This
PR removes cleanup of the driver manifest via CleanupAction because that
is not safe and racy (f.ClientSet may disappear!) and makes
AfterSuite hooks run in an ordered fashion
We should read and verify the data before actually closing the
connection to avoid connection based-races within the test.
Signed-off-by: Sascha Grunert <sgrunert@suse.com>
Removes the fatal error from getIP and moves it to
retry loop so that application will not immediately
crash on failure.
Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
This updates the error messages when registering a
node to be more explicit about what error occurred
and how long it will wait to retry.
Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
Currently the guestbook application will fail if unable
to resolve TCP address on first attempt. If pod networking
is not setup when the application starts then it will be
unable to resolve, leading to frequent failures. This moves
the address resolution into the retry block so it will try
again if unsuccessful on first attempt.
Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
The current /exit method is not sufficient to test graceful shutdown
behaviors within Kube that allow services to remain available during
rolling restarts. Add support for `wait=DURATION` and
`timeout=DURATION` to the exit handler and wire that to the Go http
server's graceful termination.
With these methods netexec can be used in a pod to simulate graceful
shutdown by adding a preStop handler that hits the exit endpoint with
a timeout and wait period.
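In sketch form, wiring the new parameters to Go's graceful termination (the handler shape is illustrative, not netexec's exact code):

```go
package sketch

import (
	"context"
	"net/http"
	"os"
	"time"
)

// exitHandler keeps serving for `wait`, then drains in-flight requests
// for up to `timeout` before the process exits.
func exitHandler(srv *http.Server) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		wait, _ := time.ParseDuration(r.URL.Query().Get("wait"))
		timeout, _ := time.ParseDuration(r.URL.Query().Get("timeout"))
		go func() {
			time.Sleep(wait)
			ctx, cancel := context.WithTimeout(context.Background(), timeout)
			defer cancel()
			srv.Shutdown(ctx) // stop accepting; let in-flight requests finish
			os.Exit(0)
		}()
	}
}
```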
kubelet sometimes calls NodeStageVolume and NodePublishVolume too
often, which breaks this test and leads to flakiness. The test isn't
about that, so we can relax the checking and it still covers what it
was meant to cover.
collectPodsAndNetworkPolicies() is called to collect diagnostics
after a failure. Previously, if it encountered a failure in getting
the logs it would call Failf(), discarding the rest of the diagnostics
immediately.
Following changes in #87730, Kubelet is directly calling hcsshim to gather stats.
However, unlike the `docker stats` API that was used before, hcsshim does not
keep information about exited containers.
When the Kubelet lists containers (`docker_container.go:ListContainers()`),
it sets `All: true`, retrieving non-running containers.
When docker stats is called with such container id, it'll return a valid JSON
with all values set to 0. The non-running containers are filtered later on in the process.
When the hcsshim is called with such container id, it'll return an error, effectively
stopping the stats retrieval for all containers.
"Volumes GlusterFS should be mountable" is a bit flaky in a downstream CI.
This PR makes the "should be mountable" test on par with the other GlusterFS
tests (in_tree.go: DeleteVolume())
commit 43c56eb403 introduced a change
where CPUAccounting and TasksAccounting are enabled for
the systemd service.
It causes a regression on RHEL 7.8, where systemd-run doesn't allow
setting TasksAccounting.
Since Delegate= already enables all the controllers, it is superfluous
to specify them.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
In caf0d1d61874a2c8687b7deb773eca30ddaee5b6 we documented a policy to
ensure that conformance tests should not rely on the existence or use of
kubelet apis directly. So based on that, we should drop conformance for
the two tests here that use the "/logs" endpoint directly.
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
The test "should not change the subpath mount on a container restart if the environment variable changes"
creates a pod with the liveness probe: cat /volume_mount/test.log. The test then
deletes that file, which causes the probe to fail and the container to be restarted.
After which it recreates the file by exec-ing into the pod, but there is a chance
that the container was not created yet, or it did not start yet.
This commit adds a few retries to the exec command.
this is mainly to ensure integration tests (which all end in _test)
are properly bossed around for their imports
I had to adjust some of the _test files to adhere to existing
reverse_rules specified elsewhere
specifically:
- cmd/kubeadm/.import-restrictions
- we don't need to explicitly allow k8s.io repos (external or published)
- rm pkg/controller/.import-restrictions
- pkg/client/unversioned was removed in 59042
- pkg/kubectl/.import-restrictions
- pkg/printers is no longer used
- pkg/api was masking all of the pkg/apis prefixes
- rm staging/src/k8s.io/code-generator/cmd/lister-gen/.import-restrictions
- noop / empty file
- test/e2e/framework/.import-restrictions
- we don't need to explicitly allow k8s.io repos (external or published)
yaml has comments, so we can explain why we have certain rules or
certain prefixes
for those files that weren't already commented yaml, I converted them to
yaml and took a best guess at comments based on the PRs that introduced
or updated them
When a test pattern or storage class uses late binding, the cleanup
code didn't know about the PV that may have been created for the PVC
since setting it up and thus then also didn't wait for PV deletion.
This is problematic for test isolation because the next test was
allowed to be started before fully cleaning up. Worse, it the driver
gets removed after the test, the volume might never get deleted.
'docker pull' is a time-consuming operation. It makes sense to check
if the image exists locally before pulling it from a registry.
Check if the image exists by running 'docker inspect', and only pull
if it doesn't exist.
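A sketch of the check, shelling out as the test utilities do (error handling simplified):

```go
package sketch

import (
	"fmt"
	"os/exec"
)

// pullIfAbsent skips the expensive 'docker pull' when the image is
// already present locally, detected via 'docker inspect'.
func pullIfAbsent(image string) error {
	if exec.Command("docker", "inspect", "--type=image", image).Run() == nil {
		return nil // image exists locally
	}
	if out, err := exec.Command("docker", "pull", image).CombinedOutput(); err != nil {
		return fmt.Errorf("pulling %s: %v: %s", image, err, out)
	}
	return nil
}
```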
The service allocator is used to allocate ip addresses for the
Service IP allocator and NodePorts for the Service NodePort
allocator. It uses a bitmap backed by etcd to store the allocation
and tries to allocate the resources directly from the local memory
instead from etcd, that can cause issues in environment with
high concurrency.
It may happen, in deployments with multiple apiservers, that the
resource allocation information is out of sync; this is more
noticeable with NodePorts. For example:
1. apiserver A create a service with NodePort X
2. apiserver B deletes the service
3. apiserver A creates the service again
If the allocation data of apiserver A wasn't refreshed with the
deletion done by apiserver B, apiserver A fails the allocation because
the data is out of sync. The Repair loops solve the problem later,
but there are some use cases that require improving the concurrency
in the allocation logic.
We can try to not do the Allocation and Release operations locally,
and try instead to check if the local data is up to date with etcd,
and operate over the most recent version of the data.
- add ./hack/tools/go.mod, this makes ./hack/tools a distinct module
- hack/tools/tools.go underscore-imports bazel related tools; over time we
can add others.
- hack/*.sh scripts will cd to hack/tools and go install tools from there
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
During a parallel test run these tests were observed to cause a
preemption of another test:
```
Apr 19 16:59:06.749: INFO: At 2020-04-19 16:58:52 +0000 UTC - event for pod-init-b6fbd440-dbc2-454a-b31a-ce44266298d1: {default-scheduler } Scheduled: Successfully assigned e2e-init-container-7691/pod-init-b6fbd440-dbc2-454a-b31a-ce44266298d1 to ip-10-0-148-234.us-west-2.compute.internal
Apr 19 16:59:06.750: INFO: At 2020-04-19 16:58:54 +0000 UTC - event for pod-init-b6fbd440-dbc2-454a-b31a-ce44266298d1: {default-scheduler } Preempted: Preempted by e2e-resourcequota-priorityclass-8850/testpod-pclass9 on node ip-10-0-148-234.us-west-2.compute.internal
```
These tests have no need to actually land on a node to validate resource quota, so we can set an impossible scheduling condition (see the sketch below). Hopefully we don't have tests that too broadly check for impossible scheduling conditions.
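For instance, an unsatisfiable node selector keeps such pods Pending forever;
the label key below is made up for illustration:
```go
package quota

import v1 "k8s.io/api/core/v1"

// makeUnschedulable gives the pod a nodeSelector no node can satisfy; quota
// is still charged at admission time, so the pod never needs to land anywhere.
func makeUnschedulable(pod *v1.Pod) {
	pod.Spec.NodeSelector = map[string]string{
		"x-test.k8s.io/unsatisfiable": "not-schedulable",
	}
}
```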
The admission cache may take longer to see both ingress classes
than it takes to create the ingress. We must loop until we see
the appropriate error, cleaning up after ourselves as we go.
Don't set a connection deadline for reading, because the read operation would
fail if no data is received before the deadline, and the connection would not
stay in the CLOSE_WAIT state.
Copy csi-hostpath driver manifests from
kubernetes-csi/csi-driver-host-path. It bumps version of all images to the
release shipped along Kubernetes 1.18.
As seen in one case (https://github.com/intel/pmem-csi/issues/587), a
pod can reach the "not running" state although its ephemeral volumes
are still being torn down by kubelet and the CSI driver. What happens
then is that the test returns too early, and even deleting the
namespace (and thus the pod) succeeds before the NodeUnpublishVolume call
really finishes.
To avoid this, StopPod now waits for the pod to really disappear.
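A minimal sketch of that wait using the e2e framework's pod helpers (the exact
helper name and timeout are assumptions):
```go
package storage

import (
	"time"

	"k8s.io/kubernetes/test/e2e/framework"
	e2epod "k8s.io/kubernetes/test/e2e/framework/pod"
)

// waitGone waits until the pod object is actually gone, not merely
// "not running", so volume teardown has really completed.
func waitGone(f *framework.Framework, podName string) {
	err := e2epod.WaitForPodNotFoundInNamespace(f.ClientSet, podName, f.Namespace.Name, 5*time.Minute)
	framework.ExpectNoError(err, "while waiting for pod to disappear")
}
```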
The agnhost image used for testing has a `netexec` path which supports
two new flags, `--tls-cert-file` and `--tls-private-key-file`. If the
former is provided, the HTTP server will be upgraded to HTTPS, using the
certificate (and private key) provided.
By default, there are keys already mounted into the container at
`/localhost.crt` and `/localhost.key`, which contain PEM-encoded TLS
certs with IP SANs for `127.0.0.1` and `[::1]`.
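An illustrative invocation using those baked-in defaults:
```
agnhost netexec --tls-cert-file=/localhost.crt --tls-private-key-file=/localhost.key
```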
This adds 2 new tests covering EndpointSlices, including new coverage of
the self referential Endpoints and EndpointSlices that need to be
created by the API Server and the lifecycle of EndpointSlices from
creation to deletion. This also removes the [feature] indicator from the
name to ensure that this test will run more often now that it is enabled
by default.
Adds reviewers to the OWNERS files in the kubernetes/test/images folder.
The reviewers are added automatically, based on their contributions on
an image (>= 20% code churn).
Note that the code churn is taken into account for authors, and not committers.
Adds OWNERS files for: cuda-vector-add, nonewprivs, pets, redis, volume.
Adds reviewers to the OWNERS files in the kubernetes/test/images folder.
The reviewers are added automatically, based on their contributions on
an image (>= 20% code churn).
Note that the code churn is taken into account for authors, and not committers.
Adds OWNERS files for: apparmor-loader, echoserver, jessie-dnsutils, metadata-concealment,
sample-apiserver.
The build times are a bit high for the image builder (~50 minutes), and they will grow a bit more
when Windows support is added to the other test images. This commit changes the
machineType to N1_HIGHCPU_8.
Reenables Windows test image building. Added DOCKER_CERT_BASE_PATH (default value: $HOME),
which will contain the path where the certificates needed for Remote Docker Connection can
be found.
If a REMOTE_DOCKER_URL was not set for a particular OS version, exclude that image from the
manifest list. This fixes an issue where, if REMOTE_DOCKER_URL was not set for Windows Server 1909,
the Windows images were completely excluded from the manifest list, including those for Windows Server 1809
and 1903 which could have been built and pushed.
Sets "test-webserver" as the default CMD for kitten and nautilus. Since they are now based on
agnhost, they should be set to run test-webserver to maintain previous behaviour.
Bumps the agnhost version to 2.13, as 2.12 has already been promoted. 2.13 will contain
Windows support.
Adds Windows support for the kitten and nautilus images, so they can be promoted together
with agnhost (they were not previously promoted).
Adds OWNERS files to: agnhost, busybox, kitten, nautilus.
The timeout for the two loops inside the test itself are now bounded
by an upper limit for the duration of the entire test instead of
having their own, rather arbitrary timeouts.
The functionality included in the e2e/manifests is useful for writing
e2e tests and will be a good addition to the test framework as a
sub-package.
Signed-off-by: alejandrox1 <alarcj137@gmail.com>
Before https://github.com/kubernetes/kubernetes/pull/83084, `kubectl
apply --prune` can prune resources in all namespaces specified in
config files. After that PR got merged, only a single namespace is
considered for pruning. That is OK if the namespace is explicitly specified
by the --namespace option, but the PR makes kubectl use the default
namespace (or the one from kubeconfig) when not overridden by a command line flag.
That breaks the existing usage of `kubectl apply --prune` without
the --namespace option. If --namespace is not used, there is no error,
and no one notices this issue unless they actually check that pruning
happens. This issue also prevents resources in multiple namespaces in
a config file from being pruned.
kubectl 1.16 does not have this bug. Let's see the difference between
kubectl 1.16 and kubectl 1.17. Suppose the following config file:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: foo
  namespace: a
  labels:
    pl: foo
data:
  foo: bar
---
apiVersion: v1
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: bar
  namespace: a
  labels:
    pl: foo
data:
  foo: bar
```
Apply it with `kubectl apply -f file`. Then comment out ConfigMap foo
in this file. kubectl 1.16 prunes ConfigMap foo with the following
command:
$ kubectl-1.16 apply -f file -l pl=foo --prune
configmap/bar configured
configmap/foo pruned
But kubectl 1.17 does not prune ConfigMap foo with the same command:
$ kubectl-1.17 apply -f file -l pl=foo --prune
configmap/bar configured
With this patch, kubectl once again can prune the resource as before.
/cluster/kubeadm.sh is used to find the kubeadm binary.
This file is legacy and is removed.
Remove /test/cmd/kubeadm.sh. This file contains a function that is used
to build kubeadm and invoke "make test". Move the function contents
to hack/make-rules/test-cmd.sh.
Stop sourcing /test/cmd/kubeadm.sh in /test/cmd/legacy-script.sh.
Also remove the --kubeadm-path invocation, as this can be handled
with an environment variable directly.
The "error waiting for expected CSI calls" is redundant because it's
immediately followed by checking that error with:
framework.ExpectNoError(err, "while waiting for all CSI calls")
The mock driver gets instructed to return a ResourceExhausted error
for the first CreateVolume invocation via the storage class
parameters.
How this should be handled depends on the situation: for normal
volumes, we just want the external provisioner to retry. For late binding,
we want to reschedule the pod. It also depends on topology support.
The code became obsolete with the introduction of parseMockLogs
because that will retrieve the log itself. For debugging of a running
test the normal pod output logging is sufficient.
parseMockLogs is called potentially multiple times while waiting for
output. Dumping all CSI calls each time is quite verbose and
repetitive. To verify what the driver has done already, the normal
capturing of the container log can be used instead:
csi-mockplugin-0/mock@127.0.0.1: gRPCCall: {"Method":"/csi.v1.Node/NodePublishVolume","Request"...
As seen in some test
runs (https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/89041),
retrieving output can fail with "the server rejected our request for
an unknown reason (get pods csi-mockplugin-0)".
If this is truly an intermittent error, then the existing retry logic in
the callers can deal with it.
Especially related to "uncertain" global mounts. A large refactoring of the CSI
mock tests was necessary:
- to be able to script the driver to return errors as required by the test
- to parse the CSI driver logs to check kubelet called the right CSI calls
The e2e TCP CLOSE_WAIT test has to create a server pod and then, from
a client, create a connection without notifying the server
when closing it, so the connection stays in the CLOSE_WAIT state until it
times out.
The current test uses a simple timeout to wait for the server pod
to be ready; it's better to use WaitForPodsReady to wait until
the pod is available, to avoid problems in busy environments like
the CI.
It also deletes the pods once the tests finish to avoid leaking
pods.
The original logic was that dumping can stop (for example, due to
losing the connection to the apiserver) and then will start again as
long as the container exists. Duplicating output on restarts
is better than skipping output that might not have been dumped yet.
But that logic also dumped the output of terminated containers
multiple times:
- logging is started, dumps all output and stops because the
container has terminated
- next check finds the container again, sees no active logger,
repeats
This wasn't a problem for short-lived logging in a custom
namespace (the way how it is done for CSI drivers in Kubernetes E2E),
but other testsuites (like the one from PMEM-CSI) keep logging running
for the entire test suite duration: there duplicate output became a
problem when adding driver redeployment as part of the suite's run.
To avoid duplicated output for terminated containers, which containers
have been handled is now stored permanently. For terminated containers,
restarting of dumping is prevented. This comes with the risk that if
the previous dumping ended before capturing all output, some output
will get lost.
Marking the start and stop of the log is also useful when streaming
to a single writer, and thus now gets enabled in that case.
There were several sshPort values in e2e test packages because
we've migrated code from the e2e framework by copying and pasting.
This adds a common SSHPort in the e2essh package to reduce such duplicated
code.
Conformance tests must not rely on the kubelet API in order to
pass. In this case, I think it's unnecessary to verify that a
kubelet observes the deletion within gracePeriod seconds. The
remaining checks in this test verify that pod deletion happens,
and that the pod is removed.
Conformance tests must not rely on the kubelet API in order to
pass. SchedulerPredicates tests attempt to use the kubelet API
in their BeforeEach, some of which are tagged as Conformance.
Is there a compelling reason to use the kubelet's view of pods
for a given node instead of the apiserver's view of the pods?
we print yaml, so you can use yaml tools like `yq`:
```
e2e.test --list-conformance-tests | yq r - --collect *.testname
```
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
It turns out that the e2e test was not using the timeout used to
hold the CLOSE_WAIT status, hence the test was flaky depending
on how fast it checked the conntrack table.
This PR replaces the dependency on SSH with a pod that checks the conntrack
entries on the host in a loop, to make the test more robust
and reduce the flakiness due to race conditions and/or SSH issues.
It also fixes a bug in grepping for the conntrack entry, where
the error was swallowed if a conntrack entry wasn't found.
Integration tests imported e2e test code, and the dependency had two drawbacks:
- Hard to move test/e2e/framework into staging (#74352)
- Need to always run integration tests even if PRs only change e2e test code
This enables import-boss check for blocking such dependency.
Sometimes the pod has already been cleaned up by the time the test
tried to grab the logs.
Mar 27 16:19:38.066: INFO: Waiting for client-a-jt4tf to complete.
Mar 27 16:19:38.066: INFO: Waiting up to 5m0s for pod "client-a-jt4tf" in namespace "e2e-network-policy-c-9007" to be "success or failure"
Mar 27 16:19:38.072: INFO: Pod "client-a-jt4tf": Phase="Pending", Reason="", readiness=false. Elapsed: 6.270302ms
Mar 27 16:19:40.078: INFO: Pod "client-a-jt4tf": Phase="Pending", Reason="", readiness=false. Elapsed: 2.01233019s
Mar 27 16:19:42.086: INFO: Pod "client-a-jt4tf": Phase="Succeeded", Reason="", readiness=false. Elapsed: 4.020186873s
STEP: Saw pod success
Mar 27 16:19:42.086: INFO: Pod "client-a-jt4tf" satisfied condition "success or failure"
Mar 27 16:19:42.093: FAIL: Error getting container logs: the server could not find the requested resource (get pods client-a-jt4tf)
Full Stack Trace
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/network.checkNoConnectivity(0xc00104adc0, 0xc0016b82c0, 0xc001666400, 0xc000c32000)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/network/network_policy.go:1457 +0x2a0
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/network.testCannotConnect(0xc00104adc0, 0xc0016b82c0, 0x55587e9, 0x8, 0xc000c32000, 0x50)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/network/network_policy.go:1406 +0x1fc
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/network.glob..func13.2.7()
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/network/network_policy.go:285 +0x883
github.com/openshift/origin/pkg/test/ginkgo.(*TestOptions).Run(0xc001e47830, 0xc001e50b70, 0x1, 0x1, 0x0, 0x0)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/pkg/test/ginkgo/cmd_runtest.go:59 +0x41f
main.newRunTestCommand.func1(0xc00121b900, 0xc001e50b70, 0x1, 0x1, 0x0, 0x0)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:238 +0x15d
github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).execute(0xc00121b900, 0xc001e50b30, 0x1, 0x1, 0xc00121b900, 0xc001e50b30)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:826 +0x460
github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc00121b180, 0x0, 0x60d2d00, 0x9887ec8)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:914 +0x2fb
github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).Execute(...)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:864
main.main.func1(0xc00121b180, 0x0, 0x0)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:59 +0x9c
main.main()
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:60 +0x341
STEP: Cleaning up the pod client-a-jt4tf
STEP: Cleaning up the policy.
RestartControllerManager() is a kube-controller-manager-specific function,
and it is better to separate it out as a subpackage of the e2e
test framework.
In addition, the function made an invalid dependency on e2essh.
So this separates the function into the e2ekubesystem subpackage.
- Move utilities and constants out so that both of them are able
to run independently.
- Rename the legacy test so that it can eventually be deleted when the
perf dash changes are done
When deleting fails, the tests should be considered as failed,
too. Ignoring the error caused a wrong return code in the CSI mock
driver to go unnoticed (see
https://github.com/kubernetes-csi/csi-test/pull/250). The v3.1.0
release of the CSI mock driver fixes that.
The function is called from e2e/network test only, so this moves
the function into the test for reducing e2e/framework/util.go code
and removing invalid dependency on e2e test framework.
The function is for persistent volumes, and there is no reason for it
to stay in the core test framework. So this moves the
function into the e2epv package to reduce e2e/framework/util.go
code.
Since 4e7c2f638d the function has been
called from the storage vsphere e2e test only. This moves the function
into the test file for
- Reducing test/e2e/framework/util.go, which is one of the huge files
- Removing an invalid dependency on the e2e test framework
- Removing an unnecessary TODO
for removing an invalid dependency from the e2e core framework on the e2essh
subpackage and reducing test/e2e/framework/util.go, which is
one of the huge files today.
WaitForPod*() are just wrapper functions for the e2epod package, and they
made an invalid dependency on a sub e2e framework from the core framework.
So this replaces WaitForPodTerminated() with the e2epod function, as illustrated below.
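An illustrative before/after of a call site (names vary per test):
```go
// Before: core framework wrapper, which dragged e2epod into framework/util.go.
err := framework.WaitForPodTerminated(c, podName, reason)

// After: call the subpackage directly from the test.
err = e2epod.WaitForPodTerminatedInNamespace(c, podName, reason, ns)
```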
These wrappers made an invalid dependency on a sub e2e framework from the core framework,
so we can use e2epod.WaitTimeoutForPodReadyInNamespace to remove the invalid dependency.
The main purpose of this PR is to remove the framework core package's dependency on the pod subpackage.
WaitForPod*() are just wrapper functions for the e2epod package, and they
made an invalid dependency on a sub e2e framework from the core framework.
So this replaces WaitForPodNoLongerRunning() with the e2epod function.
When kubelet is restarted, it will now remove the resources for huge
page sizes no longer supported. This is required when:
- node disables huge pages
- changing the default huge page size in older versions of linux
(because it will then only support the newly set default).
- software updates that change what sizes are supported (e.g. by changing
boot parameters).
The e2e framework package podlogs is used in e2e/storage/testsuites
only. In addition, we concluded that we should have a single e2e framework
package for pods, without podlogs. So this moves podlogs into
e2e/storage/podlogs for the e2e storage tests.
Windows test "[sig-windows] [Feature:Windows] Cpu Resources Container
limits should not be exceeded after waiting 2 minutes" should be run
serially to prevent flakyness.
WaitForPod*() are just wrapper functions for the e2epod package, and they
made an invalid dependency on a sub e2e framework from the core framework.
So this replaces WaitForPodRunning() with the e2epod function.
Adds splitOsArch function to image-util.sh, which makes the script DRY-er.
When building a Windows test image, if REMOTE_DOCKER_URL is not set, skip the rest of the
building process for that image, which will save some time (no need to build binaries).
If a REMOTE_DOCKER_URL was not set for a particular OS version, exclude that image from the
manifest list. This fixes an issue where, if REMOTE_DOCKER_URL was not set for Windows Server 1909,
the Windows images were completely excluded from the manifest list, including those for Windows Server 1809
and 1903 which could have been built and pushed.
Sets "test-webserver" as the default CMD for kitten and nautilus. Since they are now based on
agnhost, they should be set to run test-webserver to maintain previous behaviour.
So multiple instances of kube-apiserver can bind on the same address and
port, to provide seamless upgrades.
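On Linux, this kind of shared binding is done with SO_REUSEPORT; a minimal
sketch of such a listener (an assumption about the mechanism, not the actual
kube-apiserver code):
```go
package main

import (
	"context"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// reusePortListener binds with SO_REUSEPORT so a second process can bind the
// same address/port during a seamless handover.
func reusePortListener(addr string) (net.Listener, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var opErr error
			if err := c.Control(func(fd uintptr) {
				opErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return opErr
		},
	}
	return lc.Listen(context.Background(), "tcp", addr)
}
```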
Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
This change removes support for basic authn in v1.19 via the
--basic-auth-file flag. This functionality was deprecated in v1.16
in response to ATR-K8S-002: Non-constant time password comparison.
Similar functionality is available via the --token-auth-file flag
for development purposes.
Signed-off-by: Monis Khan <mok@vmware.com>
Some e2e tests depend on the controller-manager to expose metrics
on the path /metrics.
It may happen that when the test runs, the pod is not available or the
URL not ready, causing the test to fail.
Previously, the test waited until the pod was running, but we
need to wait until the /metrics URL is ready.
The MetricsGrabber may use the controller-manager pod
to gather metrics; however, it doesn't wait until
the pod is ready to serve, failing the test if that is the
case.
We wait until the controller-manager pod is running
before trying to get metrics from it.
There were framework.ExpectNoError(fmt.Errorf(..)) calls which don't
actually check a value; they just raise the
specified error messages. These usages of framework.ExpectNoError()
seemed a little tricky, so this replaces them with corresponding check
functions for readability (see the example below).
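An illustrative before/after of one such call site (the message is made up):
```go
// Before: no value is actually being checked; this always fails.
framework.ExpectNoError(fmt.Errorf("pod %q never became ready", podName))

// After: state the failure directly.
framework.Failf("pod %q never became ready", podName)
```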
The configuration file was designed as a YAML file on purpose,
to easily extend the test cases without a need to modify
the testing binary. It's also possible to extend the configuration
itself to enrich individual test cases.
The kubelet can race when a pod is deleted and report that a container succeeded
when it instead failed, and thus the pod is reported as succeeded. Create an e2e
test that demonstrates this failure.
After moving Permit() to the scheduling cycle, the test PermitPlugin should
no longer wait inside Permit() for another pod to enter Permit() and become the waiting pod.
In the past this was a way to make the test work regardless of the order in
which pods enter Permit(), but now only one Permit() can be executed at
any given moment, and waiting for another pod to enter Permit() inside
Permit() leads to timeouts.
With this change, the waitAndRejectPermit and waitAndAllowPermit flags make the first
pod to enter Permit() the waiting pod, and make the second pod to enter Permit()
either reject or allow the waiting pod.
Mentioned in #88469
Extends agnhost with the capability to validate a mounted token against
the API server's OIDC endpoints.
Co-authored-by: Michael Taufen <mtaufen@google.com>
Close outbound connections when using a cert callback and certificates rotate. This means that we won't get into a situation where we have open TLS connections using expired certs, which would get unauthorized errors at the apiserver.
Attempt to retrieve a new certificate if open connections are near expiry, to prevent the case where the cert expires but we haven't yet opened a new TLS connection and so GetClientCertificate hasn't been called.
Move certificate rotation logic to a separate function
Rely on generic transport approach to handle closing TLS client connections in exec plugin; no need to use a custom dialer as this is now the default behaviour of the transport when faced with a cert callback. As a result of handling this case, it is now safe to apply the transport approach even in cases where there is a custom Dialer (this will not affect kubelet connrotation behaviour, because that uses a custom transport, not just a dialer).
Check expiry of the full TLS certificate chain that will be presented, not only the leaf. Only do this check when the certificate actually rotates. Start the certificate as a zero value, not nil, so that we don't see a rotation when there is in fact no client certificate
Drain the timer when we first initialize it, to prevent immediate rotation. Additionally, calling Stop() on the timer isn't necessary.
Don't close connections on the first 'rotation'
Remove RotateCertFromDisk and RotateClientCertFromDisk flags.
Instead simply default to rotating certificates from disk whenever files are exclusively provided.
Add integration test for client certificate rotation
Simplify logic; rotate every 5 mins
Instead of trying to be clever and checking for rotation just before an
expiry, let's match the logic of the new apiserver cert rotation logic
as much as possible. We write a controller that checks for rotation
every 5 mins. We also check on every new connection.
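A minimal sketch of that 5-minute rotation loop, assuming certificates are
read from files and open connections are closed on change (all names are
illustrative, not the actual client-go implementation):
```go
package main

import (
	"bytes"
	"crypto/tls"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// startRotation polls the cert/key files every 5 minutes and calls closeAll
// when the certificate changes, forcing new TLS handshakes.
func startRotation(certFile, keyFile string, closeAll func(), stopCh <-chan struct{}) {
	var current tls.Certificate // zero value, not nil, so the first load isn't a "rotation"
	go wait.Until(func() {
		cert, err := tls.LoadX509KeyPair(certFile, keyFile)
		if err != nil {
			return // keep the old cert and retry on the next tick
		}
		rotated := len(current.Certificate) > 0 &&
			!bytes.Equal(cert.Certificate[0], current.Certificate[0])
		current = cert
		if rotated {
			closeAll() // close open connections so new ones pick up the new cert
		}
	}, 5*time.Minute, stopCh)
}
```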
Respond to review
Fix kubelet certificate rotation logic
The kubelet rotation logic seems to be broken because it expects its
cert files to end up as cert data whereas in fact they end up as a
callback. We should just call the tlsConfig GetCertificate callback
as this obtains a current cert even in cases where a static cert is
provided, and check that for validity.
Later on we can refactor all of the kubelet logic so that all it does is
write files to disk, and the cert rotation work does the rest.
Only read certificates once a second at most
Respond to review
1) Don't blat the cert file names
2) Make it more obvious where we have a neverstop
3) Naming
4) Verbosity
Avoid cache busting
Use filenames as cache keys when rotation is enabled, and add the
rotation later in the creation of the transport.
Caller should start the rotating dialer
Add continuous request rotation test
Rebase: use context in List/Watch
Swap goroutine around
Retry GETs on net.IsProbableEOF
Refactor certRotatingDialer
For simplicity, don't affect cert callbacks
To reduce change surface, lets not try to handle the case of a changing
GetCert callback in this PR. Reverting this commit should be sufficient
to handle that case in a later PR.
This PR will focus only on rotating certificate and key files.
Therefore, we don't need to modify the exec auth plugin.
Fix copyright year
Quite a few images are only used a few times in a few tests. Thus,
the images are being centralized into the agnhost image, reducing
the number of images that have to be pulled and used.
This PR replaces the usage of the following images with agnhost:
- mounttest
- mounttest-user
Additionally, removes the usage of the mounttest-user image and removes
it from kubernetes/test/images. RunAsUser is set instead of using that image.
Most of these could have been refactored automatically, but the result would
have been uglier. The unsophisticated tooling left lots of unnecessary
struct -> pointer -> struct transitions.
This is gross but because NewDeleteOptions is used by various parts of
storage that still pass around pointers, the return type can't be
changed without significant refactoring within the apiserver. I think
this would be good to cleanup, but I want to minimize apiserver side
changes as much as possible in the client signature refactor.
The condition was not part of the message and so would not
match:
OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"/var/lib/kubelet/pods/128aea1f-bde3-43d5-8b5f-dd86b9a5ef33/volumes/kubernetes.io~secret/default-token-v55hm\\\" to rootfs \\\"/var/lib/docker/overlay2/813487ba91d534ded546ae34f2a05e7d94c26bd015d356f9b2641522d8f0d6da/merged\\\" at \\\"/var/run/secrets/kubernetes.io/serviceaccount\\\" caused \\\"stat /var/lib/kubelet/pods/128aea1f-bde3-43d5-8b5f-dd86b9a5ef33/volumes/kubernetes.io~secret/default-token-v55hm: no such file or directory\\\"\"": unknown
Updated the check and regex.
Make sure the SR-IOV device plugin is ready, and that
there are enough SR-IOV devices allocatable before
spinning up test pods.
Signed-off-by: vpickard <vpickard@redhat.com>
1. move the integration test of TaintBasedEvictions to test/integration/node
2. move the e2e test of TaintBasedEvictions to test/e2e/node
3. modify the conformance file to adapt the TaintBasedEvictions test
The current agnhost version is 2.12; 2.11 was not previously built, as the
VERSION bumps merged one after the other and the Image Promoter did not get to
build the 2.11 image.
In the current version, due to how make works, when building all the conformance
images (make all-push WHAT=all-conformance), ALL the images are built first
before being pushed.
This PR will allow images to be built and pushed immediately afterwards, so the first
images that have been successfully built are already pushed and promotable, even if
the task failed on the last image, or it timed out.
A previous PR (#76838) introduced the ability to build and publish
Windows Test Images to kubernetes/test/images/image-util.sh.
Additionally, that PR also configured the Image Promoter to use a
few Windows Remote Docker build nodes to build the Windows Test Images,
however, there is a minor issue: the build container has a different $HOME
folder than expected (is: /builder/home, expected: /root - since it's the
root user), and the Remote Docker credentials are mounted in /root.
Because of that, image-build.sh cannot find the credentials it needs.
This will have to be properly fixed, but for now, we can just skip
the Windows image building part.
Quite a few images are only used a few times in a few tests. Thus,
the images are being centralized into the agnhost image, reducing
the number of images that have to be pulled and used.
This PR replaces the usage of the following images with agnhost:
- dnsutils
dnsmasq is a Linux specific binary. In order for the tests to also
pass on Windows, CoreDNS should be used instead.
- Search/replace Google Infra kube-cross locations for K8s Infra
- Update kube-cross make targets
- Don't attempt to pre-pull image (docker build --pull)
This prevents CI failures when the image under test doesn't exist
yet in the registry.
- 'make all' now builds and pushes the kube-cross image
- Allow 'TAG' to be specified via env var
- Use 'KUBE_CROSS_VERSION' to represent the kube-cross version
- Tag kube-cross images with both a kubernetes version
('git describe') and a kube-cross version
- Add a GCB (Google Cloud Build) config file (cloudbuild.yaml)
Signed-off-by: Stephen Augustus <saugustus@vmware.com>
We don't want to set the name directly because then starting the pod
can fail when the node is temporarily out of resources
(https://github.com/kubernetes/kubernetes/issues/87855).
For CSI driver deployments, we have three options:
- modify the pod spec with custom code, similar
to how the NodeSelection utility code does it
- add variants of SetNodeSelection and SetNodeAffinity which
work with a pod spec instead of a pod
- change their parameter from pod to pod spec and then use
them also when patching a pod spec
The last approach is used here because it seems more general. There
might be other cases in the future where there's only a pod spec that
needs to be modified.
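Illustrative calls after the change (signatures paraphrased from the
description above, not quoted from the code):
```go
// The helpers now take a *v1.PodSpec, so the same call works both for a Pod
// and for a StatefulSet's pod template when patching a CSI driver deployment.
e2epod.SetNodeSelection(&pod.Spec, nodeSelection)
e2epod.SetNodeSelection(&statefulSet.Spec.Template.Spec, nodeSelection)
```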
A previous PR replaced the usage of Redis in the guestbook app test
with Agnhost. The replacement went well for Linux setups and Containers,
which is why the tests are green, but there is a network particularity on
Windows setups which won't allow the test to pass.
The issue was observed with another test hitting the same issue:
https://github.com/kubernetes/kubernetes/issues/83072
Here's exactly what happens during the test:
- frontend containers are created, having the /guestbook endpoint. Their main
purpose is to forward calls to either agnhost-master (cmd=set) or
agnhost-slave (cmd=get).
- agnhost-master container is created, having the /set endpoint, and the
/register endpoint, through which the agnhost-slave containers would
register to it. Its purpose is to propagate all data received through /set
to its clients.
- agnhost-slave containers are created, having the /set and /get endpoints.
They would register to agnhost-master, and then receive any and all updates
from it, which were then served through the /get endpoint.
For simplicity, all 3 types have the same agnhost subcommand (agnhost guestbook), being
able to satisfy its given purpose. For this, HTTP servers were being used, including
for the /register endpoints. agnhost-master would send its /set updates as /set HTTP
requests. However, because of the issue listed above, agnhost-master did not receive
the client's IP, but rather the container host's IP, resulting in the request being
sent to the wrong destination.
This PR updates the agnhost guestbook subcommand. Now, the agnhost subscriber nodes will
send their own IP to the /register endpoint (/endpoint?host=myip).
In order to promote the volume limits e2e test (from CSI Mock driver)
to Conformance, we can't rely on specific output of optional Condition
fields. Thus, this commit changes the test to only check the presence
of the right condition and verify that the optional fields are not empty.
The existing walk.go and conformance.txt have a few shortcomings
which we'd like to resolve:
- difficult to get the full test name due to test context nesting
- complicated AST logic and understanding necessary due to the
different ways a test can be invoked and written
This changes the AST parsing logic to be much simpler: it simply
looks for the comments at/around a specific line. This file/line
information (and the full test name) is gathered by a custom ginkgo
reporter which dumps the SpecSummary data to a file.
Also, the SpecSummary dump can, itself, be potentially useful for
other post-processing and debugging tasks.
Signed-off-by: John Schnake <jschnake@vmware.com>
The service session affinity allows setting the maximum session
sticky timeout.
This commit adds e2e tests to check that the session is sticky
before the timeout and is not after.
Executing commands in pods is expensive in terms of time, and the
execution time is unpredictable and random.
The session affinity tests send several HTTP requests from a pod
to check that the session is sticky. Instead of executing one
HTTP request at a time, we can execute several requests from the
pod at once and process the combined output (see the sketch below).
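A sketch of that batching using the framework's exec helper (variable names
are illustrative):
```go
// One exec runs many requests; parsing the combined output replaces many
// slow, unpredictable per-request execs.
cmd := fmt.Sprintf(
	"for i in $(seq 1 %d); do curl -s --connect-timeout 2 http://%s/; echo; done",
	requests, serverIPPort)
stdout, err := framework.RunHostCmd(ns, execPodName, cmd)
```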
The image "gcr.io/authenticated-image-pulling/windows-nanoserver:v1" is not a
manifest list, and it is only useful for Windows Server 1809, which means that the
test "should be able to pull from private registry with secret" will fail for
environments with Windows Server 1903, 1909, or any other future version we might
want to test.
This commit adds the ability to have an alternative private image to pull, by
using a configurable docker config file which contains the credentials
needed to pull the image.
Add a new e2e test to test multiple stacked NetworkPolicies with
Except clauses in IPBlock which overlaps with an allowed CIDR in
another NetworkPolicy. This test ensures that the order of the
creation of NetworkPolicies should not matter when evaluating
a Pod's access to another Pod.
Previously, we've centralized several images into agnhost, including
test-webserver.
The Hybrid cluster network test was using the test-webserver image, and
was updated to use agnhost, but without making it behave like
test-webserver, resulting in a failing test.
We have added and enabled the Image Promoter on the k/k test images, which
will build the conformance images after a PR that affects kubernetes/test/images
merges.
We have added support for image-util.sh to handle external Windows Docker connections
in order to build Windows images.
This PR enables the Image Promoter to use some Windows nodes to build the necessary
Windows images.
In order to build Windows container images for multiple OS versions,
--isolation=hyperv is required. However, not all clouds / nodes support
it or have it enabled by default, which is why we're going to rely on
having multiple nodes to build the Windows images, until this issue
is addressed.
This commit adds support for building test images for multiple
Windows versions, as we have to support both LTS and SAC channels.
With this, the format for Windows images in the BASEIMAGE files is:
OS/ARCH/OS_VERSION
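As an illustration, a Windows entry in a BASEIMAGE file might then look like
this (the right-hand side is a guess, not copied from a real file):
```
windows/amd64/1809=mcr.microsoft.com/windows/servercore:ltsc2019
```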
Also adds --isolation=hyperv to the Windows docker build command, making sure
that container images for multiple OS versions can be built using the same
Windows node.
Adds Windows support to the test/images/image-util.sh script.
A Windows node with Docker installed is required to build Windows images.
The connection URL to it must be set in the REMOTE_DOCKER_URL env variable.
Additionally, the authentication to the remote docker node is done through
certificates, which must be found in ~/.docker.
By default, the REMOTE_DOCKER_URL env variable is set to "" in the Makefile,
and because of it, the image-util.sh script will skip building and pushing
Windows images.
Added GOOS argument to the go build process in order to be able to build
Windows binaries. Additionally, the OS env variable was added to the images
Makefiles (default value is "linux") in order to maintain default behaviour.
Some images require a different Dockerfile for Windows images, since they
have different ways of installing dependencies. Because of this, if an image
needs to be built for Windows, the build will first look for a Dockerfile_windows
file instead of the default one. If there isn't one, it means that the
same Dockerfile can be used for both Windows and Linux.
All Windows images will be based on the image
"mcr.microsoft.com/windows/servercore:ltsc2019". There are a couple of features
that are needed from this image, especially powershell.
Added busybox image for Windows. Most Windows images will be based on it, which
will help reduce the command line differences between Linux and Windows, but
not entirely.
Added Windows support for agnhost image.
Changes the image naming template from:
$REGISTRY/$image-$arch:$TAG
to
$REGISTRY/$image:$TAG-$os_name-$arch
The previous naming template would generate a plethora of image repositories
(one per image and OS/architecture pair, i.e. A1 + ... + AN repositories, where Ai is the
number of OS/architecture combinations for image i and N is the number of images),
while the new naming template reduces the number of image repositories to N.
The new template also includes the OS name, as we plan to integrate Windows
images into the manifest lists as well.
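Applied to agnhost as an example, the same build moves from:
```
old: $REGISTRY/agnhost-amd64:2.13
new: $REGISTRY/agnhost:2.13-linux-amd64
```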
When building images, their REGISTRY can be set to a custom
one, instead of the default "gcr.io/kubernetes-e2e-test-images" or
"us.gcr.io/k8s-artifacts-prod/e2e-test-images".
Some images are based on other images we're already building
(e.g.: kitten, nautilus), but their base images
are pinned to the default registry name, which can be undesirable.
This commit addresses this issue.
Windows images will require other base images, and thus, we will need
to explicitly specify the OS type a base image is for in order to
avoid confusion or errors.
The way the images are built is going to be changed, and in order to avoid
overwriting and breaking the current images, the image versions are bumped.
Similar functionality is required across e2e tests for RuntimeClass.
Let's create runtimeclass as part of the framework/node package.
Signed-off-by: Eric Ernst <eric.ernst@intel.com>
Use bytes instead of strings, and an in-place slice filter
(see https://github.com/golang/go/wiki/SliceTricks#filter-in-place)
to avoid copying strings around.
In my benchmark it shows almost 2x improvement:
BenchmarkString-8 1477207 10198 ns/op
BenchmarkBuffer-8 1561291 7622 ns/op
BenchmarkInPlace-8 2295714 5202 ns/op
String is the original implementation, Buffer is an intermediary
one that uses strings.Builder, and InPlace is the one from this commit.
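For reference, the in-place filter pattern from SliceTricks looks roughly like
this (simplified, not the exact committed code):
```go
// filterInPlace keeps only the lines for which keep returns true, reusing the
// backing array instead of allocating a new slice.
func filterInPlace(lines [][]byte, keep func([]byte) bool) [][]byte {
	n := 0
	for _, l := range lines {
		if keep(l) {
			lines[n] = l
			n++
		}
	}
	return lines[:n]
}
```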
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Add a new e2e test to test the Except clauses in IPBlock CIDR
based NetworkPolicies. This test adds an egress rule which
allows the client to connect to a CIDR which includes the
ServerPod's IP, but carves out an except subnet which excludes
this ServerPod.