* capture loop variable
* capture the loop variable and don't fail on not found errors
* capture loop variable
* Revert "Mark restart_test as flaky"
This reverts commit 990e9506de.
* skip e2e node restart test with dockershim
* Update test/e2e_node/restart_test.go
Co-authored-by: Mike Miranda <mikemp96@gmail.com>
* capture loop using index
Co-authored-by: Mike Miranda <mikemp96@gmail.com>
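For context, the "capture loop variable" fixes address the usual Go pitfall where a closure launched inside a `for` loop captures the loop variable itself instead of the value of the current iteration. A minimal sketch of the pattern, with illustrative names rather than the actual test code:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	nodes := []string{"node-a", "node-b", "node-c"}
	var wg sync.WaitGroup

	for i := range nodes {
		node := nodes[i] // capture by index: each iteration gets its own copy
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Without the per-iteration copy above, every goroutine could see
			// the last value the loop variable ever held.
			fmt.Println(node)
		}()
	}
	wg.Wait()
}
```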
The device_plugin_tests have not run successfully in a very long time,
initially being marked flaky and then eventually becoming stale.
The gpu_device_plugin_tests have been used to test the same behaviour,
but are incredibly high maintenance due to external changes in behaviour
from GCP/Nvidia that we have no control over.
This commit takes the existing device plugin tests, makes them look more
like the GPU tests, and removes the cases that have been unsupported for
a long time (namely restarting containers while the plugin is
unavailable).
It also removes the GPU plugin tests, as we do not get more signal by
using real devices here.
The featureGates field in ClusterConfiguration ends up
as a map[interface{}]interface{} in the test suite
and cannot be cast to map[string]bool directly.
Adapt the test to use map[interface{}]interface{}.
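A minimal sketch of the conversion the test has to perform, assuming the YAML decoder produced `map[interface{}]interface{}` (the helper name is illustrative):

```go
package main

import "fmt"

// featureGatesToBoolMap converts a decoded featureGates value, which arrives as
// map[interface{}]interface{} from the YAML decoder, into a map[string]bool.
func featureGatesToBoolMap(in map[interface{}]interface{}) (map[string]bool, error) {
	out := make(map[string]bool, len(in))
	for k, v := range in {
		key, ok := k.(string)
		if !ok {
			return nil, fmt.Errorf("feature gate key %v is not a string", k)
		}
		val, ok := v.(bool)
		if !ok {
			return nil, fmt.Errorf("feature gate %q has non-bool value %v", key, v)
		}
		out[key] = val
	}
	return out, nil
}
```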
On a multi NUMA node environment, the kernel splits hugepages allocated under the
/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages file equally between NUMA nodes.
That makes it harder to predict where several pods will start, because the number
of hugepages on each NUMA node depends on the number of NUMA nodes in the environment.
The memory manager test will allocate hugepages on a specific NUMA node to make
the test more predictable on multi NUMA node environments.
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
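For reference, hugepages can be reserved on a single NUMA node through the per-node sysfs path instead of the global one; a minimal sketch, not the test's actual helper:

```go
package main

import (
	"fmt"
	"os"
)

// allocateHugepagesOnNode writes the desired number of 2Mi hugepages to the
// per-NUMA-node sysfs file, so the kernel does not split the allocation
// across all NUMA nodes.
func allocateHugepagesOnNode(numaNode, count int) error {
	path := fmt.Sprintf("/sys/devices/system/node/node%d/hugepages/hugepages-2048kB/nr_hugepages", numaNode)
	return os.WriteFile(path, []byte(fmt.Sprintf("%d", count)), 0644)
}

func main() {
	if err := allocateHugepagesOnNode(0, 8); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```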
Let's wait for the local node (aka the kubelet)
to be ready before querying podresources again,
to avoid false negatives.
Co-authored-by: Artyom Lukianov <alukiano@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
DKC is being removed and we don't want it to continue flaking the rest
of our tests. Let's disable them when DKC is disabled rather than hard
failing. This fits more in line with our other E2Es, and reduces the
maintenance load in test-infra.
we need to make sure the system state is completely cleaned up
again, to avoid messing up the shared node state, before
we transition from one test to another.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Since commit 42dd01aa3f the cpuRequest is in millicores, hence
we need to properly check the translation to exclusive cpus
when verifying the resource allocation.
Signed-off-by: Francesco Romani <fromani@redhat.com>
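A minimal sketch of the translation being checked, assuming only integral requests (multiples of 1000 millicores) map to exclusive cpus under the static policy; the function name is illustrative:

```go
package main

// exclusiveCPUs returns the number of exclusively allocated CPUs implied by a
// guaranteed container's CPU request expressed in millicores.
func exclusiveCPUs(cpuRequestMilli int64) int64 {
	if cpuRequestMilli%1000 != 0 {
		return 0 // fractional requests are served from the shared pool
	}
	return cpuRequestMilli / 1000
}
```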
the intent is to make the code more readable, no intended
changes in behaviour. Now it should be a bit more explicit
why the code is checking some values.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Signed-off-by: wangyysde <net_use@bzhy.com>
Generate swagger.json.
Use v2 path for hpa_cpu_field.
run update-codegen.sh
Signed-off-by: wangyysde <net_use@bzhy.com>
Some of the networking tests are flaking, and logging the command stdout and stderr
might show us some additional information about the underlying issue when it
occurs.
Some tests have a short timeout for starting the pods (1 minute), but if
those tests happen to be the first ones to run, and the images have to be
pulled, then the test could time out, especially with larger images. This
commit will allow us to prepull commonly used E2E test images, so this issue
can be avoided.
The logic to detect stale endpoints was not taking endpoint readiness
into account.
We can have stale entries on UDP services for 2 reasons:
- an endpoint was receiving traffic and is removed or replaced
- a service was receiving traffic but not forwarding it, and starts
to forward it.
Add an e2e test to cover the regression
The --log-file parameter will be deprecated as of Kubernetes 1.23 and should be
avoided. The replacement for distroless images is the image with go-runner, a
tool that handles output redirection.
For kubemark to run in that image it must be built as a static binary.
* Bump the pod status and node status update timeouts to avoid flakes
* Add a small delay after dbus restart to ensure dbus has enough time to
start up prior to sending the shutdown signal
* Change check of pod being terminated by graceful shutdown. Previously,
the pod phase was checked to see if it was `Failed` and the pod reason
string matched. This logic needs to change after 1.22 graceful node
shutdown change introduced in PR #102344 which changed behavior to no
longer put the pods into a failed phase. Instead, the test now checks
that containers are not ready, and the pod status message and reason
are set appropriately.
Signed-off-by: David Porter <david@porter.me>
For some test failures, checking the pod logs could potentially
yield some interesting information, which could be used to further
investigate certain failures / flakes (for example, if there are some
networking issues, we could at least see if requests reach the containers,
since agnhost logs the connections / requests, or if there were any
other issues during the container's startup).
It looks like it tests two pods sharing the same volume, but the goal is
actually the opposite - two pods with the same inline volume definition
should get separate volumes.
This commit forces Kubelet Configuration files to always be generated
and when possible will use the kubeletconfig file that has been provided
by the test orchestrator
This commit enables the remote runner to provide a KubeletConfiguration
file to the test suite when uploading it to a remote host; the test
runner will then use this configuration to run the Kubelet with the
provided config.
* De-share the Handler struct in core API
An upcoming PR adds a handler that only applies on one of these paths.
Having fields that don't work seems bad.
This never should have been shared. Lifecycle hooks are like a "write"
while probes are more like a "read". HTTPGet and TCPSocket don't really
make sense as lifecycle hooks (but I can't take that back). When we add
gRPC, it is EXPLICITLY a health check (defined by gRPC) not an arbitrary
RPC - so a probe makes sense but a hook does not.
In the future I can also see adding lifecycle hooks that don't make
sense as probes. E.g. 'sleep' is a common lifecycle request. The only
option is `exec`, which requires having a sleep binary in your image.
* Run update scripts
* fix_dsc_rbac_pod_update
* add test for DaemonSet Controller updates label of the pod after "DedupCurHistories"
* rebase
* update parameter of dsc.Run
Copying from pvcBlock swapped name and namespace (breaking the PVC test case)
and some references to the pvcBlock variable were left unchanged (incorrect
annotations for test failures).
Add an e2e test to exercise the checkpoint recovery flow.
This means we need to actually create an old (V1, pre-1.20) checkpoint,
but if we do it only in the e2e test, it's still fine.
Signed-off-by: Francesco Romani <fromani@redhat.com>
The previous implementation called Update() without changing anything
about the object, so no MODIFIED events were ever generated. This change
ensures that all calls to Update() cause mutations, thereby ensuring
that MODIFIED events happen in the watch stream.
Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
During PR review it was pointed out that the branches for ephemeral
vs. persistent make the test harder to read. Therefore all code that depends on
if checks gets moved into two different versions of the test, one that runs for
ephemeral volumes and one for persistent volumes, with skip statements at the
beginning.
* Cleanup FeatureGate skippers
* Perform changes requested by review
* some more review related changes
* Rename skipper functions to make code more readable
* add utilfeature back in
Conceptually, snapshots have to be taken while the pod and thus the volume
exist. Snapshotting has an issue where flushing of data is not guaranteed while
the volume is still staged on the node, so the test relied on deleting the pod
and checking for the volume to be unused. That part of the test cannot be done
for ephemeral volumes.
This is a fix for the new test case from
https://github.com/kubernetes/kubernetes/pull/105636 which had to be merged
without prior testing due to not having a cluster to test on and no pull job
which runs these
tests. https://testgrid.k8s.io/sig-storage-kubernetes#gce-serial then showed a
failure.
The fix is simple: in the ephemeral case, the PVC name isn't set in advance in
pvc.Name and instead must be computed. The fix now was tested on a kubetest
cluster in GCE.
Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
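For reference, the PVC created for a generic ephemeral volume is named after the pod and the volume, so the test can compute it instead of reading pvc.Name; a minimal sketch with an illustrative helper name:

```go
package main

// ephemeralPVCName returns the name of the PVC that is created for a generic
// ephemeral volume: "<pod name>-<volume name>".
func ephemeralPVCName(podName, volumeName string) string {
	return podName + "-" + volumeName
}
```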
This adds a new test pattern and uses it for the inline volume tests. Because
the kind of volume now varies more, validation of the mount or block device is
always done by the caller of TestEphemeral.
hostPath volume plugin creates a directory within /tmp on host machine, to be mounted as volume.
inject-pod writes content to the volume, and a client-pod tries to read the contents and verify them.
when SELinux is enabled on the host, the client-pod can not read the content, getting permission denied.
running the client-pod as privileged, so that it can access the volume content, even when SELinux is enabled on the host.
Enable feature by default.
Update integration tests for other features to assume that finalizers are present.
Change-Id: Ie969344f572627dba882c0e862e5700dadaf3026
Besides "subPath should unmount if pod is gracefully deleted while kubelet is
down" we also need a special case for "subPath should unmount if pod is force
deleted while kubelet is down".
This fixes a test failure in https://testgrid.k8s.io/sig-storage-kubernetes#gce-serial
It shouldn't make any difference, but it's better to actually test that
assumption.
All existing tests which create pods get converted by skipping the explicit PVC
creation for the ephemeral case and instead modifying the test pod so that it
has a volume claim template with the same spec as the PVC.
The feature gate gets locked to "true", with the goal to remove it in two
releases.
All code now can assume that the feature is enabled. Tests for "feature
disabled" are no longer needed and get removed.
Some code wasn't using the new helper functions yet. That gets changed while
touching those lines.
Each e2e test knows it wants to restart a running kubelet or a
non-running kubelet. The vast majority of times, we want to
restart a running kubelet (e.g. to change config or to check
some properties hold across kubelet crashes/restarts), but sometimes
we stop the kubelet, do some actions and only then restart.
To accommodate both use cases, we just expose the `running` boolean
flag to the e2e tests.
Having the `restartKubelet` explicitly restarting a running kubelet
helps us troubleshoot e2e failures in which the kubelet
was supposed to be running, while it was not; attempting a restart
in such cases only muddied the waters further, making the
troubleshooting and the eventual fix harder.
In the happy path, no expected change in behaviour.
Signed-off-by: Francesco Romani <fromani@redhat.com>
In the `restartKubelet` helper, we use `exec.Command`, whose
return value is the output of the command, but as `[]byte`.
We logged the output of the command as a raw value, making
the output, meant to be human readable, unnecessarily hard to read.
We fix this annoying behaviour by converting the output to a string
before logging it, making it obvious to understand the outcome of
the command.
Signed-off-by: Francesco Romani <fromani@redhat.com>
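A minimal sketch of the logging fix; the command and messages are illustrative, not the actual helper:

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// CombinedOutput returns []byte; logging it as a value prints a byte slice,
	// so convert it to a string first to keep the output human readable.
	out, err := exec.Command("systemctl", "restart", "kubelet").CombinedOutput()
	if err != nil {
		fmt.Printf("restart failed: %v, output: %s\n", err, string(out))
		return
	}
	fmt.Printf("restart succeeded, output: %s\n", string(out))
}
```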
This patch changes cpuCount to cpuRequest in order to cater to cases
where guaranteed pods make non-integral CPU Requests.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
apparmor is no longer found in Alpine edge/testing but in
edge/community, presumably in preparation for full-fledged inclusion in
3.15. If so, once that is released, BASEIMAGE can be updated again and
the explicit --repository flag to 'apk add' dropped.
Fixes: https://github.com/kubernetes/kubernetes/issues/105528
Once the node gets deleted, the nodelifecycle controller
is racing to update the pod status, and the pod deletion logic
is failing, causing tests to flake. This commit moves
the testContext creation into the test loop and deletes the nodes and
namespace within the test loop. We don't explicitly call the node
deletion within the loop, but the `testutils.CleanupTest(t, testCtx)`
call ensures that the namespace and nodes get deleted.
While running tests in parallel, especially those with higher loads
than others, it might take some time for Pods to be Running, even more
so if the image has to be pulled as well.
The test [sig-node] Pods should delete a collection of pods [Conformance]
only waits for the pods to be scheduled before deleting them, and
expects them to be gone in 1 minute, which can flake because of the above
reasons. Note that the operations are in order, and kubelet runs them in
order, which means that the pod first has to enter the Running state
before attempting to delete it.
This commit waits for the Pods to enter the Running state first before
deleting the entire collection.
Co-Authored-By: Antonio Ojea <aojea@redhat.com>
This commit fixes the LocalStorageCapacityIsolationEviction test by
acknowledging that in its default configuration kubelet will no-longer
evict memory-backed volume pods as they cannot use more than their
assigned limit with SizeMemoryBackedVolumes enabled.
To account for the old behaviour, we also add a test that explicitly
disables the feature to test the behaviour of memory backed local
volumes in those scenarios. That test can be removed when/if the feature
gate is removed.
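A minimal sketch of a memory-backed emptyDir with an explicit size limit, the kind of volume whose eviction behaviour changes once SizeMemoryBackedVolumes is enabled (the helper name is illustrative):

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// memoryBackedVolume returns an emptyDir backed by tmpfs with an explicit size
// limit; with SizeMemoryBackedVolumes enabled the tmpfs is sized accordingly,
// so the pod cannot write past its limit and trigger local-storage eviction.
func memoryBackedVolume(name, limit string) corev1.Volume {
	sizeLimit := resource.MustParse(limit)
	return corev1.Volume{
		Name: name,
		VolumeSource: corev1.VolumeSource{
			EmptyDir: &corev1.EmptyDirVolumeSource{
				Medium:    corev1.StorageMediumMemory,
				SizeLimit: &sizeLimit,
			},
		},
	}
}
```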
Currently the storage eviction tests fail for a few reasons:
- They re-enter storage exhaustion after pulling the images during
cleanup (increasing test storage reqs, and adding verification for
future diagnosis)
- They were timing out, as in practice it seems that eviction takes just
over 10 minutes on an n1-standard in many cases. I'm raising these to
15 to provide some padding.
This should ideally bring these tests to passing on CI, as they've now
passed locally for me several times with the remote GCE env.
Follow up work involves diagnosing why these take so long, and
restructuring them to be less finicky.
When adding the ephemeral volume feature, the special case for
PersistentVolumeClaim volume sources in kubelet's host path and node
limits checks was overlooked. An ephemeral volume source is another
way of referencing a claim and has to be treated the same way.
The recommendation from #sig-cli was to print usage, then the error. Extra care
is taken to only print the usage instruction when the error really was about
flag parsing.
Taking kube-scheduler as example:
$ _output/bin/kube-scheduler
I0929 09:42:42.289039 149029 serving.go:348] Generated self-signed cert in-memory
...
W0929 09:42:42.489255 149029 client_config.go:620] error creating inClusterConfig, falling back to default config: unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined
E0929 09:42:42.489366 149029 run.go:98] "command failed" err="invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable"
$ _output/bin/kube-scheduler --xxx
Usage:
kube-scheduler [flags]
...
--vmodule moduleSpec
comma-separated list of pattern=N settings for file-filtered logging
Error: unknown flag: --xxx
The kubectl behavior doesn't change:
$ _output/bin/kubectl get nodes
Unable to connect to the server: dial tcp: lookup xxxx: No address associated with hostname
$ _output/bin/kubectl --xxx
Error: unknown flag: --xxx
See 'kubectl --help' for usage.
It wasn't documented that InitLogs already uses the log flush frequency, so
some commands have called it before parsing (for example, kubectl in the
original code for logs.go). The flag never had an effect in such commands.
Fixing this turned into a major refactoring of how commands set up flags and
run their Cobra command:
- component-base/logs: implicitly registering flags during package init is an
anti-pattern that makes it impossible to use the package in commands which
want full control over their command line. Logging flags must be added
explicitly now, something that the new cli.Run does automatically.
- component-base/logs: AddFlags would have crashed in kubectl-convert if it
had been called because it relied on the global pflag.CommandLine. This
has been fixed and kubectl-convert now has the same --log-flush-frequency
flag as other commands.
- component-base/logs/testinit: tests are an exception, because flag.CommandLine has
to be used there. This new package can be imported to add the flags to that flag set
once per test program.
- Normalization of the klog command line flags was inconsistent. Some commands
unintentionally didn't normalize to the recommended format with hyphens. This
gets fixed for sample programs, but not for production programs because
it would be a breaking change.
This refactoring has the following user-visible effects:
- The validation error for `go run ./cmd/kube-apiserver --logging-format=json
--add-dir-header` now references `add-dir-header` instead of `add_dir_header`.
- `staging/src/k8s.io/cloud-provider/sample` uses flags with hyphen instead of
underscore.
- `--log-flush-frequency` is not listed anymore in the --logging-format flag's
`non-default formats don't honor these flags` usage text because it will also
work for non-default formats once it is needed.
- `cmd/kubelet`: the description of `--logging-format` uses hyphens instead of
underscores for the flags, which now matches what the command is using.
- `staging/src/k8s.io/component-base/logs/example/cmd`: added logging flags.
- `apiextensions-apiserver` no longer prints a useless stack trace for `main`
when command line parsing raises an error.
We graduate the `CPUManagerPolicyOptions` feature to beta
in the 1.23 cycle, and we add new experimental feature gates
to guard new options which are planned in the 1.23 and in the
following cycles.
We introduce additional feature gates called `CPUManagerPolicyAlphaOptions` and
`CPUManagerPolicyBetaOptions`. The basic idea is to avoid the
cumbersome process of adding a feature gate for each option, and to have
feature gates which track the maturity level of _groups_ of options.
Besides this change, the graduation process, and the process in general,
for adding new policy options is still unchanged.
The `full-pcpus-only` option added in the 1.22 cycle is intentionally
moved into the beta policy options.
For more details:
- KEP: https://github.com/kubernetes/enhancements/pull/2933
- sig-arch discussion:
https://groups.google.com/u/1/g/kubernetes-sig-architecture/c/Nxsc7pfe5rw
Signed-off-by: Francesco Romani <fromani@redhat.com>
The boolean values for --dry-run have been deprecated for removal since
1.18, more than 2 releases.
The default value for --dry-run with the flag set and unspecified has
been deprecated for removal since 1.18, more than 2 releases.
Both values are now removed in this change. Any kubectl --dry-run
usage no longer accepts --dry-run=(true|false) boolean values and usage
now requires that a value of (client|server|none) is specified.
The boom-server container forges out-of-order TCP packets and injects them into the network. This requires the container to have the CAP_NET_RAW linux capability, otherwise the test will fail.
Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
* Updates ImpersonationConfig in rest/config.go to include UID
attribute, and pass it through when copying the config
* Updates ImpersonationConfig in transport/config.go to include UID
attribute
* In transport/round_tripper.go, Set the "Impersonate-Uid" header in
requests based on the UID value in the config
* Update auth_test.go integration test to specify a UID through the new
rest.ImpersonationConfig field rather than manually setting the
Impersonate-Uid header
Signed-off-by: Margo Crawford <margaretc@vmware.com>
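A minimal sketch of specifying the UID through the client-go config; the user name and UID values are illustrative:

```go
package example

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// impersonatedClient returns a clientset whose requests carry Impersonate-User
// and the new Impersonate-Uid header.
func impersonatedClient(cfg *rest.Config) (*kubernetes.Clientset, error) {
	impersonated := rest.CopyConfig(cfg)
	impersonated.Impersonate = rest.ImpersonationConfig{
		UserName: "jane-doe",
		UID:      "1234-5678",
		Groups:   []string{"system:authenticated"},
	}
	return kubernetes.NewForConfig(impersonated)
}
```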
By parsing flags in the test's main function before starting etcd we bail out
early without ever starting etcd when the test was invoked with -help.
Otherwise etcd must be available, gets started and then hangs because
flag.Parse itself exits when called by testing.go. This bypasses the code in
EtcdMain which normally stops etcd.
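A minimal sketch of the resulting TestMain pattern, assuming the integration test framework's EtcdMain helper:

```go
package integration

import (
	"flag"
	"testing"

	"k8s.io/kubernetes/test/integration/framework"
)

func TestMain(m *testing.M) {
	// Parse the flags first: when the binary is invoked with -help, flag.Parse
	// prints usage and exits before etcd is ever started.
	flag.Parse()
	framework.EtcdMain(m.Run)
}
```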
Otherwise, nodeNameToPodList[nodeName] list will have all its references
identical (corresponding to the control variable reference).
Thus, making all the pods in the list identical.
The agnhost pods using netexec will bind by default to the UDP
port 8081, use a different port for hostNetwork pods to avoid
scheduling conflicts and failing the tests.
There are some tests that don't need the UDP listener, so they
can disable it.
This is specially needed for tests that use hostNetwork pods, if 2
pods try to bind to the same port, the test will fail because one
of the pods can't be scheduled because of the port conflict.
To keep backwards compatibility, we can add an option to disable
the UDP listener by setting the port number to -1, that is consistent
with the SCTP implementation.
The issue in both tests is that before the refactor we had a method that
was creating only the `StorageClass` manifest; this manifest was later
created by `TestBindingWaitForFirstConsumerMultiPVC`. After
the refactor we're ensuring that the `StorageClass` exists as a resource
before calling `TestBindingWaitForFirstConsumerMultiPVC`, however this
method is still attempting to create it; that's the reason behind the
error: `resourceVersion should not be set on objects to be created`.
This issue wasn't caught before because
`TestBindingWaitForFirstConsumerMultiPVC` is creating the StorageClass
without the common utility function. The solution is to remove the
snippet that attempts to create the StorageClass again.
This test case requires special test-handler setup which is only done
for gce clusters created by kube-up scripts. Let's skip the test when
run under other providers.
Use ExtraConfig to configure the repair interval
and add an integration test for services finalizers, and
possible races with the services repair loop.
- Debian base used was older (v2.1.3) missing multiple fixed CVEs
- Minor update to distroless debian image name to explicitly point
to debian 10
- Debian base image now points to buster-1.9.0
The Topology Manager e2e tests want to run on a real multi-NUMA system
and want to consume real devices supported by device plugins; SRIOV
devices happen to be the most commonly available of such devices.
CI machines aren't multi NUMA nor expose SRIOV devices, so the biggest portion
of the tests will just skip, and we need to keep it like this until we
figure out how to enable these features.
However, some organizations can and want to run the testsuite on bare metal;
in this case, the current test will skip (not fail) with misconfigured
boxes, and this reports a misleading result. It will be much better to
fail if the test preconditions aren't met.
To satisfy both needs, we add an option, controlled by an environment
variable, to fail (not skip) if the machine on which the tests run
doesn't meet the expectations (multi-NUMA, 4+ cores per NUMA cell,
expose SRIOV VFs).
We keep the old behaviour as default to keep being CI friendly.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Previously we would try to infer the `ipFamilyPolicy` from `clusterIPs`
and/or `ipFamilies`. That is too tricky. Now you MUST specify
`ipFamilyPolicy` as one of the dual-stack options in order to get a
dual-stack service.
In older versions of Kubernetes (at least pre-0.19, the earliest version
this test will run unmodified on), Pods that depended on devices could be
restarted after the device plugin had been removed. Currently however,
this isn't possible, as during ContainerManager.GetResources(), we
attempt to DeviceManager.GetDeviceRunContainerOptions() which fails as
there's no cached endpoint information for the plugin type.
This commit therefore breaks apart the existing test into two:
- One active test that validates that assignments are maintained across
restarts
- One skipped test that validates the behaviour after GPUs have been
removed, in case we decide that this is a bug that should be fixed in
the future.
15m is enough for Cluster Autoscaler to remove empty nodes, so we need
to break them sooner than that. Instead, wait 15m after breaking them to
ensure Cluster Autoscaler will consider them as unready instead of still
starting.
The profile gatherer has been removed in
https://github.com/kubernetes/kubernetes/pull/85304, so those options
are unused since then and can therefore be removed.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
The e2e test "should have Endpoints and EndpointSlices pointing to
the API Server Service" was verifying the current endpoints
reconciler implementation on the apiservers, however, users may
disable the endpoint reconciler and create their own.
This e2e test is also a conformance test, so we should test the
behaviour and not the implementation details. The test verifies
that a kubernetes.default service exists, and that endpoint and endpoint
slices objects referencing that service exist and are equivalent.
The Container Images for Windows Server 2022 have been published, and we can
start adding jobs for them.
The ltsc2022-based images have been built and promoted with these image versions.
The PR https://github.com/kubernetes/kubernetes/pull/104575 introduces
some intermediate types which makes the 32GiB memory machine kill the
typecheck process. To resolve that issue and make the test more robust,
we now reduce the number of parallel typechecks to `2`.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Prior to this change, the pod was not getting scheduled on the node as
we don't have a running scheduler in e2e_node. PodClient solves this
problem by manually assigning the pod to the node.
The current GPU installer was built in 2017, from source that no longer
exists in Kubernetes ([adding commit][1]). The image was built on 2017-06-13.
Unfortunately, this installer no longer appears to work. When debugging
on the same node type as used by test-infra, it failed to build the
driver as the kernel sha was no longer available.
This led to needing to find a new way to install GPUs. The smallest
logical change was switching to [cos-gpu-installer][2]
. There is a newer version of this available on [googlesource][3] that
I have not yet tested as it's not clear what the state of the project
is, as I couldn't find docs outside of the source itself.
We install things to the same location as previously to avoid needing
extra downstream changes. There are a couple of weird issues here
however, like needing to run the container twice to correctly update the
LD Cache.
[1]: 1e77594958/cluster/gce/gci/nvidia-gpus/Dockerfile
[2]: https://github.com/GoogleCloudPlatform/cos-gpu-installer
[3]: https://cos.googlesource.com/cos/tools/+/refs/heads/master/src/cmd/cos_gpu_installer/
Different CSI drivers have different error messages, making it difficult
to check them accurately. We remove the check for the error message and
only check the failure type instead, since that is all we need.
If the device plugin returns a device without topology, keep it internally
as NUMA node -1; this helps at the podresources level to not export NUMA
topology. Otherwise the topology is exported with NUMA node id 0,
which is not accurate.
It's impossible to unveil this bug just by tracing json.Marshal(resp)
in the podresources client, because the NUMANode's ID field has the json
property omitempty; in this case, when ID=0 the NUMANode is shown as empty.
To reproduce it, it is better to iterate over the devices and just
trace dev.Topology.Nodes[0].ID.
Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>
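A minimal sketch of the suggested debugging approach, assuming the podresources v1 API types; the function name is illustrative:

```go
package example

import (
	"fmt"

	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// debugDeviceTopology prints the NUMA node ID of each device directly, because
// json.Marshal hides an ID of 0 behind the omitempty tag.
func debugDeviceTopology(resp *podresourcesv1.ListPodResourcesResponse) {
	for _, pod := range resp.GetPodResources() {
		for _, cnt := range pod.GetContainers() {
			for _, dev := range cnt.GetDevices() {
				topo := dev.GetTopology()
				if topo == nil || len(topo.GetNodes()) == 0 {
					fmt.Printf("%s/%s %s: no topology\n", pod.GetName(), cnt.GetName(), dev.GetResourceName())
					continue
				}
				fmt.Printf("%s/%s %s: NUMA node %d\n", pod.GetName(), cnt.GetName(), dev.GetResourceName(), topo.GetNodes()[0].GetID())
			}
		}
	}
}
```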
The Container Images for Windows Server 2022 have been published, and
we can start building test images using them, so we can start adding
jobs for them.
The image versions for the e2e test images have been bumped in a previous
commit, but haven't been promoted yet. We don't need to bump them here.
httpd-2.4.46-win64-VC15.zip no longer exists, so we have to use
httpd-2.4.48-win64-VC15.zip instead.
Even though DynamicKubeletConfig is deprecated, it is still used in the e2e_node tests.
The bug is hidden by the forcibly enabled option
TEST_ARGS='--feature-gates=DynamicKubeletConfig=true'
If this option is not enabled, setKubeletConfiguration tries to set the
kubelet config via the apiserver interface and fails with a timeout.
Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>
Agnhost's serve-hostname at endpoint /hostname
will return the hostname. The pod's host node name may
be an FQDN, so the comparison between the two fails.
Signed-off-by: Martin Kennelly <mkennell@redhat.com>
The Container Images for Windows Server 2022 have been published, and
we can start building test images using them, so we can start adding
jobs for them.
The image versions for the e2e test images have been bumped in a previous
commit, but haven't been promoted yet. We don't need to bump them here.
We're starting with windows-servercore-cache and busybox images, since
they are needed for the other images the most.
A previous commit added LD_FLAGS for the go binary compilation, but it's not
defined for all images.
The pods using hostNetwork use the host network namespace, hence
they have to share it with the rest of the processes and pods.
If several pods try to bind to the same port, the test will fail,
so we try to use an uncommon port, and run the different scenarios
in the same test, so we only have to bind once and we avoid consuming
ports, reducing the port collision risk.
In the test image build jobs, the image-util.sh script is not being run in a git
repository, which causes git log to fail.
In this case, we can use the PULL_BASE_SHA set in cloudbuild.yaml instead.
Windows Containerd has more features than Windows Docker. One of them is single file
mappings, allowing us to also map individual files into containers, not just folders.
This will set the tag [Excluded:WindowsDocker] for those tests instead of [LinuxOnly].
Co-authored-by: Mark Rossetti <marosset@microsoft.com>
A container without elevated privileges binding to a
host port lower than 1024 causes a bind permission
denied error.
Increase the port number above 1024 to allow
binding.
Signed-off-by: Martin Kennelly <mkennell@redhat.com>
All dependencies of VolumeBinding plugin from
"k8s.io/kubernetes/pkg/controller/volume/scheduling" package moved to
"k8s.io/kubernetes/pkg/scheduler/framework/plugins/volumebinding" package:
- whole file pkg/controller/volume/scheduling/scheduler_assume_cache.go
- whole file pkg/controller/volume/scheduling/scheduler_assume_cache_test.go
- whole file pkg/controller/volume/scheduling/scheduler_binder.go
- whole file pkg/controller/volume/scheduling/scheduler_binder_fake.go
- whole file pkg/controller/volume/scheduling/scheduler_binder_test.go
Package "k8s.io/kubernetes/pkg/controller/volume/scheduling/metrics" moved
to "k8s.io/kubernetes/pkg/scheduler/framework/plugins/volumebinding/metrics"
because it is only used in the VolumeBinding plugin and (e2e) tests.
More described in issue #89930 and PR #102953.
Signed-off-by: Konstantin Misyutin <konstantin.misyutin@huawei.com>
Currently, whenever agnhost/VERSION is bumped, the version in
agnhost/agnhost.go has to be bumped as well. This is also verified
on presubmit (build/dependencies.yaml).
This means that whenever we need to bump the agnhost image version,
someone has to approve the build/dependencies.yaml, which is not as
easy.
This commit removes the need for this check by automatically setting
the Version inside agnhost.go at build time, simplifying the process.
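A minimal sketch of the pattern: the variable keeps a placeholder default and is overridden at build time with -ldflags; the variable name and path are illustrative:

```go
package main

// Version is overridden at build time, e.g.:
//   go build -ldflags "-X main.Version=$(cat VERSION)" .
// so the source no longer needs a hard-coded copy of the image version.
var Version = "dev"
```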
This commit adds an e2e test for the kubelet flags `--lock-file` and
`exit-on-lock-contention`. Eventually we would like to move them to the
kubelet configuration file rather than flags.
This test is based on the premise that whenever there is a lock
contention of the lock file (e.g. /var/run/kubelet.lock), the running
kubelet must terminate and wait for the lock on the lock file to
be released before starting again.
In this test we simulate that behaviour of a file contention. The test
would try to acquire the lock on the lock file.
Success of the test is determined by the kubelet health check: it fails while
the lock is acquired by the test and passes when the lock on the lock file is
released.
Signed-off-by: Imran Pochi <imran@kinvolk.io>
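A minimal sketch of how a test could hold the lock to simulate contention, assuming flock semantics on the kubelet lock file; the helper name is illustrative:

```go
package example

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// acquireLock grabs an exclusive flock on the kubelet lock file. While the test
// holds it, a kubelet started with --lock-file=<path> and
// --exit-on-lock-contention is expected to stay down; releasing it lets the
// kubelet start again.
func acquireLock(path string) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0600)
	if err != nil {
		return nil, err
	}
	if err := unix.Flock(int(f.Fd()), unix.LOCK_EX); err != nil {
		f.Close()
		return nil, fmt.Errorf("flock %s: %w", path, err)
	}
	return f, nil
}
```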
DynamicFileCAContent and DynamicCertKeyPairContent used periodical job
to check whether the file content has changed, leading to 1 minute of
delay in worst case. This patch improves it by leveraging fsnotify
watcher. The content change will be reflected immediately.
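A minimal sketch of the watcher-based approach using fsnotify; this is not the actual DynamicFileCAContent implementation and the function name is illustrative:

```go
package example

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

// watchCertFile triggers reload as soon as the file changes, instead of
// waiting for the next periodic check.
func watchCertFile(path string, reload func() error) error {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	if err := watcher.Add(path); err != nil {
		watcher.Close()
		return err
	}
	go func() {
		defer watcher.Close()
		for {
			select {
			case ev, ok := <-watcher.Events:
				if !ok {
					return
				}
				if ev.Op&(fsnotify.Write|fsnotify.Create|fsnotify.Remove|fsnotify.Rename) != 0 {
					if err := reload(); err != nil {
						log.Printf("reload %s: %v", path, err)
					}
				}
			case err, ok := <-watcher.Errors:
				if !ok {
					return
				}
				log.Printf("watch %s: %v", path, err)
			}
		}
	}()
	return nil
}
```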
Tests "Multi-AZ Cluster Volumes" should consider only nodes that are
schedulable and *untainted* when computing AZ where to run the tests.
GetReadySchedulableNodes() already filters schedulable + untainted nodes,
no need to do it again in GetSchedulableClusterZones().
For some reason when we send them to journald, many log lines are
consistently dropped as soon as the PLEG is started.
If we log directly to file, we don't have this problem. As a bonus, if
the tests crash, the kubelet logs will always be available since they
were already written; otherwise we normally wait until the end of the
test run to collect them from journald, meaning that we often end up
with empty logs.
Removes any reference to the registry gcr.io/kubernetes-e2e-test-images in
kubernetes/kubernetes, replacing it with k8s.gcr.io/kubernetes-e2e-test-images.
In some cases, the images had to be updated since a few things have changed since
their original implementation, most notably being the fact that some of the images
have been centralized into the agnhost image.
Co-Authored-By: Claudiu Belu <cbelu@cloudbasesolutions.com>
- recover to last-known-good ConfigMap.KubeletConfigKey
~12m to run in CI, 13m locally
- non-nil last-known-good to a new non-nil last-known-good
~24m to run in CI
- recover to last-known-good ConfigMap
~12m to run in CI
- state transitions
~8m to run in CI
Including a skip method as the first line of a test does not prevent the test from failing in the BeforeEach function.
If the test is skipped because of a tag in the name, then we can prevent such odd behavior.
We now use a host local exec instead of SSH commands to simplify the
test and make the result more robust.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
This assumes that SSH via bastion works if the `KUBE_SSH_BASTION`
environment variable is set, which is the case for
`pull-kubernetes-e2e-gce-correctness`.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
1. fix command empty issue for some Windows storage tests
2. enable more windows storage tests by adding ntfs test pattern
Change-Id: Ic33be282d669a23107474a14d4368bbf95c9b459
This adds an e2e test for HPA ContainerResource metrics. The test covers two scenarios:
1. Scale up on a busy application with an idle sidecar container
2. Do not scale up on a busy sidecar with an idle application.
Signed-off-by: Vivek Singh <svivekkumar@vmware.com>
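A minimal sketch of a ContainerResource metric targeting one container's CPU utilization, assuming the autoscaling/v2 API types; the function name is illustrative:

```go
package example

import (
	autoscalingv2 "k8s.io/api/autoscaling/v2"
	corev1 "k8s.io/api/core/v1"
)

// containerCPUMetric scales on the CPU utilization of a single named container,
// so an idle sidecar in the same pod does not skew the average the HPA sees.
func containerCPUMetric(container string, targetUtilization int32) autoscalingv2.MetricSpec {
	return autoscalingv2.MetricSpec{
		Type: autoscalingv2.ContainerResourceMetricSourceType,
		ContainerResource: &autoscalingv2.ContainerResourceMetricSource{
			Name:      corev1.ResourceCPU,
			Container: container,
			Target: autoscalingv2.MetricTarget{
				Type:               autoscalingv2.UtilizationMetricType,
				AverageUtilization: &targetUtilization,
			},
		},
	}
}
```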
* Squashed commit of the following:
commit 7f774dcb54b511a3956aed0fac5c803f145e383a
Author: Jay Vyas (jayunit100) <jvyas@vmware.com>
Date: Fri Jun 18 10:58:16 2021 +0000
fix commit message
commit 0ac09650742f02004dbb227310057ea3760c4da9
Author: jay vyas <jvyas@vmware.com>
Date: Thu Jun 17 07:50:33 2021 -0400
Update test/e2e/network/netpol/kubemanager.go
Co-authored-by: Antonio Ojea <antonio.ojea.garcia@gmail.com>
commit 6a8bf0a6a2690dac56fec2bdcdce929311c513ca
Author: jay vyas <jvyas@vmware.com>
Date: Sun Jun 13 08:17:25 2021 -0400
Implement Service polling for network policy suite to remove reliance on CoreDNS when verifying network policies
Update test/e2e/network/netpol/probe.go
Co-authored-by: Antonio Ojea <antonio.ojea.garcia@gmail.com>
Add defaultNS to use service probe
commit b9c17a48327aab35a855540c2294a51137aa4a48
Author: Matthew Fenwick <mfenwick100@gmail.com>
Date: Thu May 27 07:30:59 2021 -0400
address code review comments for networkpolicy decoupling from dns
commit e23ef6ff0d189cf2ed80dbafed9881d68402cb56
Author: jay vyas <jvyas@vmware.com>
Date: Wed May 26 13:30:21 2021 -0400
NetworkPolicy decoupling from DNS
gofmt
remove old function
* model refactor
* minor
* dropped getK8sModel func
* dropped modelMap, added global model in BeforeEach and subsequent changes
Co-authored-by: Rajas Kakodkar <rajaskakodkar16@gmail.com>
Prevent Kubelet from incorrectly interpreting "not yet started" pods as "ready to terminate pods" by unifying responsibility for pod lifecycle into pod worker
Add e2e tests to cover the basic flows for the `full-pcpus-only` option:
negative flow to ensure rejection with proper error message, and
positive flow to verify the actual cpu allocation.
Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
Through Job.status.uncountedPodUIDs and a Pod finalizer
An annotation marks if a job should be tracked with new behavior
A separate work queue is used to remove finalizers from orphan pods.
Change-Id: I1862e930257a9d1f7f1b2b0a526ed15bc8c248ad
As of now, we allow PDBs to be applied to pods via
selectors, so there can be unmanaged pods (pods that
don't have backing controllers) that still have PDBs associated.
Such pods are to be logged instead of immediately throwing
a sync error. This ensures disruption controller is
not frequently updating the status subresource and thus
preventing excessive and expensive writes to etcd.
A number of race conditions exist when pods are terminated early in
their lifecycle because components in the kubelet need to know "no
running containers" or "containers can't be started from now on" but
were relying on outdated state.
Only the pod worker knows whether containers are being started for
a given pod, which is required to know when a pod is "terminated"
(no running containers, none coming). Move that responsibility and
podKiller function into the pod workers, and have everything that
was killing the pod go into the UpdatePod loop. Split syncPod into
three phases - setup, terminate containers, and cleanup pod - and
have transitions between those methods be visible to other
components. After this change, to kill a pod you tell the pod worker
to UpdatePod({UpdateType: SyncPodKill, Pod: pod}).
Several places in the kubelet were incorrect about whether they
were handling terminating (should stop running, might have
containers) or terminated (no running containers) pods. The pod worker
exposes methods that allow other loops to know when to set up or tear
down resources based on the state of the pod - these methods remove
the possibility of race conditions by ensuring a single component is
responsible for knowing each pod's allowed state and other components
simply delegate to checking whether they are in the window by UID.
Removing containers now no longer blocks final pod deletion in the
API server and are handled as background cleanup. Node shutdown
no longer marks pods as failed as they can be restarted in the
next step.
See https://docs.google.com/document/d/1Pic5TPntdJnYfIpBeZndDelM-AbS4FN9H2GTLFhoJ04/edit# for details
UserInfo contains a uid field alongside groups, username and extra.
This change makes it possible to pass a UID through as an impersonation header like you
can with Impersonate-Group, Impersonate-User and Impersonate-Extra.
This PR contains:
* Changes to impersonation.go to parse the Impersonate-Uid header and authorize uid impersonation
* Unit tests for allowed and disallowed impersonation cases
* An integration test that creates a CertificateSigningRequest using impersonation,
and ensures that the API server populates the correct impersonated spec.uid upon creation.
1. add AllocateLoadBalancerNodePorts fields in specs for validation test cases
2. update fuzzer
3. in resource quota e2e, allocate node port for loadbalancer type service and
exceed the node port quota
Signed-off-by: Hanlin Shi <shihanlin9@gmail.com>
This change updates the CSR API to add a new, optional field called
expirationSeconds. This field is a request to the signer for the
maximum duration the client wishes the cert to have. The signer is
free to ignore this request based on its own internal policy. The
signers built-in to KCM will honor this field if it is not set to a
value greater than --cluster-signing-duration. The minimum allowed
value for this field is 600 seconds (ten minutes).
This change will help enforce safer durations for certificates in
the Kube ecosystem and will help related projects such as
cert-manager with their migration to the Kube CSR API.
Future enhancements may update the Kubelet to take advantage of this
field when it is configured in a way that can tolerate shorter
certificate lifespans with regular rotation.
Signed-off-by: Monis Khan <mok@vmware.com>
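A minimal sketch of a CSR that sets the new field, assuming the certificates.k8s.io/v1 types; names and values are illustrative:

```go
package example

import (
	certificatesv1 "k8s.io/api/certificates/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newCSR requests a certificate with a bounded lifetime; the signer may still
// issue a shorter certificate, but never a longer one.
func newCSR(name string, csrPEM []byte) *certificatesv1.CertificateSigningRequest {
	expirationSeconds := int32(3600) // one hour; the minimum allowed is 600
	return &certificatesv1.CertificateSigningRequest{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: certificatesv1.CertificateSigningRequestSpec{
			Request:           csrPEM,
			SignerName:        "kubernetes.io/kube-apiserver-client",
			ExpirationSeconds: &expirationSeconds,
			Usages:            []certificatesv1.KeyUsage{certificatesv1.UsageClientAuth},
		},
	}
}
```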
Ensure resources are created in zone with schedulable
nodes. For example, if we have 4 zones with 3 zones
having worker nodes and 1 zone having master nodes (unschedulable
for workloads), we should not create resources like PV, PVC or
pods in that zone.
We're running ubernetes tests
`should only be allowed to provision PDs in zones
where nodes exist`
on gcp&gke. While the test is useful in exercising
the scenario of identifying extra zone and
creating a node in it, not every Kube
distribution uses the same approach to create a node,
further, even if there is an extra zone, we cannot
guarantee the zone to have enough quota. There can also
be other GCP specific edge cases all of which cannot be
covered within this test. So, removing the test
as agreed upon with the storage team
The data structure would wrap an embedded filesystem and the root
directory relative to which the embedded filesystem is constructed.
Signed-off-by: Nabarun Pal <pal.nabarun95@gmail.com>
We're trying to fix https://github.com/kubernetes/kubernetes/issues/75355
for a long time, and we believe the current timeout could
actually be too low (despite being "forever", which is 30s).
To validate this theory, we set the timeout to one full minute.
Also, make the logging more verbose to make the troubleshooting easier.
Signed-off-by: Francesco Romani <fromani@redhat.com>
The PR https://github.com/kubernetes/kubernetes/pull/100041 updated
node-problem-detector to v0.8.7, but unfortunately we didn't update
also the image used in the e2e_node tests.
As a result, the tests were failing like
E2eNode Suite: [sig-node] NodeProblemDetector [NodeFeature:NodeProblemDetector] [Serial] SystemLogMonitor should generate node condition and events for corresponding errors
_output/local/go/src/k8s.io/kubernetes/test/e2e_node/node_problem_detector_linux.go:301
Timed out after 60.000s.
Expected success, but got an error:
<*errors.errorString | 0xc0011f2600>: {
s: "expected total number of events was 4, actual events counted was 7\nEvents
This in turn was one of the contributing factors in making the
pull-kubernetes-node-kubelet-serial lane constantly failing.
This patch updates the image used in the tests, fixing the failure.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Some tests are checking the network connectivity using gomega.Consistently,
which will fail if any of the checks fails. This could lead to flakiness in
some scenarios in which kube-proxy was supposed to apply Policies for
Kubernetes services.
We can instead wait for the network connectivity to work first using gomega.Eventually,
after which we can check the consistency.
For manifest lists containing Windows images, it is important to also have the "os.version"
annotation set, as it is needed by the Windows nodes, so they can pull the appropriate image
from the list.
Previously, the docker manifest CLI did not have the capability to set it, so, we had to set
it ourselves in the manifest list's image JSON file. This is no longer necessary since
docker 20.10.0, which includes docker manifest annotate --os-version.
The docker installed in the image gcr.io/k8s-testimages/gcb-docker-gcloud:v20210622-762366a
satisfies this version requirement.
The CPUManager graduated to beta a while ago (k8s 1.10?)
so let's get rid of the obsolete Alpha tag on its e2e tests.
Signed-off-by: Francesco Romani <fromani@redhat.com>
- verify memory manager data returned by `GetAllocatableResources`
- verify pod container memory manager data
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
In the case of multinode clusters, the http server pod and the test cluster can
spawn on different nodes, which can be problematic for poststart / prestop hooks,
as they are executed by the kubelet itself, and the cross-node lifecycle hook might
fail (according to the Kubernetes network model, it is not mandatory for kubelet to
be able to access pods on a different node).
This commit ensures that the test pod spawns on the same node as the http server pod.
This test verifies an implementation detail in the in-tree gcepd
plugin. The behavior is not implemented in the gcepd CSI driver
and therefore the test will be obsolete after CSI migration.
Co-Authored-By: Riaan Kleinhans <riaan@ii.coop>
e2e test validates the following 3 extra endpoints
- patchAppsV1NamespacedStatefulSet
- listAppsV1StatefulSetForAllNamespaces
- deleteAppsV1CollectionNamespacedStatefulSet
Some CSI drivers can't clone a volume into other topology segment (e.g. a
cloud availability zone). The scheduler does not know about these
restrictions and schedules pods with PVCs that clone a volume mostly
randomly.
Run all volume cloning tests in the same topology segment, if such segment
is available and has at least one schedulable node.
The MetricsGrabber itself knows now whether it supports each
component. The checks inside the tests therefore are redundant at best
or worse, they are wrong: for example, on a KinD cluster the check for
"has master node registered" failed and metrics grabbing from
scheduler and controller manager were skipped unnecessarily.
The MetricsGrabber checked whether a component supported metrics
grabbing, but then tests didn't have an API to use the result of that
check. Because metrics grabbing is an optional debug feature, tests
must skip checks that depend on metrics data or, when the entire
test is about metrics data, skip the test.
This is now supported with a special error that gets wrapped and
returned by the individual Grab functions.
This can be checked by trying to retrieve log output. As in the case
of no pod found, a warning gets emitted when log retrieval fails and
metrics grabbing gets disabled.
Logging is checked instead of actual metrics retrieval because the
latter is more complex and thus more likely to fail for other reasons.
The previous approach with grabbing via a nginx proxy had some
drawbacks:
- it did not work when the pods only listened on localhost (as
configured by kubeadm) and the proxy got deployed on a different
node
- starting the proxy raced with starting the pods, causing
sporadic test failures because the proxy was not set up
properly unless it saw all pods when starting the e2e.test
- the proxy was always started, whether it is needed or not
- the proxy was left running after a test and then the next
test run triggered potentially confusing messages when
it failed to create objects for the proxy
The new approach is similar to "kubectl port-forward" + "kubectl get
--raw". It uses the port forwarding feature to establish a TCP
connection via a custom dialer, then lets client-go handle TLS and
credentials.
Somehow verifying the server certificate did not work. As this
shouldn't be a big concern for E2E testing, certificate checking gets
disabled on the client side instead of investigating this further.
The value here is that the exec plugin author can use the kubeconfig to assert
how standard input is treated with respect to the exec plugin, e.g.,
- an exec plugin author can ensure that kubectl fails if it cannot provide
standard input to an exec plugin that needs it (Always)
- an exec plugin author can ensure that an client-go process will still call an
exec plugin that prefers standard input even if standard input is not
available (IfAvailable)
Signed-off-by: Andrew Keesler <akeesler@vmware.com>
The apiserver and test suite in node e2e runs under the sshd daemon
that can limit the amount of files it can open. Set a higher limit
to address the issues.
Signed-off-by: Odin Ugedal <odin@uged.al>
on cgroup v2 the reported metric is recursive for the entire cgroup and it
includes all the sub cgroups.
Adjust the test accordingly.
Closes: https://github.com/kubernetes/kubernetes/issues/99230
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Node e2e tests exceeding the global timeout are sent SIGINT, resulting
in no artifacts or console output. This will ignore the first SIGINT,
and since all children processes are being stopped due to SIGINT, we can
clean up before exiting.
This reverts commit
c15fd76ee9. Most (all?) of the hostpath
tests and several other tests started to fail again in
gce-scale-master-correctness after re-enabling the controller. This
shows that it was not just the obsolete agent which causes scalability
problems, but also the controller.
It has to be disabled until the scalability problems are addressed.
This makes sure the testcase that cannot run locally will be skipped
instead of throwing the misleading failure message.
Signed-off-by: Dave Chen <dave.chen@arm.com>
CSI drivers need to pass special mount opts to the XFS filesystem to be able to
mount a volume + its clone or its restored snapshot on the same node. Add a
test to exhibit this behavior.
The test is optional for now, giving CSI drivers time to fix it.
These tests have been flaky for a long time, with a relatively low
rate of flakes. Nonetheless it seems better to extend the timeouts to
reduce the flakiness.
It was disabled together with the agent to avoid test failures in
gce-master-scale-correctness (https://github.com/kubernetes/kubernetes/issues/102452). That
solved the problem, but we still need to check whether the controller
alone works.
They are not needed for any of the tests and may be causing too much
overhead (see
https://github.com/kubernetes/kubernetes/issues/102452#issuecomment-854452816).
We already disabled them earlier and then re-enabled them again
because it wasn't clear how much overhead they were causing. A recent
change in how the sidecars get
deployed (https://github.com/kubernetes/kubernetes/pull/102282) seems
to have made the situation worse again. There's no logical explanation
for that yet, though.
(cherry picked from commit 0c2cee5676e64976f9e767f40c4c4750a8eeb11f)
As seen in https://github.com/kubernetes/kubernetes/issues/102452, we
currently don't have pod events for the CSI driver pods because of the
different namespace and would need them to determine whether the
driver gets evicted.
Previously, only changes of the pods were logged. Perhaps even more
interesting are events in the namespace.
The "[Feature:SCTP]" tag was needed on "should not allow access by TCP
when a policy specifies only SCTP" back when SCTP was alpha, because
it wasn't possible to create a policy that even mentioned SCTP without
enabling the feature gate. This no longer applies, and the tag was
removed from the original copy of network_policy.go, but accidentally
got left behind in the netpol/ version.
Likewise, the newly-added "should not allow access by TCP when a
policy specifies only UDP" got tagged "[Feature:UDP]", but this was
never necessary, and is inconsistent with other UDP tests anyway.
Similarly, we need "[Feature:SCTPConnectivity]" on tests that make
SCTP connections, because that functionality is not available in all
clusters, but "[Feature:UDPConnectivity]" is unnecessary and
inconsistent.
This image is the same as "gcr.io/authenticated-image-pulling/windows-nanoserver:v1"
that is used for the "should be able to pull from private registry with secret"
test on Windows.
Adding this image will allow other people to build and push their own
images to their own private registries.
Make sure to use SIGKILL so that the service is killed in a dirty way.
In case the container runtime uses "Restart=on-abnormal" in systemd, killing
with SIGTERM will not restart the service, as the kill looks intentional
and clean. This is used by cri-o by default.
We can indirectly retrieve the kube-cross version from the
`build/build-image/cross/VERSION` for the sample-apiserver. This allows
us to simplify the handling in `build/dependencies.yaml` as well as
the required approval (via `OWNERS`) if the kube-cross version changes.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Current test assumes that test pod is deleted when the test
namespace is deleted. However, namespace deletion is an asynchronous
operation. The pod may still be running and allocating hugepages
resources when the next test case creates another pod that requests
the same hugepages resources. This can cause kubelet to fail the test
pod with this kind of error:
OutOfhugepages-2Mi: Node didn't have enough resource: hugepages-2Mi
requested: 6291456, used: 6291456, capacity: 10485760
Explicitly deleting test pod should fix this issue.
Update dependencies and the test images to use pause 3.5. We also
provide a changelog entry for the new container image version.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
The e2e that create/deletes pods rapidly and verifies their status
was reporting a very long timing:
timings total=12.211347385s t=491ms run=2s execute=450402h8m25s
in a few scenarios. Add error checks that clarify when this happens
and why. Report p50/75/90/99 latencies on teardown as observed from
the test for baseline for future changes.
This change fixes the "Variable Expansion should verify that a failing subpath
expansion can be modified during the lifecycle of a container" conformance test
where the annotations are replaced with those used by the test, instead of
appending them.
This may cause undesirable side-effects with other controllers that may be
using pod annotations.
If a user specifies basic auth, then apply the same short circuit logic
that we do for bearer tokens (see comment).
Signed-off-by: Andrew Keesler <akeesler@vmware.com>
e2e test validates the following 3 extra endpoints
- replaceAppsV1NamespacedDaemonSetStatus
- readAppsV1NamespacedDaemonSetStatus
- patchAppsV1NamespacedDaemonSetStatus
e2e test validates the following 3 extra endpoints
- replaceAppsV1NamespacedReplicaSetStatus
- readAppsV1NamespacedReplicaSetStatus
- patchAppsV1NamespacedReplicaSetStatus
Remove the e2e test for ClusterStatus in the kubeadm suite.
The object was deprecated in a previous release and is no longer
written by kubeadm v1.22 in the kubeadm-config config map.
e2e test validates the following 3 endpoints
- readAppsV1NamespacedDeploymentStatus
- replaceAppsV1NamespacedDeploymentStatus
- patchAppsV1NamespacedDeploymentStatus
e2e test validates the following 3 endpoints
- patchAppsV1NamespacedStatefulSetStatus
- readAppsV1NamespacedStatefulSetStatus
- replaceAppsV1NamespacedStatefulSetStatus
- ceph deploy on ARM64 depends on "libec_jerasure_neon.so" which is not included
in the `ceph-base` package in the fedora26 distro; updated the distro to fedora33 to
fix the issue
```
sh ./mon.sh "$(hostname -i)"
/usr/lib64/ceph/erasure-code/libec_jerasure_neon.so: cannot open shared object file
```
- default pool `rbd` is not created on arm64, need to create this pool manually.
```
rbd import --image-feature layering block foo
rbd: error opening default pool 'rbd'
```
Signed-off-by: Dave Chen <dave.chen@arm.com>
Before assuming that a certain host runs an SSH server, we now test its
`SSHPort` for connectivity. This means that the test `should be able to
run crictl on the node` can be now more failure proof by checking only
hosts where SSH actually runs. Beside that, we can also test all hosts
and not only the first one.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Looking deeper into the logs there are a lot of errors like:
`script exited with error 1`
The initial reaction was that there was a problem with the download, but it
looks like the script we use to register the qemu emulators may be at
fault; let's try this alternate mechanism.
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
The existing function in podlogs is copyAllLogs from a namespace. Add a
new function to copy log from a single pod.
Change-Id: I39d139a2a57d72a0af3432626b0c02c1b43333d2
- Cleans up some of the image registry handling by
initializing values in a more consistent and clear
manner.
- Adds the Docker library registry to the list of
values that is override-able.
- Adds a few branches to logic to ensure each registry
is handled the same.
The current test for a service with endpoints only validates the API objects;
however, it doesn't validate that the implementation of Services works.
This means that a cluster without any component implementing Services
can pass Conformance.
The conformance test for ServiceAccountIssuerDiscovery is currently
configured with --in-cluster-discovery, which only supports token
validation against in-cluster endpoints. Many cloud providers provide
their own, external endpoints for OIDC discovery, and because the iss
claim in tokens will point to these endpoints, but the client in this
test only trusts the Cluster CA, it will fail to connect to the external
discovery endpoints when validating the token.
To ensure that the conformance test at least supports the scenario where
both the discovery doc endpoint and JWKS endpoint are cluster-local and
the scenario where both endpoints are cluster-external, this PR has the
test try both and requires at least one to pass.
Caveat: The test still won't support a configuration where one
endpoint is cluster-local and the other is external. We don't yet have
evidence that this is a configuration that is used in practice, so this
initial hotfix will at least fix the conformance test for the "both
external" configuration we know providers already use. Note that if one
endpoint is cluster-local, and the other is cluster-external, tokens can
still only be validated in-cluster, because both endpoints must be
accessible to Relying Parties that validate tokens.
Co-Authored-By: Riaan Kleinhans <riaan@ii.coop>
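A sketch of the "try both, require at least one to pass" logic described above; the validators map stands in for the in-cluster and external discovery/JWKS validation paths, which are hypothetical closures here:
```
import (
	"context"
	"fmt"
)

// validateIssuerDiscovery tries every discovery path and succeeds as soon as
// one of them can validate the token. Sketch only.
func validateIssuerDiscovery(ctx context.Context, token string, validators map[string]func(context.Context, string) error) error {
	var errs []error
	for name, validate := range validators {
		if err := validate(ctx, token); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", name, err))
			continue
		}
		return nil // at least one discovery path succeeded
	}
	return fmt.Errorf("token validation failed on all discovery paths: %v", errs)
}
```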
e2e test validates the following 2 extra endpoints
- listAppsV1ReplicaSetForAllNamespaces
- deleteAppsV1CollectionNamespacedReplicaSet
If no nodes have NodeExternalIP addresses, then clearly none of the
services are exposed externally, and the test should succeed.
Seen in OpenShift CI.
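A minimal sketch of that guard, assuming the node list comes from client-go (helper name is illustrative):
```
import (
	corev1 "k8s.io/api/core/v1"
)

// hasExternalIP reports whether any node advertises a NodeExternalIP address.
// If none does, no service can be exposed externally and the test can pass
// (or skip) early. Sketch only.
func hasExternalIP(nodes []corev1.Node) bool {
	for _, node := range nodes {
		for _, addr := range node.Status.Addresses {
			if addr.Type == corev1.NodeExternalIP && addr.Address != "" {
				return true
			}
		}
	}
	return false
}
```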
This reverts commit 0c2cee5676e64976f9e767f40c4c4750a8eeb11f.
The health check containers are not required for any test, but we want
to run them anyway to ensure that they cause no unexpected issues.
The purpose of the pod created by `createBalancedPodForNodes()` is to ensure
that all nodes have equal resource requests (as seen by the scheduler). This
prevents the default scheduling behavior (which attempts to balance resource requests)
from interfering with e2e tests which test other priorities/score plugins.
Because the scheduler only worries about requests, specifying `Limits` in this pod
is unnecessary. In fact, if the calculated "balancing" limit is too low, it can cause
the balancing pod to never start due to OOMKill errors, leading to flakes and failures.
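A hedged sketch of what such a balancing pod spec looks like with only Requests set; the image and helper name are placeholders, not the framework's actual createBalancedPodForNodes code:
```
import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// balancingPod builds a pod that only sets resource Requests, which is all
// the scheduler's balancing logic looks at. Sketch only.
func balancingPod(name string, cpu, memory resource.Quantity) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "pause",
				Image: "k8s.gcr.io/pause:3.5", // placeholder image
				Resources: corev1.ResourceRequirements{
					Requests: corev1.ResourceList{
						corev1.ResourceCPU:    cpu,
						corev1.ResourceMemory: memory,
					},
					// Limits intentionally omitted: a too-low limit could
					// OOMKill the balancing pod and flake the test.
				},
			}},
		},
	}
}
```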
- Test all versions to make sure each resource version is in the
mappings
- Fail when request info contains an unrecognized version. We have tests
that guarantee that all known versions are in the mappings. If we
get a version in request info that is not there, we should fail fast to
prevent inconsistent behaviour (e.g. the mappings are for some reason
not up to date); a small sketch of this check follows below.
Ensure all known versions are in mappings
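A small sketch of the fail-fast lookup described above (names are illustrative):
```
import "fmt"

// resolveVersion looks the request's version up in the mappings and fails
// fast on anything unknown instead of guessing. Sketch only.
func resolveVersion(mappings map[string]string, requestVersion string) (string, error) {
	mapped, ok := mappings[requestVersion]
	if !ok {
		return "", fmt.Errorf("unrecognized version %q: mappings may be out of date", requestVersion)
	}
	return mapped, nil
}
```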
In one mock test, the snapshotter needs permission to read
secrets. That was disabled in the RBAC files of recent releases. We
need to patch it back in during deployment.
Co-Authored-By: Riaan Kleinhans <riaan@ii.coop>
e2e test validates the following 2 extra endpoints
- listAppsV1NamespacedReplicaSet
- deleteAppsV1CollectionNamespacedReplicaSet
They are not needed for any of the tests and in practice apparently
caused enough overhead that even unrelated tests timed out. For
example, in the pull-kubernetes-e2e-kind test, 43 out of 5771 tests
failed, including tests from sig-node, sig-cli, sig-api-machinery,
sig-network.
Mirroring the various YAML files by hand is tedious. The new
update-hostpath.sh does all the necessary steps automatically.
The result is now a bit more consistent with the upstream repos in the
sense that the original file names and paths for the RBAC YAML files
are used.
The csi-hostpath-testing.yaml is included for the sake of
completeness, but not used during E2E testing.
The new hostpath driver release is v1.6.2, which adds the
external-health-monitor for the first time.
We were bypassing the scheduler by setting pod.Spec.NodeName directly;
however, once we switched to using the node selector, the no_snat
e2e test started to fail because it was trying to schedule pods on
nodes with taints, hence failing the test (a sketch of scheduler-friendly node pinning follows the quoted comment below).
based on this comment in
ea07644522/test/e2e/framework/pod/node_selection.go (L96-L101)
// pod.Spec.NodeName should not be set directly because
// it will bypass the scheduler, potentially causing
// kubelet to Fail the pod immediately if it's out of
// resources. Instead, we want the pod to remain
// pending in the scheduler until the node has resources
// freed up.
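A hedged sketch of pinning a pod to a node through the scheduler (required node affinity on the node's name) rather than setting pod.Spec.NodeName; the helper is illustrative, not the framework's actual node_selection.go code:
```
import (
	corev1 "k8s.io/api/core/v1"
)

// pinToNode steers the pod to a specific node via required node affinity so
// it still goes through the scheduler instead of bypassing it with
// pod.Spec.NodeName. Sketch only.
func pinToNode(spec *corev1.PodSpec, nodeName string) {
	spec.Affinity = &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchFields: []corev1.NodeSelectorRequirement{{
						Key:      "metadata.name",
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{nodeName},
					}},
				}},
			},
		},
	}
}
```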
* Use deep copies in `PrepareForUpdate()`
* Preserve select metadata from new pod
* Use patch to add ephemeral containers in `kubectl debug`
* Distinguish between pod vs /ephemeralcontainers NotFound
This replaces the check of mount propagation to/from the host OS mount
namespace to a similar check about the mount namespace where kubelet is
running (which may or may not be the same mount namespace as the host
OS).
This addresses issue #100259
This changes the `/ephemeralcontainers` subresource of `/pods` to use
the `Pod` kind rather than `EphemeralContainers`.
When designing this API initially it seemed preferable to create a new
kind containing only the pod's ephemeral containers, similar to how
binding and scaling work.
It later became clear that this made admission control more difficult
because the controller wouldn't be presented with the entire Pod, so we
updated this to operate on the entire Pod, similar to how `/status`
works.
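A hedged sketch of what using the subresource looks like after this change, assuming a client-go version whose Pods client exposes UpdateEphemeralContainers taking the full Pod (image and container name are placeholders):
```
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// addDebugContainer appends an ephemeral container to the full Pod object and
// writes it back through the /ephemeralcontainers subresource. Sketch only.
func addDebugContainer(ctx context.Context, c kubernetes.Interface, pod *corev1.Pod) (*corev1.Pod, error) {
	pod.Spec.EphemeralContainers = append(pod.Spec.EphemeralContainers, corev1.EphemeralContainer{
		EphemeralContainerCommon: corev1.EphemeralContainerCommon{
			Name:  "debugger",
			Image: "busybox", // placeholder image
		},
	})
	return c.CoreV1().Pods(pod.Namespace).UpdateEphemeralContainers(ctx, pod.Name, pod, metav1.UpdateOptions{})
}
```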
To run the tests in a single-node cluster, create two pods each consuming 2/5 of the extended resource instead of one consuming 2/3.
With the low-priority pod consuming only 2/5 of the extended resource, a high-priority pod consuming another 2/5 can still be
scheduled even when there is only a single node. Once the preemptor pod, which also requests 2/5, gets scheduled, the three
requests together exceed the node's capacity, so only the low-priority pod gets preempted while the high-priority pod stays untouched.
The tests with pods using hostNetwork need to bind host ports for the
test. Since they use hostNetwork, the available ports are limited; hence, if
more than one test runs in parallel, one is going to fail because it will not
be able to get the port.
Currently, the only image left in gcr.io/kubernetes-e2e-test-images is the
cuda-vector-add:1.0 image.
According to 8408188cdf, the 1.0 image was based on CUDA 8.0,
while the 2.0 version is based on CUDA 10.0. We can simply rebuild the 1.0 image based on
the CUDA 8.0 image and then promote the new image.
Added an ALIAS file, which specifies what the image name should be, similarly to how we build
multiple versions of nginx and httpd.
Note that the image CMD was changed from "./vectorAdd" to "nvidia-smi && ./vectorAdd" in 2.0.
The test verifies a specific feature for which GPUs are required and thus cannot
be run in most testing environments. We should exclude this test from most test jobs.
We'll be doing this by adding the [Feature:GPUDevicePlugin] tag (which is also being
used by test/e2e/scheduling/nvidia-gpus.go), and then add it to the ginkgo skip regex.
The pause image should not run sleep commands, as this causes pods to fail
to start correctly.
See details in issue #100047. We discovered some unexpected behavior in
certain cases, such as new pods failing to start correctly, which will be investigated further.
Change-Id: I9761bbefa694f6fe51a6f1e7561fa7e566ce4d8f
We can support topology detection for all drivers with a single
topology key by allowing different nodes to have the same topology
value. This makes the test a bit more generic and its configuration
simpler.
Drivers need to opt into the new test. Depending on how the driver
describes its behavior, the check can be more specific. Currently it
distinguishes between getting any kind of information about the
storage class (i.e. assuming that capacity is not exhausted) and
getting one object per node (for local storage). Discovering nodes
only works for CSI drivers.
The immediate usage of this new test is for csi-driver-host-path with
the deployment for distributed provisioning and storage capacity
tracking. Periodic kubernetes-csi Prow and pre-merge jobs can run this
test.
The alternative would have been to write a test that manages the
deployment of the csi-driver-host-path driver itself, i.e. use the E2E
manifests. But that would have implied duplicating the
deployments (in-tree and in csi-driver-host-path) and then changing
kubernetes-csi Prow jobs to somehow run with the in-tree driver definition
with newer components, something that currently isn't done. The test
then also wouldn't be applicable to out-of-tree driver deployments.
Yet another alternative would be to create a separate E2E test suite
either in csi-driver-host-path or external-provisioner. The advantage
of such an approach is that the test can be written exactly for the
expected behavior of a deployment and thus be more precise than the
generic version of the test in k/k. But this again wouldn't be
reusable for other drivers and would also be a lot of work to set up, as no such
E2E test suite currently exists.
Removing myself from OWNERS as I haven't had the bandwidth to go over
these reviews for a long time.
This should have been part of #99718, which has already been merged.
Previously the code used to delete pods serially.
In this patch we factor out code to do that in parallel,
using goroutines.
This shaves some time off the e2e tm test run, with no intended
changes in behaviour.
Signed-off-by: Francesco Romani <fromani@redhat.com>
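A minimal sketch of the fan-out pattern, using a WaitGroup to delete pods concurrently (names and client wiring are illustrative, not the actual tm test code):
```
import (
	"context"
	"sync"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deletePodsInParallel fires one goroutine per pod and waits for all
// deletions instead of deleting serially. Sketch only.
func deletePodsInParallel(ctx context.Context, c kubernetes.Interface, ns string, podNames []string) {
	var wg sync.WaitGroup
	for _, name := range podNames {
		wg.Add(1)
		go func(name string) { // pass the loop variable explicitly
			defer wg.Done()
			// errors are ignored in this sketch; the real test should check them
			_ = c.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{})
		}(name)
	}
	wg.Wait()
}
```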
The image "e2eteam/powershell-helper:6.2.7-linux-cache" is a Linux image. Because we're running "docker buildx build --platform windows/amd64", docker buildx will consider it as a Windows image unless we explicitly specify otherwise. If the image's platform is not correctly identified, we can run into problems when trying to build the image.
We are already doing something similar with the windows-servercore-cache image.
These tests were not performing evictions because:
- Test pods were not on the test node (fixed with spec.nodeName)
- The test pods are unmanaged and require the --force flag to be evicted
(added the flag)
- The test node does not run pods, which prevents pod deletion from
finalizing (worked around with --skip-wait-for-delete-timeout)
We can cache the powershell-helper image's results into a scratch Linux image using
docker buildx. This will allow us to spend less time pulling the data we need from the
powershell-helper image when we need it.
Additionally, docker buildx might have some issues with cross-registry images, so this
will allow us to circumvent them.
While adding annotations to the namespace, using the Update API may result in
conflicts such as "the object has been modified; please apply your changes to the
latest version and try again". Use the Patch API to avoid this.
Signed-off-by: Chaitanya Bandi <kbandi@cs.stonybrook.edu>
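A hedged sketch of the Patch-based approach, sending only the annotation delta as a strategic merge patch so concurrent writers don't conflict (annotation key/value and the function name are placeholders):
```
import (
	"context"
	"encoding/json"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// patchNamespaceAnnotation patches only the annotation we care about instead
// of doing a read-modify-write Update, which can hit "the object has been
// modified" conflicts. Sketch only.
func patchNamespaceAnnotation(ctx context.Context, c kubernetes.Interface, ns, key, value string) error {
	payload, err := json.Marshal(map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]string{key: value},
		},
	})
	if err != nil {
		return err
	}
	_, err = c.CoreV1().Namespaces().Patch(ctx, ns, types.StrategicMergePatchType, payload, metav1.PatchOptions{})
	return err
}
```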