We now use a host local exec instead of SSH commands to simplify the
test and make the result more robust.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
This assumes that SSH via bastion works if the `KUBE_SSH_BASTION`
environment variable is set, which is the case for
`pull-kubernetes-e2e-gce-correctness`.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
1. Fix the empty command issue for some Windows storage tests
2. Enable more Windows storage tests by adding an ntfs test pattern
Change-Id: Ic33be282d669a23107474a14d4368bbf95c9b459
This adds an e2e test for HPA ContainerResource metrics, covering two scenarios (an illustrative HPA spec is sketched below):
1. Scale up on a busy application with an idle sidecar container
2. Do not scale up on a busy sidecar with an idle application.
Signed-off-by: Vivek Singh <svivekkumar@vmware.com>
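For context, a minimal sketch (not the test's actual code) of the kind of HPA object these scenarios rely on, using the autoscaling/v2beta2 ContainerResource metric source; the workload name, container name, and utilization target are illustrative assumptions:
```go
package example

import (
	autoscalingv2 "k8s.io/api/autoscaling/v2beta2"
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// containerResourceHPA scales on the CPU utilization of one container ("app")
// instead of the whole pod, so an idle sidecar does not mask a busy app.
func containerResourceHPA() *autoscalingv2.HorizontalPodAutoscaler {
	minReplicas := int32(1)
	target := int32(50) // illustrative utilization target
	return &autoscalingv2.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "app-hpa"},
		Spec: autoscalingv2.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscalingv2.CrossVersionObjectReference{
				APIVersion: "apps/v1",
				Kind:       "Deployment",
				Name:       "app", // illustrative target workload
			},
			MinReplicas: &minReplicas,
			MaxReplicas: 5,
			Metrics: []autoscalingv2.MetricSpec{{
				Type: autoscalingv2.ContainerResourceMetricSourceType,
				ContainerResource: &autoscalingv2.ContainerResourceMetricSource{
					Name:      v1.ResourceCPU,
					Container: "app", // only this container's usage drives scaling
					Target: autoscalingv2.MetricTarget{
						Type:               autoscalingv2.UtilizationMetricType,
						AverageUtilization: &target,
					},
				},
			}},
		},
	}
}
```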
* Squashed commit of the following:
commit 7f774dcb54b511a3956aed0fac5c803f145e383a
Author: Jay Vyas (jayunit100) <jvyas@vmware.com>
Date: Fri Jun 18 10:58:16 2021 +0000
fix commit message
commit 0ac09650742f02004dbb227310057ea3760c4da9
Author: jay vyas <jvyas@vmware.com>
Date: Thu Jun 17 07:50:33 2021 -0400
Update test/e2e/network/netpol/kubemanager.go
Co-authored-by: Antonio Ojea <antonio.ojea.garcia@gmail.com>
commit 6a8bf0a6a2690dac56fec2bdcdce929311c513ca
Author: jay vyas <jvyas@vmware.com>
Date: Sun Jun 13 08:17:25 2021 -0400
Implement Service polling for network policy suite to remove reliance on CoreDNS when verifying network policies
Update test/e2e/network/netpol/probe.go
Co-authored-by: Antonio Ojea <antonio.ojea.garcia@gmail.com>
Add defaultNS to use service probe
commit b9c17a48327aab35a855540c2294a51137aa4a48
Author: Matthew Fenwick <mfenwick100@gmail.com>
Date: Thu May 27 07:30:59 2021 -0400
address code review comments for networkpolicy decoupling from dns
commit e23ef6ff0d189cf2ed80dbafed9881d68402cb56
Author: jay vyas <jvyas@vmware.com>
Date: Wed May 26 13:30:21 2021 -0400
NetworkPolicy decoupling from DNS
gofmt
remove old function
* model refactor
* minor
* dropped getK8sModel func
* dropped modelMap, added global model in BeforeEach and subsequent changes
Co-authored-by: Rajas Kakodkar <rajaskakodkar16@gmail.com>
Prevent the kubelet from incorrectly interpreting "not yet started" pods as "ready to terminate" pods by unifying responsibility for the pod lifecycle in the pod worker
Add e2e tests to cover the basic flows for the `full-pcpus-only` option:
a negative flow to ensure rejection with a proper error message, and
a positive flow to verify the actual CPU allocation.
Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
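For reference, a hedged sketch of the kubelet configuration shape these tests exercise, assuming the v1beta1 KubeletConfiguration carries the cpuManagerPolicyOptions field introduced for this feature; the reserved CPU value is illustrative:
```go
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kubeletconfig "k8s.io/kubelet/config/v1beta1"
)

// fullPCPUsOnlyConfig enables the static CPU manager policy with the
// full-pcpus-only option, so exclusive CPUs are only handed out as whole
// physical cores.
func fullPCPUsOnlyConfig() *kubeletconfig.KubeletConfiguration {
	return &kubeletconfig.KubeletConfiguration{
		TypeMeta: metav1.TypeMeta{
			APIVersion: "kubelet.config.k8s.io/v1beta1",
			Kind:       "KubeletConfiguration",
		},
		CPUManagerPolicy: "static",
		CPUManagerPolicyOptions: map[string]string{
			"full-pcpus-only": "true",
		},
		ReservedSystemCPUs: "0", // illustrative: keep CPU 0 for system daemons
	}
}
```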
Through Job.status.uncountedPodUIDs and a Pod finalizer
An annotation marks whether a Job should be tracked with the new behavior
A separate work queue is used to remove finalizers from orphan pods.
Change-Id: I1862e930257a9d1f7f1b2b0a526ed15bc8c248ad
As of now, we allow PDBs to be applied to pods via
selectors, so there can be unmanaged pods (pods that
don't have backing controllers) that still have PDBs associated.
Such pods are now logged instead of immediately throwing
a sync error. This ensures the disruption controller is
not frequently updating the status subresource, thus
preventing excessive and expensive writes to etcd.
A number of race conditions exist when pods are terminated early in
their lifecycle because components in the kubelet need to know "no
running containers" or "containers can't be started from now on" but
were relying on outdated state.
Only the pod worker knows whether containers are being started for
a given pod, which is required to know when a pod is "terminated"
(no running containers, none coming). Move that responsibility and
podKiller function into the pod workers, and have everything that
was killing the pod go into the UpdatePod loop. Split syncPod into
three phases - setup, terminate containers, and cleanup pod - and
have transitions between those methods be visible to other
components. After this change, to kill a pod you tell the pod worker
to UpdatePod({UpdateType: SyncPodKill, Pod: pod}).
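An illustrative-only sketch of that flow follows; the types are hypothetical stand-ins rather than the kubelet's real signatures, but they mirror the idea that the pod worker is the single entry point and that a terminating pod never re-enters setup:
```go
package example

import "fmt"

type SyncPodType int

const (
	SyncPodSync SyncPodType = iota
	SyncPodKill
)

type Pod struct{ UID, Name string }

type UpdatePodOptions struct {
	UpdateType SyncPodType
	Pod        *Pod
}

// PodWorkers is a hypothetical stand-in for the kubelet's pod workers.
type PodWorkers struct {
	terminatingByUID map[string]bool // pods for which no new containers may start
}

// UpdatePod is the single entry point for pod lifecycle changes.
func (w *PodWorkers) UpdatePod(opts UpdatePodOptions) {
	if opts.UpdateType == SyncPodKill {
		w.terminatingByUID[opts.Pod.UID] = true
		fmt.Printf("terminating %s: stop containers, then clean up\n", opts.Pod.Name)
		return
	}
	if w.terminatingByUID[opts.Pod.UID] {
		// A terminating pod never re-enters setup; this closes the race where
		// a "not yet started" pod was treated as ready to terminate.
		return
	}
	fmt.Printf("syncing %s: set up sandbox and start containers\n", opts.Pod.Name)
}
```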
Several places in the kubelet were incorrect about whether they
were handling terminating (should stop running, might have
containers) or terminated (no running containers) pods. The pod worker
exposes methods that allow other loops to know when to set up or tear
down resources based on the state of the pod - these methods remove
the possibility of race conditions by ensuring a single component is
responsible for knowing each pod's allowed state and other components
simply delegate to checking whether they are in the window by UID.
Removing containers no longer blocks final pod deletion in the
API server and is handled as background cleanup. Node shutdown
no longer marks pods as failed, as they can be restarted in the
next step.
See https://docs.google.com/document/d/1Pic5TPntdJnYfIpBeZndDelM-AbS4FN9H2GTLFhoJ04/edit# for details
UserInfo contains a uid field alongside groups, username and extra.
This change makes it possible to pass a UID through as an impersonation header like you
can with Impersonate-Group, Impersonate-User and Impersonate-Extra.
This PR contains:
* Changes to impersonation.go to parse the Impersonate-Uid header and authorize uid impersonation
* Unit tests for allowed and disallowed impersonation cases
* An integration test that creates a CertificateSigningRequest using impersonation,
and ensures that the API server populates the correct impersonated spec.uid upon creation.
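A minimal client-go sketch of what this enables, assuming a client-go release that carries the new UID field in rest.ImpersonationConfig (which maps to the Impersonate-Uid header); the user name, UID, and group are placeholders:
```go
package example

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// impersonatingClient builds a clientset that sends Impersonate-User,
// Impersonate-Uid, and Impersonate-Group headers on every request.
func impersonatingClient(kubeconfigPath string) (*kubernetes.Clientset, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfigPath)
	if err != nil {
		return nil, err
	}
	cfg.Impersonate = rest.ImpersonationConfig{
		UserName: "jane.doe",                             // Impersonate-User
		UID:      "06f6ce97-e2c5-4ab8-7ba5-7654dd08d52b", // Impersonate-Uid
		Groups:   []string{"developers"},                 // Impersonate-Group
	}
	return kubernetes.NewForConfig(cfg)
}
```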
1. Add the AllocateLoadBalancerNodePorts field in specs for validation test cases
2. Update the fuzzer
3. In the resource quota e2e, allocate node ports for a LoadBalancer-type Service and
exceed the node port quota (an illustrative Service spec is sketched below)
Signed-off-by: Hanlin Shi <shihanlin9@gmail.com>
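The Service spec sketch referenced above, with illustrative name, selector, and port:
```go
package example

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// loadBalancerWithNodePorts builds a LoadBalancer Service that explicitly
// requests node port allocation, so it is charged against the node port quota.
func loadBalancerWithNodePorts() *v1.Service {
	allocate := true
	return &v1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "test-lb"},
		Spec: v1.ServiceSpec{
			Type:                          v1.ServiceTypeLoadBalancer,
			AllocateLoadBalancerNodePorts: &allocate,
			Selector:                      map[string]string{"app": "test"},
			Ports:                         []v1.ServicePort{{Port: 80}},
		},
	}
}
```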
This change updates the CSR API to add a new, optional field called
expirationSeconds. This field is a request to the signer for the
maximum duration the client wishes the cert to have. The signer is
free to ignore this request based on its own internal policy. The
signers built-in to KCM will honor this field if it is not set to a
value greater than --cluster-signing-duration. The minimum allowed
value for this field is 600 seconds (ten minutes).
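A hedged sketch of a CSR that uses the new field via the certificates/v1 types; the object name, signer, usages, and the raw request bytes are illustrative choices:
```go
package example

import (
	certificatesv1 "k8s.io/api/certificates/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// shortLivedCSR asks the signer for a roughly one-hour certificate; the
// signer remains free to ignore the request per its own policy.
func shortLivedCSR(csrPEM []byte) *certificatesv1.CertificateSigningRequest {
	expirationSeconds := int32(3600) // must be >= 600 (ten minutes)
	return &certificatesv1.CertificateSigningRequest{
		ObjectMeta: metav1.ObjectMeta{Name: "short-lived-client-cert"},
		Spec: certificatesv1.CertificateSigningRequestSpec{
			Request:           csrPEM,
			SignerName:        certificatesv1.KubeAPIServerClientSignerName,
			ExpirationSeconds: &expirationSeconds,
			Usages:            []certificatesv1.KeyUsage{certificatesv1.UsageClientAuth},
		},
	}
}
```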
This change will help enforce safer durations for certificates in
the Kube ecosystem and will help related projects such as
cert-manager with their migration to the Kube CSR API.
Future enhancements may update the Kubelet to take advantage of this
field when it is configured in a way that can tolerate shorter
certificate lifespans with regular rotation.
Signed-off-by: Monis Khan <mok@vmware.com>
Ensure resources are created in zones with schedulable
nodes. For example, if we have 4 zones with 3 zones
having worker nodes and 1 zone having only master nodes (unschedulable
for workloads), we should not create resources like PVs, PVCs or
pods in that zone.
We're running the ubernetes test
`should only be allowed to provision PDs in zones where nodes exist`
on GCP and GKE. While the test is useful in exercising
the scenario of identifying an extra zone and
creating a node in it, not every Kubernetes
distribution uses the same approach to create a node.
Furthermore, even if there is an extra zone, we cannot
guarantee that the zone has enough quota. There can also
be other GCP-specific edge cases, not all of which can be
covered within this test. So we are removing the test,
as agreed upon with the storage team.
The data structure would wrap an embedded filesystem and the root
directory relative to which the embedded filesystem is constructed.
Signed-off-by: Nabarun Pal <pal.nabarun95@gmail.com>
We've been trying to fix https://github.com/kubernetes/kubernetes/issues/75355
for a long time, and we believe the current timeout could
actually be too low (despite being "forever", which is 30s).
To validate this theory, we set the timeout to one full minute.
Also, make the logging more verbose to make troubleshooting easier.
Signed-off-by: Francesco Romani <fromani@redhat.com>
The PR https://github.com/kubernetes/kubernetes/pull/100041 updated
node-problem-detector to v0.8.7, but unfortunately we didn't also
update the image used in the e2e_node tests.
As a result, the tests were failing like:
E2eNode Suite: [sig-node] NodeProblemDetector [NodeFeature:NodeProblemDetector] [Serial] SystemLogMonitor should generate node condition and events for corresponding errors
_output/local/go/src/k8s.io/kubernetes/test/e2e_node/node_problem_detector_linux.go:301
Timed out after 60.000s.
Expected success, but got an error:
<*errors.errorString | 0xc0011f2600>: {
s: "expected total number of events was 4, actual events counted was 7\nEvents
This in turn was one of the contributing factors making the
pull-kubernetes-node-kubelet-serial lane constantly fail.
This patch updates the image used in the tests, fixing the failure.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Some tests check network connectivity using gomega.Consistently,
which fails if any single check fails. This could lead to flakiness in
some scenarios in which kube-proxy was supposed to apply policies for
Kubernetes services.
We can instead first wait for network connectivity to work using gomega.Eventually,
and only then check its consistency (a sketch of the pattern follows below).
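The sketch of that pattern, assuming a hypothetical checkConnectivity helper that probes the Service and returns an error on failure; timeouts and poll intervals are illustrative:
```go
package example

import (
	"time"

	"github.com/onsi/gomega"
)

// assertStableConnectivity first waits for connectivity to be established
// (kube-proxy may still be programming rules), then requires it to stay up.
func assertStableConnectivity(checkConnectivity func() error) {
	gomega.Eventually(checkConnectivity, 2*time.Minute, 5*time.Second).Should(gomega.Succeed())
	gomega.Consistently(checkConnectivity, 30*time.Second, 5*time.Second).Should(gomega.Succeed())
}
```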
For manifest lists containing Windows images, it is important to also have the "os.version"
annotation set, as it is needed by the Windows nodes, so they can pull the appropriate image
from the list.
Previously, the docker manifest CLI did not have the capability to set
it, so we had to set it ourselves in the manifest list's image JSON file. This is
no longer necessary since docker 20.10.0, which includes `docker manifest annotate --os-version`.
The docker installed in the image gcr.io/k8s-testimages/gcb-docker-gcloud:v20210622-762366a
satisfies this version requirement.
The CPUManager graduated to beta a while ago (k8s 1.10?)
so let's get rid of the obsolete Alpha tag on its e2e tests.
Signed-off-by: Francesco Romani <fromani@redhat.com>
- verify memory manager data returned by `GetAllocatableResources`
- verify pod container memory manager data
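For reference, a hedged sketch of querying GetAllocatableResources over the kubelet's pod resources socket, which is the API these checks rely on; the socket path is the conventional default and the dial options are illustrative:
```go
package example

import (
	"context"
	"time"

	"google.golang.org/grpc"
	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// allocatableResources asks the kubelet which resources (memory blocks,
// exclusive CPUs, devices) are allocatable for workloads.
func allocatableResources(ctx context.Context) (*podresourcesv1.AllocatableResourcesResponse, error) {
	conn, err := grpc.Dial(
		"unix:///var/lib/kubelet/pod-resources/kubelet.sock",
		grpc.WithInsecure(), // local unix socket; no TLS
	)
	if err != nil {
		return nil, err
	}
	defer conn.Close()

	client := podresourcesv1.NewPodResourcesListerClient(conn)
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	return client.GetAllocatableResources(ctx, &podresourcesv1.AllocatableResourcesRequest{})
}
```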
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
In the case of multinode clusters, the http server pod and the test pod can
be scheduled on different nodes, which can be problematic for poststart / prestop hooks,
as they are executed by the kubelet itself, and the cross-node lifecycle hook might
fail (according to the Kubernetes network model, it is not mandatory for kubelet to
be able to access pods on a different node).
This commit ensures that the test pod spawns on the same node as the http server pod.
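A minimal sketch of the idea; the helper name is illustrative and simply pins the test pod via spec.nodeName:
```go
package example

import v1 "k8s.io/api/core/v1"

// pinToSameNode schedules testPod onto the node where serverPod landed by
// setting spec.nodeName, bypassing the scheduler's node choice.
func pinToSameNode(testPod, serverPod *v1.Pod) {
	testPod.Spec.NodeName = serverPod.Spec.NodeName
}
```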
This test verifies an implementation detail in the in-tree gcepd
plugin. The behavior is not implemented in the gcepd CSI driver
and therefore the test will be obsolete after CSI migration.
Co-Authored-By: Riaan Kleinhans <riaan@ii.coop>
This e2e test validates the following 3 extra endpoints (see the client-go sketch after this list):
- patchAppsV1NamespacedStatefulSet
- listAppsV1StatefulSetForAllNamespaces
- deleteAppsV1CollectionNamespacedStatefulSet
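The client-go sketch referenced above; namespace, object name, patch payload, and label selector are illustrative:
```go
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// exerciseStatefulSetEndpoints hits the three endpoints listed above.
func exerciseStatefulSetEndpoints(ctx context.Context, c kubernetes.Interface) error {
	// patchAppsV1NamespacedStatefulSet
	patch := []byte(`{"metadata":{"labels":{"e2e":"patched"}}}`)
	if _, err := c.AppsV1().StatefulSets("default").Patch(
		ctx, "test-ss", types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
		return err
	}
	// listAppsV1StatefulSetForAllNamespaces (empty namespace = all namespaces)
	if _, err := c.AppsV1().StatefulSets("").List(ctx, metav1.ListOptions{}); err != nil {
		return err
	}
	// deleteAppsV1CollectionNamespacedStatefulSet
	return c.AppsV1().StatefulSets("default").DeleteCollection(
		ctx, metav1.DeleteOptions{}, metav1.ListOptions{LabelSelector: "e2e=patched"})
}
```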
Some CSI drivers can't clone a volume into another topology segment (e.g. a
cloud availability zone). The scheduler does not know about these
restrictions and schedules pods with PVCs that clone a volume more or
less randomly.
Run all volume cloning tests in the same topology segment, if such a segment
is available and has at least one schedulable node.
The MetricsGrabber itself now knows whether it supports each
component. The checks inside the tests are therefore redundant at best
or, worse, wrong: for example, on a KinD cluster the check for
"has master node registered" failed and metrics grabbing from the
scheduler and controller manager was skipped unnecessarily.
The MetricsGrabber checked whether a component supported metrics
grabbing, but then tests didn't have an API to use the result of that
check. Because metrics grabbing is an optional debug feature, tests
must skip checks that depend on metrics data or, when the entire
test is about metrics data, skip the test.
This is now supported with a special error that gets wrapped and
returned by the individual Grab functions.
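A hypothetical sketch of that pattern; the sentinel error name and helper functions below are illustrative stand-ins, not necessarily the framework's real identifiers:
```go
package example

import (
	"errors"
	"fmt"
)

// errMetricsGrabbingDisabled is an illustrative sentinel; the framework's
// real error type and name may differ.
var errMetricsGrabbingDisabled = errors.New("metrics grabbing disabled")

// grabFromScheduler stands in for a Grab function that wraps the sentinel
// when the component cannot be scraped in this cluster.
func grabFromScheduler(supported bool) (map[string]float64, error) {
	if !supported {
		return nil, fmt.Errorf("scheduler: %w", errMetricsGrabbingDisabled)
	}
	return map[string]float64{"schedule_attempts_total": 42}, nil
}

// testSchedulerMetrics shows how a test would skip instead of fail when
// grabbing is unsupported; skip stands in for the framework's skip helper.
func testSchedulerMetrics(skip func(reason string)) {
	metrics, err := grabFromScheduler(false)
	if errors.Is(err, errMetricsGrabbingDisabled) {
		skip("metrics grabbing is not supported in this cluster")
		return
	}
	if err != nil {
		panic(err)
	}
	_ = metrics
}
```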
This can be checked by trying to retrieve log output. As in the case
of no pod found, a warning gets emitted when log retrieval fails and
metrics grabbing gets disabled.
Logging is checked instead of actual metrics retrieval because the
latter is more complex and thus more likely to fail for other reasons.
The previous approach with grabbing via a nginx proxy had some
drawbacks:
- it did not work when the pods only listened on localhost (as
configured by kubeadm) and the proxy got deployed on a different
node
- starting the proxy raced with starting the pods, causing
sporadic test failures because the proxy was not set up
properly unless it saw all pods when starting the e2e.test
- the proxy was always started, whether it is needed or not
- the proxy was left running after a test and then the next
test run triggered potentially confusing messages when
it failed to create objects for the proxy
The new approach is similar to "kubectl port-forward" + "kubectl get
--raw". It uses the port forwarding feature to establish a TCP
connection via a custom dialer, then lets client-go handle TLS and
credentials.
Somehow verifying the server certificate did not work. As this
shouldn't be a big concern for E2E testing, certificate checking gets
disabled on the client side instead of investigating this further.
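A minimal sketch of the idea under stated assumptions: portForwardDial is a hypothetical stand-in for the real port-forwarding dialer, and the sketch only shows how client-go is pointed at it and how certificate checking is skipped:
```go
package example

import (
	"context"
	"net"

	"k8s.io/client-go/rest"
)

// metricsConfig returns a copy of the base rest.Config that tunnels every
// connection through the given port-forward dialer and skips verification of
// the component's serving certificate. portForwardDial is expected to open a
// connection to the component's pod (e.g. kube-scheduler) via the API
// server's port-forward machinery.
func metricsConfig(base *rest.Config, portForwardDial func(ctx context.Context, network, addr string) (net.Conn, error)) *rest.Config {
	cfg := rest.CopyConfig(base)
	cfg.Dial = portForwardDial
	// Replacing the TLS config also drops any CA data, which is required for
	// the Insecure flag to be accepted; client-go still sends credentials.
	cfg.TLSClientConfig = rest.TLSClientConfig{Insecure: true}
	return cfg
}
```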
The value here is that the exec plugin author can use the kubeconfig to assert
how standard input is treated with respect to the exec plugin (sketched below), e.g.,
- an exec plugin author can ensure that kubectl fails if it cannot provide
standard input to an exec plugin that needs it (Always)
- an exec plugin author can ensure that a client-go process will still call an
exec plugin that prefers standard input even if standard input is not
available (IfAvailable)
Signed-off-by: Andrew Keesler <akeesler@vmware.com>
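A hedged sketch of the corresponding exec stanza expressed through client-go's kubeconfig API types, assuming a client-go release that includes the interactiveMode field added here; the plugin command is illustrative:
```go
package example

import clientcmdapi "k8s.io/client-go/tools/clientcmd/api"

// execAuthInfo builds an AuthInfo whose exec plugin declares how it wants
// standard input handled; the command and args are illustrative.
func execAuthInfo() *clientcmdapi.AuthInfo {
	return &clientcmdapi.AuthInfo{
		Exec: &clientcmdapi.ExecConfig{
			APIVersion: "client.authentication.k8s.io/v1beta1",
			Command:    "example-login-plugin",
			Args:       []string{"get-token"},
			// Always: fail if stdin cannot be provided to the plugin.
			// IfAvailable: call the plugin even when stdin is missing.
			// Never: the plugin never needs stdin.
			InteractiveMode: clientcmdapi.IfAvailableExecInteractiveMode,
		},
	}
}
```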
The apiserver and test suite in node e2e run under the sshd daemon,
which can limit the number of files they can open. Set a higher limit
to address the issue.
Signed-off-by: Odin Ugedal <odin@uged.al>
On cgroup v2 the reported metric is recursive for the entire cgroup and
includes all the sub-cgroups.
Adjust the test accordingly.
Closes: https://github.com/kubernetes/kubernetes/issues/99230
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Node e2e tests exceeding the global timeout are sent SIGINT, resulting
in no artifacts or console output. This change ignores the first SIGINT,
and since all child processes are being stopped by that SIGINT, we can
clean up before exiting.
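An illustrative sketch of the signal-handling idea (not the runner's actual code):
```go
package example

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// handleGlobalTimeout ignores the first SIGINT so artifacts can still be
// collected (the same signal is already stopping the child processes), and
// exits immediately on a second SIGINT.
func handleGlobalTimeout(collectArtifacts func()) {
	sigCh := make(chan os.Signal, 2)
	signal.Notify(sigCh, syscall.SIGINT)
	go func() {
		<-sigCh // first SIGINT: the global timeout fired
		fmt.Fprintln(os.Stderr, "SIGINT received; saving artifacts before exiting")
		go func() {
			<-sigCh // second SIGINT: bail out right away
			os.Exit(2)
		}()
		collectArtifacts()
		os.Exit(1)
	}()
}
```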
This reverts commit
c15fd76ee9. Most (all?) of the hostpath
tests and several other tests started to fail again in
gce-scale-master-correctness after re-enabling the controller. This
shows that it was not just the obsolete agent which caused scalability
problems, but also the controller.
It has to be disabled until the scalability problems are addressed.
This makes sure that test cases which cannot run locally are skipped
instead of throwing a misleading failure message.
Signed-off-by: Dave Chen <dave.chen@arm.com>
CSI drivers need to pass special mount options to the XFS filesystem to be able to
mount a volume and its clone, or its restored snapshot, on the same node. Add a
test to exhibit this behavior.
The test is optional for now, giving CSI drivers time to fix it.
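A hedged sketch of how such a mount option could be surfaced through a StorageClass; the provisioner name is a placeholder, and nouuid is used purely as an example of an XFS-specific option:
```go
package example

import (
	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// xfsStorageClass provisions XFS volumes and passes a mount option that
// allows mounting a filesystem whose UUID duplicates an already-mounted one.
func xfsStorageClass() *storagev1.StorageClass {
	return &storagev1.StorageClass{
		ObjectMeta:   metav1.ObjectMeta{Name: "example-xfs"},
		Provisioner:  "example.csi.vendor.com", // illustrative driver name
		Parameters:   map[string]string{"csi.storage.k8s.io/fstype": "xfs"},
		MountOptions: []string{"nouuid"},
	}
}
```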
These tests have been flaky for a long time, with a relatively low
rate of flakes. Nonetheless it seems better to extend the timeouts to
reduce the flakiness.
It was disabled together with the agent to avoid test failures in
gce-master-scale-correctness (https://github.com/kubernetes/kubernetes/issues/102452). That
solved the problem, but we still need to check whether the controller
alone works.
They are not needed for any of the tests and may be causing too much
overhead (see
https://github.com/kubernetes/kubernetes/issues/102452#issuecomment-854452816).
We already disabled them earlier and then re-enabled them again
because it wasn't clear how much overhead they were causing. A recent
change in how the sidecars get
deployed (https://github.com/kubernetes/kubernetes/pull/102282) seems
to have made the situation worse again. There's no logical explanation
for that yet, though.
(cherry picked from commit 0c2cee5676e64976f9e767f40c4c4750a8eeb11f)
As seen in https://github.com/kubernetes/kubernetes/issues/102452, we
currently don't have pod events for the CSI driver pods because of the
different namespace and would need them to determine whether the
driver gets evicted.
Previously, only changes to the pods were logged. Perhaps even more
interesting are events in the namespace.
The "[Feature:SCTP]" tag was needed on "should not allow access by TCP
when a policy specifies only SCTP" back when SCTP was alpha, because
it wasn't possible to create a policy that even mentioned SCTP without
enabling the feature gate. This no longer applies, and the tag was
removed from the original copy of network_policy.go, but accidentally
got left behind in the netpol/ version.
Likewise, the newly-added "should not allow access by TCP when a
policy specifies only UDP" got tagged "[Feature:UDP]", but this was
never necessary, and is inconsistent with other UDP tests anyway.
Similarly, we need "[Feature:SCTPConnectivity]" on tests that make
SCTP connections, because that functionality is not available in all
clusters, but "[Feature:UDPConnectivity]" is unnecessary and
inconsistent.
This image is the same as "gcr.io/authenticated-image-pulling/windows-nanoserver:v1"
that is used for the "should be able to pull from private registry with secret"
test on Windows.
Adding this image will allow other people to build and push their own
images to their own private registries.