Automatic merge from submit-queue
AllowedNotReadyNodes allowed to be not ready for absolutely *any* reason
In effect, this allows that many nodes to never be part of the cluster at all.
Btw - currently our 5k-node correctness test fails on conditions like "kubelet stopped posting node status" or "route not created" (ref: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/3/build-log.txt)
cc @kubernetes/sig-scalability-misc
Automatic merge from submit-queue (batch tested with PRs 50213, 50707, 49502, 51230, 50848)
StatefulSet: Deflake e2e `kubectl exec` commands.
This may help with another source of flakiness found while investigating #48031.
We seem to get a lot of flakes due to "connection refused" while running `kubectl exec`. I can't find any reason this would be caused by the test flow, so I'm adding retries to see if that helps.
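A minimal sketch of the retry, assuming a `RunKubectl`-style helper from the e2e framework (the helper name, retry count, and sleep are illustrative, not the exact test code):
```go
// Illustrative only: retry a flaky `kubectl exec` a few times before giving up.
// RunKubectl stands in for the e2e framework's kubectl helper (an assumption);
// the usual "time" import is omitted for brevity.
func execInPodWithRetries(ns, pod, cmd string) (string, error) {
	var out string
	var err error
	for attempt := 0; attempt < 3; attempt++ {
		out, err = RunKubectl("exec", "--namespace="+ns, pod, "--", "/bin/sh", "-c", cmd)
		if err == nil {
			return out, nil
		}
		time.Sleep(5 * time.Second) // "connection refused" is usually transient
	}
	return out, err
}
```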
Automatic merge from submit-queue (batch tested with PRs 51224, 51191, 51158, 50669, 51222)
StatefulSet: Deflake e2e "restart" phase.
This addresses another source of flakiness found while investigating #48031.
The test used to scale the StatefulSet down to 0, wait for ListPods to return 0 matching Pods, and then scale the StatefulSet back up.
This was prone to a race in which StatefulSet was told to scale back up before it had observed its own deletion of the last Pod, as evidenced by logs showing the creation of Pod ss-1 prior to the creation of the replacement Pod ss-0.
Instead, we now wait for the controller to observe all deletions before scaling it back up. This should fix flakes of the form:
```
Too many pods scheduled, expected 1 got 2
```
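A hedged sketch of the new flow, with `scaleStatefulSet` standing in for the existing test helper and client calls written in the style of that release; polling the controller's reported status rather than listing Pods avoids racing its informer cache:
```go
// Illustrative only: scale to zero, then wait until the controller's own status
// reports zero replicas (i.e. it has observed every Pod deletion) before
// scaling back up. scaleStatefulSet is an assumed test helper.
func restartStatefulSet(c clientset.Interface, ns, name string, replicas int32) error {
	scaleStatefulSet(c, ns, name, 0)
	if err := wait.PollImmediate(2*time.Second, 5*time.Minute, func() (bool, error) {
		ss, err := c.AppsV1beta1().StatefulSets(ns).Get(name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		// Status reflects what the controller has observed, not just spec intent.
		return ss.Status.Replicas == 0, nil
	}); err != nil {
		return err
	}
	scaleStatefulSet(c, ns, name, replicas)
	return nil
}
```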
Automatic merge from submit-queue (batch tested with PRs 51108, 51035, 50539, 51160, 50947)
Auto-calculate CLUSTER_IP_RANGE based on cluster size
In preparation for eliminating the CLUSTER_IP_RANGE env var from job configs, which should make it less error-prone for folks trying to start their own large-cluster tests (https://github.com/kubernetes/kubernetes/issues/50907).
/cc @kubernetes/sig-scalability-misc @wojtek-t @gmarek
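For a rough idea of the arithmetic involved: give every node its own /24 of pod IPs and round the cluster range up to the next power of two. The real change lives in the cluster start-up scripts, so this Go sketch is only illustrative:
```go
// Illustrative only: pick a cluster CIDR mask large enough that every node can
// get its own /24 pod range. The real logic lives in the kube-up shell scripts
// and its exact size brackets may differ.
func clusterIPRangeSuffix(numNodes int) int {
	suffix := 24
	for nodes := 1; nodes < numNodes; nodes *= 2 {
		suffix-- // halving the mask doubles the number of available /24s
	}
	return suffix // e.g. 5000 nodes -> /11
}
```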
Automatic merge from submit-queue
StatefulSet: Deflake e2e "Saturate" phase.
This should reduce one source of flakiness found while investigating #48031.
The "Saturate" phase of StatefulSet e2e tests verifies orderly startup by controlling when each Pod is allowed to report Ready. If a Pod unexepectedly goes down during the test, the replacement Pod
created by the controller will forget if it was already allowed to report Ready.
After this change, the signal that allows each Pod to report Ready is persisted in the Pod's PVC. Thus, the replacement Pod will remember that it was already told to proceed to a Ready state.
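A hedged sketch of the mechanism: the "allowed to become Ready" marker is just a file on the PVC-backed volume, so a replacement Pod that remounts the same PVC inherits it. The path and exec helper below are illustrative:
```go
// Illustrative only: persist the "allowed to report Ready" signal on the Pod's
// PVC so a replacement Pod remounting the same volume sees it immediately.
// ExecInStatefulPod is an assumed exec helper; /data is the assumed PVC mount.
func resumePod(ns, pod string) error {
	// The container's readiness check is assumed to test for this marker file.
	_, err := ExecInStatefulPod(ns, pod, "touch /data/statefulset-continue")
	return err
}
```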
Automatic merge from submit-queue (batch tested with PRs 50893, 50913, 50963, 50629, 50640)
Increase latency threshold for list api calls
This is only a short-term solution to make our density test green. In the long-term, we should measure as per our new SLIs.
From @wojtek-t's [doc](https://docs.google.com/document/d/1Q5qxdeBPgTTIXZxdsFILg7kgqWhvOwY8uROEf0j5YBw) on the new SLIs/SLOs, we have the following SLO for list calls:
```
SLO1: In default Kubernetes installation, 99th percentile of SLI2 per cluster-day:
<= 1s if total number of objects of the same type as resource in the system <= X
<= 5s if total number of objects of the same type as resource in the system <= Y
<= 30s if total number of objects of the same type as resource in the system <= Z
```
I would guess that 170,000 pods would fall into the 2nd bracket (at least) and hence the new value of 5s. WDYT?
cc @kubernetes/sig-scalability-misc @wojtek-t @gmarek
The "Saturate" phase of StatefulSet e2e tests verifies orderly startup
by controlling when each Pod is allowed to report Ready.
If a Pod unexepectedly goes down during the test, the replacement Pod
created by the controller will forget if it was already allowed to
report Ready.
After this change, the signal that allows each Pod to report Ready is
persisted in the Pod's PVC. Thus, the replacement Pod will remember that
it was already told to proceed to a Ready state.
Automatic merge from submit-queue (batch tested with PRs 46512, 50146)
Make metav1.(Micro)?Time functions take pointers
Is there any reason for those functions not to be on pointers?
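The change amounts to moving helpers like these onto pointer receivers and pointer arguments, roughly as below (a sketch, not the exact diff; the real methods live in k8s.io/apimachinery/pkg/apis/meta/v1):
```go
// Sketch of the receiver/argument change: a nil *Time can now be treated as
// the zero value instead of forcing callers to copy.
func (t *Time) IsZero() bool {
	if t == nil {
		return true
	}
	return t.Time.IsZero()
}

func (t *Time) Before(u *Time) bool {
	if t != nil && u != nil {
		return t.Time.Before(u.Time)
	}
	return false
}
```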
What this PR does / why we need it:
This adds an e2e test for aggregation based on the sample-apiserver.
Currently it uses a sample-apiserver built as of 1.7.
This should ensure that the aggregation system works end-to-end.
It will also help detect if we break "old" extension api servers.
Which issue this PR fixes: fixes #43714
Fixed bazel for the change.
Fixed # of args issue from govet.
Added code to test dynamic.Client.
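A hedged sketch of the dynamic-client part, written against today's client-go dynamic API rather than the older dynamic.Client the test used at the time; the wardle group/resource names are assumptions:
```go
// Illustrative only: list the sample-apiserver's custom resource through the
// kube-apiserver, which exercises the aggregation path end to end.
func listFlunders(config *rest.Config) error {
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		return err
	}
	gvr := schema.GroupVersionResource{Group: "wardle.k8s.io", Version: "v1alpha1", Resource: "flunders"}
	_, err = client.Resource(gvr).Namespace("default").List(context.TODO(), metav1.ListOptions{})
	return err
}
```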
Automatic merge from submit-queue (batch tested with PRs 50550, 50768)
Don't SSH to master for metrics in case of GKE
cc @kubernetes/sig-scalability-misc @crassirostris
Automatic merge from submit-queue (batch tested with PRs 49904, 50484, 50214)
Adding support for internal IP for e2e tests
Currently IssueSSHCommand in util.go only checks for the external IP address to SSH to; this PR adds a check for the internal IP too.
Closes #50630
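A simplified sketch of the address selection with the new fallback (the surrounding SSH helper is omitted; only the standard v1 Node address types are used):
```go
// Simplified sketch: prefer the node's external IP for SSH, but fall back to
// the internal IP when no external address is set (e.g. private clusters).
func sshAddressForNode(node *v1.Node) (string, error) {
	var internal string
	for _, a := range node.Status.Addresses {
		switch a.Type {
		case v1.NodeExternalIP:
			if a.Address != "" {
				return a.Address + ":22", nil
			}
		case v1.NodeInternalIP:
			internal = a.Address
		}
	}
	if internal != "" {
		return internal + ":22", nil
	}
	return "", fmt.Errorf("no external or internal IP found on node %s", node.Name)
}
```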
Automatic merge from submit-queue (batch tested with PRs 50537, 49699, 50160, 49025, 50205)
When not using a CloudProvider, set both InternalIP and ExternalIP on Nodes
#36095 changed all of the cloudproviders to set both InternalIP and ExternalIP on Nodes, but the non-cloudprovider fallback code now only sets InternalIP.
This causes the test "should be able to create a functioning NodePort service" in test/e2e/service.go to fail on cloud-provider-less clusters because, with LegacyHostIP gone, it will now only try to work with ExternalIPs and will fail if the node has only an InternalIP.
There isn't much other code that assumes that ExternalIP will always be set (there's something in pkg/master/master.go, but I don't know what it's doing, so maybe it's only useful in the case where InternalIP != ExternalIP anyway). But given that several of the cloudproviders (mesos, ovirt, rackspace) now explicitly set both InternalIP and ExternalIP to the same value always, it seemed right to do that in the fallback case too.
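A simplified sketch of the fallback behavior described above (not the literal kubelet code; `ip` and `hostname` come from the surrounding node-status logic):
```go
// Simplified sketch: with no cloud provider, report the detected node IP as
// both InternalIP and ExternalIP, mirroring what several cloud providers do.
func setFallbackNodeAddresses(node *v1.Node, ip net.IP, hostname string) {
	node.Status.Addresses = []v1.NodeAddress{
		{Type: v1.NodeInternalIP, Address: ip.String()},
		{Type: v1.NodeExternalIP, Address: ip.String()},
		{Type: v1.NodeHostName, Address: hostname},
	}
}
```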
@deads2k FYI
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue
Remove deprecated ESIPP beta annotations
**What this PR does / why we need it**:
Remove deprecated ESIPP beta annotations.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #50187
**Special notes for your reviewer**:
/assign @MrHohn
/sig network
**Release note**:
```release-note
Beta annotations `service.beta.kubernetes.io/external-traffic` and `service.beta.kubernetes.io/healthcheck-nodeport` have been removed. Please use fields `service.spec.externalTrafficPolicy` and `service.spec.healthCheckNodePort` instead.
```
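For reference, setting the replacement fields in code looks roughly like this (a sketch using the core/v1 types; the port value is just an example, since it is normally allocated by the API server):
```go
// Sketch: the spec fields that replace the removed beta annotations.
func useLocalExternalTraffic(svc *v1.Service) {
	svc.Spec.ExternalTrafficPolicy = v1.ServiceExternalTrafficPolicyTypeLocal
	svc.Spec.HealthCheckNodePort = 31313 // example value; normally allocated automatically
}
```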
Automatic merge from submit-queue
Migrate to controller references helpers in meta/v1
**What this PR does / why we need it**:
This is a follow up for #48319 that migrates all method usages to new methods in meta/v1.
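The meta/v1 helpers being migrated to are used roughly like this (a sketch; the apps/v1 group version is only for illustration):
```go
// Sketch of the meta/v1 owner-reference helpers that replace the older copies.
func controllerRefExamples(pod *v1.Pod, rs *appsv1.ReplicaSet) {
	if ref := metav1.GetControllerOf(pod); ref != nil {
		_ = ref.Kind // the single controller owning this Pod, if any
	}
	if metav1.IsControlledBy(pod, rs) {
		// the Pod's controller reference points at this specific ReplicaSet
	}
	_ = metav1.NewControllerRef(rs, appsv1.SchemeGroupVersion.WithKind("ReplicaSet"))
}
```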
**Special notes for your reviewer**:
Looking at each commit individually might be easier.
**Release note**:
```release-note
NONE
```
/sig api-machinery
/kind cleanup
Automatic merge from submit-queue (batch tested with PRs 45186, 50440)
Add functionality needed by Cluster Autoscaler to Kubemark Provider.
Make adding nodes asynchronous. Add method for getting target
size of node group. Add method for getting node group for node.
Factor out some common code.
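A hedged sketch of the shape of the additions (the receiver type and method names are approximate, not the exact provider API):
```go
// Approximate shape only; the real kubemark provider API may differ.
type KubemarkProvider struct{}

// AddNodes returns once the resize is queued rather than blocking until the
// hollow nodes are running, so callers can poll GetTargetSize for progress.
func (p *KubemarkProvider) AddNodes(nodeGroup string, delta int) error { return nil }

// GetTargetSize reports the desired (not yet current) size of a node group.
func (p *KubemarkProvider) GetTargetSize(nodeGroup string) (int, error) { return 0, nil }

// GetNodeGroupForNode maps a hollow node back to the node group that owns it.
func (p *KubemarkProvider) GetNodeGroupForNode(node string) (string, error) { return "", nil }
```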
**Release note**:
```
NONE
```
Automatic merge from submit-queue
Add waitForFailure for e2e test framework
**What this PR does / why we need it**:
Add waitForFailure to the e2e test framework; this could reduce the reliance on logs.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*:
Part of #44118. Refer https://github.com/kubernetes/kubernetes/pull/48858#discussion_r128331726
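A hedged sketch of what such a helper looks like (framework and client call shapes are approximate for that release):
```go
// Approximate sketch: block until the pod has failed, instead of scraping logs
// to detect the failure.
func waitForFailure(f *framework.Framework, podName string, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		pod, err := f.ClientSet.CoreV1().Pods(f.Namespace.Name).Get(podName, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		return pod.Status.Phase == v1.PodFailed, nil
	})
}
```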
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue
Add a simple cloud provider for e2e tests on kubemark
**What this PR does / why we need it**:
Adds a simplified cloud provider for kubemark. This enables us to add and
remove nodes and operate on nodegroups while running tests on kubemark.
This is needed to run scalability tests for cluster autoscaler on kubemark.
See https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/kubemark_integration.md
**Release note**:
```
NONE
```
Automatic merge from submit-queue
StatefulSet scale subresource
**What this PR does / why we need it**: This PR implements scale subresource for StatefulSet.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #46005
**Special notes for your reviewer**:
**Release note**:
```release-note
StatefulSet uses scale subresource when scaling in accord with ReplicationController, ReplicaSet, and Deployment implementations.
```
**Feature Checklist**:
- [x] Introduce Registry interface for storage purpose
- [x] Introduce `ScaleREST New(), Get() and Update()` utility functions
- [x] Create a `ScaleREST` object at `NewREST()` and return it
- [x] Enable scale subresource by adding `/scale` field to the storage map
**Testing Checklist**:
- Unit testing
- [x] Modify `newStorage()` to call `NewStorage()`, and change all unit tests accordingly
- [x] Add unit tests for `ScaleREST Get() and Update()` utility functions
- [x] Add missing unit test for `ShortNames`
- Manual testing
- [x] Verify existence of the subresource using `kubectl proxy` command
- [x] Modify the subresource using `curl` via `POST`
- e2e testing
- [x] Add e2e tests using `RESTClient`
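The manual and e2e checks above amount to hitting the new endpoint directly; a hedged sketch using the generic REST client (the apps group version in the path depends on the release and may differ):
```go
// Hedged sketch: read the new /scale endpoint via a RESTClient; client-go call
// shapes are as of that era and the apps version (v1beta1 vs v1beta2) may vary.
func getStatefulSetScale(c clientset.Interface, ns, name string) ([]byte, error) {
	return c.AppsV1beta2().RESTClient().Get().
		AbsPath("/apis/apps/v1beta2/namespaces/" + ns + "/statefulsets/" + name + "/scale").
		DoRaw()
}
```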
Automatic merge from submit-queue (batch tested with PRs 49524, 46760, 50206, 50166, 49603)
Handle taints on nodes in batch.
**What this PR does / why we need it**:
Enhanced helpers to handle taints on nodes in batch.
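A hedged sketch of the batch form of the helpers (the helper name mirrors the controller utilities but the exact signature may differ):
```go
// Approximate sketch: the enhanced helper accepts several taints in one call
// instead of one call per taint.
func taintNodeForTest(c clientset.Interface, nodeName string) error {
	taints := []*v1.Taint{
		{Key: "dedicated", Value: "test", Effect: v1.TaintEffectNoSchedule},
		{Key: "experimental", Effect: v1.TaintEffectNoExecute},
	}
	return controller.AddOrUpdateTaintOnNode(c, nodeName, taints...)
}
```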
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes#49522
**Release note**:
```release-note
NONE
```
This is needed for the cluster autoscaler e2e tests to run on kubemark. We need
the ability to add and remove nodes and operate on node groups, which Kubemark
does not provide at the moment.
Automatic merge from submit-queue (batch tested with PRs 50091, 50231, 50238, 50236, 50243)
Fix storage tests for multizone test configuration.
**What this PR does / why we need it**:
This PR modifies the "[sig-storage] Volumes PD should be mountable with (ext3|ext4)" tests to schedule pods in the zone where the PD is created.
This makes the test work in a multizone environment.
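A sketch of the pinning: the test looks up the zone the PD was created in and constrains the client pod to it with a node selector on the standard (then-beta) zone label:
```go
// Sketch: schedule the test pod in the same zone as the PD by selecting on the
// failure-domain zone label that nodes carry.
func pinPodToZone(pod *v1.Pod, pdZone string) {
	pod.Spec.NodeSelector = map[string]string{
		"failure-domain.beta.kubernetes.io/zone": pdZone,
	}
}
```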
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
```
Automatic merge from submit-queue (batch tested with PRs 50091, 50231, 50238, 50236, 50243)
Move the sig-instrumentation test to a dedicated folder
Move the last remaining test to the sig-instrumentation folder, and also move the "metrics" package to the "framework" folder.
Related issue: https://github.com/kubernetes/kubernetes/issues/49161
/cc @xiangpengzhao @piosz
Automatic merge from submit-queue (batch tested with PRs 50091, 50231, 50238, 50236, 50243)
Modify e2e.go to arbitrarily pick one of the zones we have nodes in for multizone tests.
**What this PR does / why we need it**:
When e2e runs in a multizone configuration, the zone config property can be empty.
In that case, this PR overrides the empty value with an arbitrarily chosen zone that we have nodes in.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
```