Automatic merge from submit-queue (batch tested with PRs 43018, 42713)
Log instead of fail on GLBCs tendency to leak resources
**What this PR does / why we need it**:
Stops upgrade tests from flaking because the GLBC does not cleanup all resources due to a race condition.
**Which issue this PR fixes**: fixes#38569
**Special notes for your reviewer**:
To be reviewed by @mml
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 42775, 42991, 42968, 43029)
Add e2e test for Deployment controllerRef orphaning and adoption
Follow up #42908
@enisoc @kubernetes/sig-apps-bugs @kargakis
Automatic merge from submit-queue (batch tested with PRs 42775, 42991, 42968, 43029)
Initial breakout of scheduling e2es to help assist in assignment and refactoring
**What this PR does / why we need it**:
This PR segregates the scheduling specific e2es to isolate the library which will assist both in refactoring but also auto-assignment of issues.
**Which issue this PR fixes**
xref: https://github.com/kubernetes/kubernetes/issues/42691#issuecomment-285563265
**Special notes for your reviewer**:
All this change does is shuffle code around and quarantine. Behavioral, and other cleanup changes, will be in follow on PRs. As of today, the e2es are a monolith and there is massive symbol pollution, this 1st step allows us to segregate the e2es and tease apart the dependency mess.
**Release note**:
```
NONE
```
/cc @kubernetes/sig-scheduling-pr-reviews @kubernetes/sig-testing-pr-reviews @marun @skriss
/cc @gmarek - same trick for load + density, etc.
Automatic merge from submit-queue (batch tested with PRs 43034, 43066)
Allow StatefulSet controller to PATCH Pods.
**What this PR does / why we need it**:
StatefulSet now needs the PATCH permission on Pods since it calls into ControllerRefManager to adopt and release. This adds the permission and the missing e2e test that should have caught this.
**Which issue this PR fixes**:
**Special notes for your reviewer**:
This is based on #42925.
**Release note**:
```release-note
```
cc @kubernetes/sig-apps-pr-reviews
Automatic merge from submit-queue (batch tested with PRs 42942, 42935)
[Bug] Handle container restarts and avoid using runtime pod cache while allocating GPUs
Fixes#42412
**Background**
Support for multiple GPUs is an experimental feature in v1.6.
Container restarts were handled incorrectly which resulted in stranding of GPUs
Kubelet is incorrectly using runtime cache to track running pods which can result in race conditions (as it did in other parts of kubelet). This can result in same GPU being assigned to multiple pods.
**What does this PR do**
This PR tracks assignment of GPUs to containers and returns pre-allocated GPUs instead of (incorrectly) allocating new GPUs.
GPU manager is updated to consume a list of active pods derived from apiserver cache instead of runtime cache.
Node e2e has been extended to validate this failure scenario.
**Risk**
Minimal/None since support for GPUs is an experimental feature that is turned off by default. The code is also isolated to GPU manager in kubelet.
**Workarounds**
In the absence of this PR, users can mitigate the original issue by setting `RestartPolicyNever` in their pods.
There is no workaround for the race condition caused by using the runtime cache though.
Hence it is worth including this fix in v1.6.0.
cc @jianzhangbjz @seelam @kubernetes/sig-node-pr-reviews
Replaces #42560
Automatic merge from submit-queue
Allow DaemonSet controller to PATCH pods, and add more steps and logs in DaemonSet pods adoption e2e test
DaemonSet pods adoption failed because DS controller aren't allowed to patch pods when claiming pods.
[Edit] This PR fixes#42908 by modifying RBAC to allow DaemonSet controllers to patch pods, as well as adding more logs and steps to the original e2e test to make debugging easier.
Tested locally with a local cluster and GCE cluster.
@kargakis @lukaszo @kubernetes/sig-apps-pr-reviews
Automatic merge from submit-queue (batch tested with PRs 42940, 42906, 42970, 42848)
Move node and event observer helpers to e2e/common
**What this PR does / why we need it**:
Moves existing test helper functions in OIR e2e tests to `test/e2e/common`. These functions wrap informers to help test writers to observe events instead of long-polling for status updates.
For usage examples, see `test/e2e/opaque_resource.go`.
cc @kubernetes/sig-scheduling-misc
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue
Add fabianofranz as approver for test/e2e/kubectl.go
Adding myself as approver for `kubectl` end-to-end tests.
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 41794, 42349, 42755, 42901, 42933)
Fixes kubectl skew test failure when using kubectl.sh
Fixes leftovers from https://github.com/kubernetes/kubernetes/pull/42737.
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 41794, 42349, 42755, 42901, 42933)
Fix DefaultTolerationSeconds admission plugin
DefaultTolerationSeconds is not working as expected. It is supposed to add default tolerations (for unreachable and notready conditions). but no pod was getting these toleration. And api server was throwing this error:
```
Mar 08 13:43:57 fedora25 hyperkube[32070]: E0308 13:43:57.769212 32070 admission.go:71] expected pod but got Pod
Mar 08 13:43:57 fedora25 hyperkube[32070]: E0308 13:43:57.789055 32070 admission.go:71] expected pod but got Pod
Mar 08 13:44:02 fedora25 hyperkube[32070]: E0308 13:44:02.006784 32070 admission.go:71] expected pod but got Pod
Mar 08 13:45:39 fedora25 hyperkube[32070]: E0308 13:45:39.754669 32070 admission.go:71] expected pod but got Pod
Mar 08 14:48:16 fedora25 hyperkube[32070]: E0308 14:48:16.673181 32070 admission.go:71] expected pod but got Pod
```
The reason for this error is that the input to admission plugins is internal api objects not versioned objects so expecting versioned object is incorrect. Due to this, no pod got desired tolerations and it always showed:
```
Tolerations: <none>
```
After this fix, the correct tolerations are being assigned to pods as follows:
```
Tolerations: node.alpha.kubernetes.io/notReady=:Exists:NoExecute for 300s
node.alpha.kubernetes.io/unreachable=:Exists:NoExecute for 300s
```
@davidopp @kevin-wangzefeng @kubernetes/sig-scheduling-pr-reviews @kubernetes/sig-scheduling-bugs @derekwaynecarr
Fixes https://github.com/kubernetes/kubernetes/issues/42716
Automatic merge from submit-queue (batch tested with PRs 41794, 42349, 42755, 42901, 42933)
AppArmor cluster upgrade test
Add a cluster upgrade test for AppArmor. I still need to test this (having some trouble with the cluster-upgrade tests), but wanted to start the review process.
/cc @dchen1107 @roberthbailey
Automatic merge from submit-queue (batch tested with PRs 41794, 42349, 42755, 42901, 42933)
[Federation][e2e] Add framework for upgrade test in federation
Adding framework for federation upgrade tests. please refer to #41791
cc @madhusudancs @nikhiljindal @kubernetes/sig-federation-pr-reviews
The Deployment controller was not propagating ReadyReplicas to underlying clusters causing these errors:
```
Error syncing cluster controller: Deployment.apps "federation-deployment" is invalid: status.availableReplicas: Invalid value: 5: cannot be greater than readyReplicas
```
This was caught in e2e testing and is a 1.6 regression for support that was added in #37959. Without this fix, users will be unable to scale up their deployments.
Automatic merge from submit-queue (batch tested with PRs 42608, 42444)
Return nil when deleting non-exist GCE PD
When gce cloud tries to delete a disk, if the disk could not be found
from the zones, the function should return nil error. This modified behavior is also consistent with AWS
Automatic merge from submit-queue (batch tested with PRs 36704, 42719)
Extend timeouts in taints test to account for slow Pod deletions
Fix#42685
Before merging this we need a consensus on what to do with slow Pod deletions.
Automatic merge from submit-queue
e2e test: Log container output on TestContainerOutput error
When a pod started with TestContainerOutput or TestContainerOutputRegexp
fails from unknown reason, we should log all output of all its containers
so we can analyze what went wrong.
This would help us to see what wrong in https://github.com/kubernetes/kubernetes/issues/40811 - a container is running there for 3 minutes and dies and we want to see what it did for these 3 minutes.
```release-note
NONE
```
When a pod started with TestContainerOutput or TestContainerOutputRegexp
fails from unknown reason, we should log all output of all its containers
so we can analyze what went wrong.
Automatic merge from submit-queue (batch tested with PRs 42734, 42745, 42758, 42814, 42694)
Implement automated downgrade testing.
Node version cannot be higher than the master version, so we must
switch the node version first. Also, we must use the upgrade script
from the appropriate version for GCE.
Automatic merge from submit-queue (batch tested with PRs 42734, 42745, 42758, 42814, 42694)
Create DefaultPodDeletionTimeout for e2e tests
In our e2e and e2e_node tests, we had a number of different timeouts for deletion.
Recent changes to the way deletion works (#41644, #41456) have resulted in some timeouts in e2e tests. #42661 was the most recent fix for this.
Most of these tests are not meant to test pod deletion latency, but rather just to clean up pods after a test is finished.
For this reason, we should change all these tests to use a standard, fairly high timeout for deletion.
cc @vishh @Random-Liu