Automatic merge from submit-queue
In 'kubectl describe', find controllers with ControllerRef, instead of showing the original creator
@enisoc @kargakis @kubernetes/sig-apps-pr-reviews @kubernetes/sig-cli-pr-reviews
```release-note
In 'kubectl describe', find controllers with ControllerRef, instead of showing the original creator.
```
Automatic merge from submit-queue
Exit from NewController() for PersistentVolumeController when InitPlugins() failed
Exit from NewController() for PersistentVolumeController when InitPlugins() failed just like NewAttachDetachController() does
**Release note**:
```release-note
NONE
```
@jsafrane @saad-ali PTAL. Thanks in advance
Automatic merge from submit-queue
Move api helpers.go to a subpackage
Part of https://github.com/kubernetes/kubernetes/issues/44065.
This PR moves the pkg/api/helpers.go to its own subpackage. It's mostly a mechanic move, except that
* I removed ConversionError in helpers.go, it's not used by anyone
* I moved the 3 methods of Taint and Toleration to pkg/api/methods.go, and left a TODO saying refactoring these methods to functions.
I'll send a few more PRs to make the k8s.io/kubernetes/pkg/api package only contains the code we want in the k8s.io/api repo, then we can run a [script](a0015fd1be (diff-7a2fbb4371972350ee414c6b88aee1c8)) to cut the new repo.
Automatic merge from submit-queue
Remove alphaProvisioner in PVController and AlphaStorageClassAnnotation
remove alpha annotation and alphaProvisioner
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue
Remove [Flaky] from presistent volume NFS tests.
**What this PR does / why we need it**:
PV e2e test flaked because of a lack of isolation of test objects (PVs, claims). The common symptoms being volume-mounting pods timing out or an unexpected number of unbound claims being detected.
PR #43645 Introduced the selector labels to test objects that restricted binds to within each "testspace". To accomplish this the test namespace was set as a label for all PVs and selector for all Claims. This allows each test to act as if it's isolated while still creating and binding PVs in parallel w/ other tests.
This has been tested in parallel on both 3-node gci and debian clusters as
`--ginkgo.focus="\[Volume\]" --ginkgo.skip="\[Disruptive\]|\[Feature.*\]|\[Serial\]"`
with out signs of flakiness.
cc @jeffvance
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 43304, 41427, 43490, 44352)
Update etcd-client godep to 3.1.5
This transitively level sets the godeps to yank in the 3.1.5 client.
Currently WIP, b/c it required some regen and I had some weird local permissions issue.
xref: #41143
/cc @xiang90 @mml
Automatic merge from submit-queue (batch tested with PRs 43304, 41427, 43490, 44352)
Node failure tests for cluster autoscaler
E2e tests checking whether CA is still working with a single broken node.
cc: @MaciekPytel @jszczepkowski @fgrzadkowski
Automatic merge from submit-queue (batch tested with PRs 43545, 44293, 44221, 43888)
make unstructured items correspond to other items for storage
"normal" `Items` elements include the struct itself, not a pointer to the struct. Some of the deeper bits of storage rely on this behavior in reflective paths.
This updates the `UnstructuredList` to be "normal".
@kubernetes/sig-api-machinery-pr-reviews
Automatic merge from submit-queue (batch tested with PRs 43844, 44284)
Add a retry to cluster-autoscaler e2e
This should fix https://github.com/kubernetes/kubernetes/issues/44268.
The flake was caused by following sequence of events:
1. Cluster was at minimum size (3), some node was unneeded for a while.
2. Setup for some test (scale-down, failure) would increase node group size (to 5) and wait for new nodes to come up.
3. As soon as new node come up (cluster size 4) CA would scale-down the old unneeded node (setting node group size to 4).
4. Node group would not reach size 5 (as the target was now 4) and the test would timeout and fail.
This PR makes the setup monitor re-set the target node group size if the above scenario happens.
Automatic merge from submit-queue (batch tested with PRs 43866, 42748)
hack/cluster: download cfssl if not present
hack/local-up-cluster.sh uses cfssl to generate certificates and
will exit it cfssl is not already installed. But other cluster-up
mechanisms (GCE) that generate certs just download cfssl if not
present. Make local-up-cluster.sh do that too so users don't have
to bother installing it from somewhere.
Automatic merge from submit-queue (batch tested with PRs 41758, 44137)
Removed hostname/subdomain annotation.
fixes#44135
```release-note
Remove `pod.beta.kubernetes.io/hostname` and `pod.beta.kubernetes.io/subdomain` annotations.
Users should use `pod.spec.hostname` and `pod.spec.subdomain` instead.
```
Automatic merge from submit-queue
test/e2e*: add/update README.md files
**What this PR does / why we need it**:
This PR is adding `README.md` files with a link to the documentation to all E2E tests.
Automatic merge from submit-queue (batch tested with PRs 44019, 42225)
federation: Fixing runtime-config support for federation-apiserver
Fixes https://github.com/kubernetes/kubernetes/issues/42587
Ref https://github.com/kubernetes/kubernetes/issues/38593
Fixing the broken `--runtime-config` flag support in federation-apiserver. Fixing the bugs and using it to disable batch and autoscaling groups. Users can enable them by passing `--runtime-config=apis/all=true` to federation-apiserver.
~This also includes a bug fix to kube-apiserver registry that allows users to disable api/v1 resources~
cc @kubernetes/sig-federation-pr-reviews
Automatic merge from submit-queue
Scheduler can recieve its policy configuration from a ConfigMap
**What this PR does / why we need it**: This PR adds the ability to scheduler to receive its policy configuration from a ConfigMap. Before this, scheduler could receive its policy config only from a file. The logic to watch the ConfigMap object will be added in a subsequent PR.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
**Special notes for your reviewer**:
**Release note**:
```Add the ability to the default scheduler to receive its policy configuration from a ConfigMap object.
```
Automatic merge from submit-queue
Add node e2e tests for hostPid
For #44118. Add node e2e tests for hostPid.
cc @yujuhong @kubernetes/sig-node-pr-reviews
Automatic merge from submit-queue
Add test for provisioning with storage class
This PR re-introduces e2e test for dynamic provisioning with storage classes.
It adds the same test as it was merged in PR #32485 with an extra patch adding region to AWS calls. It works well on my AWS setup, however I'm using shared company account and I can't run kube-up.sh and run the tests in the "official" way.
@zmerlynn, can you please try to run tests that led to #34961?
@justinsb, you're my AWS guru, would there be a way how to introduce fully initialized AWS cloud provider into e2e test framework? It would simplify everything. GCE has it there, but it's easier to initialize, I guess. See https://github.com/kubernetes/kubernetes/blob/master/test/e2e/pd.go#L486 for example - IMO tests should not talk to AWS directly.
Automatic merge from submit-queue (batch tested with PRs 43373, 41780, 44141, 43914, 44180)
move PD delete to AfterEach
**What this PR does / why we need it**:
Fixes a bug in PersistentVolume E2E that causes GCEPDs to be left un-deleted after a test ends.
```release-note
None
```
Automatic merge from submit-queue
fix wrong error return
Signed-off-by: Crazykev <crazykev@zju.edu.cn>
**What this PR does / why we need it**: The err return here is wrong, correct it.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
```
Automatic merge from submit-queue (batch tested with PRs 44191, 44117, 44072)
[Federation] Cleanup e2e framework
This PR is intended to simplify maintenance of the federation e2e tests:
- refactor cluster helpers into framework so they can be reused by upgrade tests
- remove unused functions
- move backend pod management to the tests that require it rather than relying on the common cluster object
- use a slice instead of a map to manage cluster objects to make it simpler for tests that need it to find the primary cluster themselves
Commit ``fed: Refactor e2e cluster functions into framework for reuse`` was originally included in #43500, but is included here because it formed the basis for this cleanup.
cc: @kubernetes/sig-federation-pr-reviews @perotinus
Automatic merge from submit-queue (batch tested with PRs 44191, 44117, 44072)
Fix pv upgrade test failure
**What this PR does / why we need it**:
2 changes were made that broke the pv upgrade test:
1. MakePod() was changed so that the volume mount point is at /mnt/volume<n> instead of /mnt. This change updates the path that the pod writes to.
2. The MakePod() call was changed to CreatePod(), which creates the pod and expects a long running pod. But the upgrade test uses a short running pod, and the call to TestContainerOutput() will also create the pod. This changes it back to MakePod()
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes#44168
**Special notes for your reviewer**:
**Release note**:
NONE
Automatic merge from submit-queue
[Federation] Add integration test for secrets
This PR adds an integration test for secrets that:
- performs create/read/update/delete on federation resources and validates that the changes are propagated to member clusters.
- uses an abstraction layer (fixture and adapter) to minimize the code required to support each federated type
- It should be possible to replace a test-specific adapter with a runtime adapter in the future (as per #41050)
- reuses fixture (federation api and clusters) across different resource types to minimize setup overhead
- on a fast machine, setup takes ~4s, and validating each type takes ~2s
- uses the [Subtest feature added in Go 1.7](https://blog.golang.org/subtests) to allow the test for a specific controller to be run in isolation
- ``make test-integration WHAT="federation -test.run=TestFederationCRUD/secret"``
Once this PR merges the test can be extended to target other federated types.
This PR targets #40705
cc: @kubernetes/sig-federation-pr-reviews @derekwaynecarr
Automatic merge from submit-queue (batch tested with PRs 41189, 43818)
Reduce deployment replicas in e2e tests
Fixes https://github.com/kubernetes/kubernetes/issues/41063
There are still two tests that run multiple replicas (testScaledRolloutDeployment, testIterativeDeployments), we may want to revisit them at some point.
Automatic merge from submit-queue (batch tested with PRs 43871, 44053)
Fix original object mutation on patch retry
When applying a patch, the original state of the object and patch are captured so that retries can determine if an overlapping update was made to the object, and the patch API call should abort with a conflict.
However, the `strategicPatchObject -> applyPatchToObject -> StrategicMergeMapPatch -> mergeMap` call mutates both the original object map and the patch map, making them unsuitable for reusing for subsequent calls.
We saw this in a [downstream test](https://github.com/openshift/origin/blob/master/test/integration/patch_test.go) that exercises patch conflict retries, where the data being submitted in a patch was showing up in the original object data map.
Since mergeMap *also* mutates the patch map passed in (deletes patch directives as it encounters them), we *also* cannot reuse the patch map in `applyPatchToObject` once it has been used.
This PR:
* Builds `originalObjMap` separate from the initial patch application for later use in conflict detection
* Changes the `originalPatchMap` to a helper that returns a fresh map from original sources
* Adds an integration test that exercises retry of non-overlapping patches with patches containing `$patch` directives
Automatic merge from submit-queue (batch tested with PRs 44143, 44133)
Mark PD test as flaky.
**What this PR does / why we need it**: Marks "PD should be mountable" as flaky. See #43977, which shows that it has flaked at least 7 times in the last 3 days. @kubernetes/sig-storage-test-failures
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue
Extract e2e utility code into framework
**What this PR does / why we need it**:
There's a growing dependency on Volume e2e utilities related to creating / test against NFS volumes. For this reason, it's useful to relocate the relevant functions to the `framework` pkg. Doing so makes these utility functions available to e2e tests outside the `e2e` package.
This PR only moves code from the `e2e` package to `framework` and handle the relevant changes in calls. It does not change any logic.
```release-note
NONE
```
@jingxu97 I think there's value here in reducing duplicate code in the `common` package, given that these functions have been copied down to it. However, there's been some divergence. Can you PTAL and let me know if there's any reason we can't remove the duplicate `common` code?
cc @jeffvance
Automatic merge from submit-queue
Fix Cluster-Autoscaler e2e on testgrid
Fix an e2e test failing in CI. The failures were caused by maximum nodes in test env being lower than the number of nodes required by the test.
Automatic merge from submit-queue (batch tested with PRs 43963, 43965)
Wait for clean old RSs statuses in the middle of Recreate rollouts
After https://github.com/kubernetes/kubernetes/pull/43508 got merged, we started returning ReplicaSets with no pods but with stale statuses back to the rollout functions. As a consequence, one of our e2e tests that checks if a Recreate Deployment runs pods from different versions, started flakying because the Deployment status may be incorrect. This change simply waits for the statuses to get cleaned up before proceeding with scaling up the new RS.
Fixes https://github.com/kubernetes/kubernetes/issues/43864
@kubernetes/sig-apps-bugs
Automatic merge from submit-queue (batch tested with PRs 44104, 43903, 44109)
Make sure Teardown is called.
This will ensure that tests get a chance to clean up resources even if
setup failed part way through.
Automatic merge from submit-queue (batch tested with PRs 44097, 42772, 43880, 44031, 44066)
[Federation] Improve e2e test setup
This PR improves federation e2e test setup:
- reuses e2e framework setup (``NewDefaultFramework``) instead of duplicating it
- ensures ``FederationAfterEach`` is called if an error occurs in ``FederationBeforeEach`` (as per the [example](https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/framework.go#L161) of the e2e framework)
- skips creation of a test namespace in the hosting cluster (not used for a federation e2e test)
cc: @kubernetes/sig-federation-pr-reviews @kubernetes/sig-testing-pr-reviews
```release-note
kube-apiserver: --service-account-lookup now defaults to true. This enables service account tokens to be revoked by deleting the Secret object containing the token.
```
- reuse default framework setup rather than duplicating
- skip namespace creation for each test in hosting cluster
- ensure FederationAfterEach is called if BeforeEach fails
Moved remaining util functions
moved cinder specific function back to volumes.go, will have to be extracted later when a cinder e2e package is created.
remove dupe code from common/volume.go
Moved [Volume] tags to KubeDescribe
Automatic merge from submit-queue
node e2e: improve the validate OOM score test for infra containers
The test blindly checked all "pause" processes on the node, assuming
they were all infra containers. This change takes a snapshot of all
existing "pause" processes on the node, and exclude them in the
validation. The test still relies on the fact that it runs exclusively
on the node. If that assumption changes, we will need other methods to
locate the PIDs of the infra containers.
This fixes#37580
Automatic merge from submit-queue
Move autoscaling e2e tests to a separate directory
For fine-grain access control. Autoscaling team is expanding the e2e test coverage and the need for getting an approval for every PR is annoying.
cc: @MaciekPytel @jszczepkowski @fgrzadkowski @wojtek-t
Automatic merge from submit-queue
Support status.hostIP in downward API
**What this PR does / why we need it**:
Exposes pod's hostIP (node IP) via downward API.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*:
fixes https://github.com/kubernetes/kubernetes/issues/24657
**Special notes for your reviewer**:
Not sure if there's more documentation that's needed, please point me in the right direction and I will add some :)
Automatic merge from submit-queue
Add/remove kargakis from a couple of places
I've replaced myself in some tests in test_owners with the actual owners of those tests and I've also picked up a bunch of deployment tests. Also due to lack of review bandwidth I am removing myself from sig_cli reviewers.
@janetkuo
Automatic merge from submit-queue
test/e2e_node: prepull images with CRI
Part of https://github.com/kubernetes/kubernetes/issues/40739
- This PR builds on top of #40525 (and contains one commit from #40525)
- The second commit contains a tiny change in the `Makefile`.
- Third commit is a patch to be able to prepull images using the CRI (as opposed to run `docker` to pull images which doesn't make sense if you're using CRI most of the times)
Marked WIP till #40525 makes its way into master
@Random-Liu @lucab @yujuhong @mrunalp @rhatdan
Automatic merge from submit-queue
tests: e2e-node: refactor node-problem-detector test to avoid selinux…
Fixes https://github.com/kubernetes/kubernetes/issues/43401
This test creates a file in /tmp on the host and creates a HostPath volume in the container to so that the host can inject messages into the logfile being read by the node problem detector in the container. However, selinux prohibits the container from reading files out of /tmp on the host.
This PR modifies the test to create the log file in an EmptyDir volume instead, which will be properly labeled for container access.
@derekwaynecarr
Automatic merge from submit-queue
Add derekwaynecarr to approvers list for test/e2e_node
I am a maintainer and have approve rights in `pkg/kubelet`.
/cc @vishh
Automatic merge from submit-queue (batch tested with PRs 42667, 43923)
Adding test to perform volume operations storm
**What this PR does / why we need it**:
Adding new test to perform volume operations storm
Test Steps
1. Create storage class for thin Provisioning.
2. Create 30 PVCs using above storage class in annotation, requesting 2 GB files.
3. Wait until all disks are ready and all PVs and PVCs get bind. (**CreateVolume** storm)
4. Create pod to mount volumes using PVCs created in step 2. (**AttachDisk** storm)
5. Wait for pod status to be running.
6. Verify all volumes accessible and available in the pod.
7. Delete pod.
8. wait until volumes gets detached. (**DetachDisk** storm)
9. Delete all PVCs. This should delete all Disks. (**DeleteVolume** storm)
10. Delete storage class.
This test will help validate issue reported at https://github.com/vmware/kubernetes/issues/71
**Which issue this PR fixes**
fixes #
**Special notes for your reviewer**:
executed test on 1.5.3 release with `VOLUME_OPS_SCALE` set to `5`
Will execute test with the changes made on PR - https://github.com/kubernetes/kubernetes/pull/42422, with the `VOLUME_OPS_SCALE` set to `30`
**Release note**:
```release-note
None
```
cc: @abrarshivani @BaluDontu @tusharnt @pdhamdhere @luomiao @kerneltime
Per Clayton's suggestion, move stuff from cluster/lib/util.sh to
hack/lib/util.sh. Also consolidate ensure-temp-dir and use the
hack/lib/util.sh implementation rather than cluster/common.sh.
Automatic merge from submit-queue (batch tested with PRs 42662, 43035, 42578, 43682)
Apply [Feature:Volumes] to subset of tests in volumes.go
**What this PR does / why we need it**:
According to [this test doc](https://github.com/kubernetes/community/blob/master/contributors/devel/e2e-tests.md#kinds-of-tests), the `[Feature:xyz]` tag indicates that the tests have a non-default requirements and should not be run in the core suite. Previously all test in _e2e/volumes.go_ were tagged `[Feature:Volumes]` but I believe only six of the tests warrant this tag: iSCSI, GlusterFS, Ceph-RBD, Ceph-FS, vSPhere, and Cinder. The remaining tests (NFS, PD, ConfigMap) do not need the `Feature:` tag.
**Special notes for your reviewer**:
I moved the `[Volume]` tags from the `It`'s to the outer `KubeDescribe` to simplify the `It` text and remove redundancy.
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 42662, 43035, 42578, 43682)
[Federation] Fix a typo in the ReplicaSet E2E test: s/extrace/extract/
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 43726, 43643)
Make a smaller redis image for testing, based on Alpine.
**What this PR does / why we need it**:
This shrinks gcr.io/google_containers/redis from 400MB to 5MB, which should reduce flakes.
**Which issue this PR fixes**:
fixes#43631
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 42617, 43247, 43509, 43644, 43820)
Adding a SSH tunnel check to GKE upgrades
**What this PR does / why we need it**: After an upgrade, it takes some time fore the master to re-establish SSH tunnels to the nodes. This adds a wait for GKE since it runs in this configuration.
**Which issue this PR fixes**: fixes#43611, #43612
Automatic merge from submit-queue (batch tested with PRs 38741, 41301, 43645, 43779, 42337)
De-Flake PersistentVolume E2E: Isolate resources with selectors
PersistentVolume tests continue to flake because tests Claims often bind to other, parallel tests' volumes. This is addressed by isolating test resource with selector labels specific to each test namespace. Doing so enables deterministic binding within each test space and allows tests to run in parallel.
1. Assign selector label to volumes and claims per test.
2. Remove `[Serial]` tag
cc @jeffvance
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 38741, 41301, 43645, 43779, 42337)
replicaset adopt and release e2e tests
**What this PR does / why we need it**: adds e2e tests for replicaset pod adoption and release
**Which issue this PR fixes** fixes#39962
**Special notes for your reviewer**: for adoption pod is created, a replicaset with a selector matching the pod's label is created and then we check for the pod being in the list of pods of the replicaset. For release, a replicaset with two replicas is created, the label of one of them is patched and then we check that that pod is no longer in the list of pods of the replicaset.
For checking the list of pods the `framework.PodsCreated` is used, maybe the name is not very adecuate given that it is used to return the pods belonging to one replicaset, no matter how they were created. Also, the patching is done through `framework.KubectlOrDie`, not sure if this is better that calling the API directly.
cc @liggitt
Comments, log lines, exported MakePersistentVolume
Moved pv/pvcConfig assignment out of diskName check, nil ptrs and configs afterEach
change label/selector code to use k8s defined types from generic string:string maps
adjust for gce refactor
Automatic merge from submit-queue
Extract GCEPD pv tests
**What this PR does / why we need it**:
This is strictly a refactor moving the GCEPD suite in Persistent Volumes E2E to it's own file. This will make future provider specific additions to pv testing much more organized and readable. It will also enable a smoother transition to providers moving out of tree by consolidating related tests.
```release-note
NONE
```
Automatic merge from submit-queue
Volume Provisioning E2E: test PVC delete causes PV delete
**What this PR does / why we need it**:
Test for a regression addressed in #21268. There was a case where the PVC being created and deleted quickly may result in a provisioned PV left behind as `Available.`
```release-note
NONE
```
cc @jeffvance
Removed wait for PVC phase Pending.
iterate test 100 times to increase chance of regression
Moved claim obj assignment out of loop.
add wait loop check for PVs
loop until no PVs detected
refactor per git comments
replace api calls with framework wrappers
add default suffix
Automatic merge from submit-queue
Fix problems of not-starting image pullers
In e2e.go there are the following lines:
https://github.com/kubernetes/kubernetes/blob/master/test/e2e/e2e.go#L150
```
if err := framework.WaitForPodsSuccess(c, metav1.NamespaceSystem, framework.ImagePullerLabels, imagePrePullingTimeout); err != nil {
// There is no guarantee that the image pulling will succeed in 3 minutes
// and we don't even run the image puller on all platforms (including GKE).
// We wait for it so we get an indication of failures in the logs, and to
// maximize benefit of image pre-pulling.
framework.Logf("WARNING: Image pulling pods failed to enter success in %v: %v", imagePrePullingTimeout, err)
}
```
However, few lines above:
https://github.com/kubernetes/kubernetes/blob/master/test/e2e/e2e.go#L143
we were waiting for all image pullers to actually enter Success state. It's pretty clear that the latter wasn't expected.
This PR is fixing this problem.
Ref #43728
@anhowe @davidopp
Automatic merge from submit-queue
Move cluster logging tests to a separate folder
Since there are several e2e tests for cluster logging and the infrastructure for them got complicated, it makes sense to move those tests to a separate folder.
Also, adding myself and Piotr to OWNERS of this directory as owners of the tests.
Automatic merge from submit-queue
Move DNS configmap tests to slow, serial suites
These tests take a long time due to the ConfigMap update interval
and may briefly disrupt DNS resolution in the cluster.
Automatic merge from submit-queue
Use ping to ip instead of wget google.com in net connectivity check
This is a flakey test and this commit reduces the number of dependent
systems involved with the flake.
Automatic merge from submit-queue (batch tested with PRs 42835, 42974)
remove legacy insecure port options from genericapiserver
The insecure port has been a source of problems and it will prevent proper aggregation into a cluster, so the genericapiserver has no need for it. In addition, there's no reason for it to be in the main kube-apiserver flow either. This pull removes it from genericapiserver and removes it from the shared kube-apiserver code. It's still wired up in the command, but its no longer possible for someone to mess up and start using in mainline code.
@kubernetes/sig-api-machinery-misc @ncdc
Automatic merge from submit-queue
proxy to IP instead of name, but still use host verification
I think I found a setting that lets us proxy to an IP and still do hostname verification on the certificate.
@liggitt @sttts Can you see if you agree that this knob does what I think it does? Last commit only, still needs tests.
Automatic merge from submit-queue (batch tested with PRs 41728, 42231)
Adding new tests to e2e/vsphere_volume_placement.go
**What this PR does / why we need it**:
Adding new tests to e2e/vsphere_volume_placement.go
Below is the tests description and test steps.
**Test Back-to-back pod creation/deletion with different volume sources on the same worker node**
1. Create volumes - vmdk2, vmdk1 is created in the test setup.
2. Create pod Spec - pod-SpecA with volume path of vmdk1 and NodeSelector set to label assigned to node1.
3. Create pod Spec - pod-SpecB with volume path of vmdk2 and NodeSelector set to label assigned to node1.
4. Create pod-A using pod-SpecA and wait for pod to become ready.
5. Create pod-B using pod-SpecB and wait for POD to become ready.
6. Verify volumes are attached to the node.
7. Create empty file on the volume to make sure volume is accessible. (Perform this step on pod-A and pod-B)
8. Verify file created in step 5 is present on the volume. (perform this step on pod-A and pod-B)
9. Delete pod-A and pod-B
10. Repeatedly (5 times) perform step 4 to 9 and verify associated volume's content is matching.
11. Wait for vmdk1 and vmdk2 to be detached from node.
12. Delete vmdk1 and vmdk2
**Test multiple volumes from different datastore within the same pod**
1. Create volumes - vmdk2 on non default shared datastore.
2. Create pod Spec with volume path of vmdk1 (vmdk1 is created in test setup on default datastore) and vmdk2.
3. Create pod using spec created in step-2 and wait for pod to become ready.
4. Verify both volumes are attached to the node on which pod are created. Write some data to make sure volume are accessible.
5. Delete pod.
6. Wait for vmdk1 and vmdk2 to be detached from node.
7. Create pod using spec created in step-2 and wait for pod to become ready.
8. Verify both volumes are attached to the node on which PODs are created. Verify volume contents are matching with the content written in step 4.
9. Delete POD.
10. Wait for vmdk1 and vmdk2 to be detached from node.
11. Delete vmdk1 and vmdk2
**Test multiple volumes from same datastore within the same pod**
1. Create volumes - vmdk2, vmdk1 is created in testsetup
2. Create pod Spec with volume path of vmdk1 (vmdk1 is created in test setup) and vmdk2.
3. Create pod using spec created in step-2 and wait for pod to become ready.
4. Verify both volumes are attached to the node on which pod are created. Write some data to make sure volume are accessible.
5. Delete pod.
6. Wait for vmdk1 and vmdk2 to be detached from node.
7. Create pod using spec created in step-2 and wait for pod to become ready.
8. Verify both volumes are attached to the node on which PODs are created. Verify volume contents are matching with the content written in step 4.
9. Delete POD.
10. Wait for vmdk1 and vmdk2 to be detached from node.
11. Delete vmdk1 and vmdk2
**Which issue this PR fixes**
fixes #
**Special notes for your reviewer**:
Executed tests against K8S v1.5.3 release
**Release note**:
```release-note
NONE
```
cc: @kerneltime @abrarshivani @BaluDontu @tusharnt @pdhamdhere
Automatic merge from submit-queue (batch tested with PRs 43149, 41399, 43154, 43569, 42507)
Distribute load in cluster load tests uniformly
This PR makes cluster logging load tests distribute logging uniformly, to avoid situation, where 80% of pods are allocated on one node and overall results are worse then it could be.
Automatic merge from submit-queue (batch tested with PRs 43429, 43416, 43312, 43141, 43421)
Make e2e-dns test more stable by increasing wait time
**What this PR does / why we need it**:
In many cases, 60 seconds are not enough for generating all dns results.
Since the probeCmd takes up to 600 seconds, use 600 here too.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*:
fixes#43264, #43094, #43100
**Special notes for your reviewer**:
**Release note**:
NONE
Automatic merge from submit-queue (batch tested with PRs 43378, 43216, 43384, 43083, 43428)
Darwin won't build: syscall.Sysinfo issue.
**What this PR does / why we need it**: On darwin had problems building and testing because of syscall.Sysinfo_t etc which is a linux specific command.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*:
**Special notes for your reviewer**: Definitely would like another set of eyes on the bootTime function, it will have to be inaccurate but open to suggestions about improving this for darwin.
**Release note**:
```
NONE
```
Automatic merge from submit-queue (batch tested with PRs 43144, 42671, 43226, 43314, 43361)
Removal of unused mesos e2e test.
**What this PR does / why we need it**:
Remove mesos e2e test which is not used.
```
NONE
```
/cc @sttts @k82cn
Automatic merge from submit-queue
add local option to APIService
APIServices need an option to avoid proxying in cases where the groupversion is handled later in the chain. This will allow a coherent and complete set of APIServices, but won't require extra connections.
@kubernetes/sig-api-machinery-misc @ncdc @cheftako
Automatic merge from submit-queue (batch tested with PRs 42672, 42770, 42818, 42820, 40849)
kubemark test: Bump addon-manager to v6.4-beta.1
Follow up PR of #42760. This PR bumps addon-manager to v6.4-beta.1 for kubemark test.
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 43048, 43624, 43649)
[Federation][e2e] Ingress delays and service DNS issues
Ingress has been seen to take >10 minutes to allocate an IP in some circumstances (even more so in parallel testing). Also, due to issues with Services and DNS, disable those tests so we can get a green grid (see #43646)
Automatic merge from submit-queue (batch tested with PRs 43653, 43654)
[Federation] Disable the E2E test for federated replica set rebalancing
We are able to reproduce the flaky failure locally, and can debug without running this on the CI.
Automatic merge from submit-queue (batch tested with PRs 43642, 43170, 41813, 42170, 41581)
Add the ability to customize federation system namespace in e2e turn up scripts.
**Release note**:
```release-note
NONE
```
Ingress has been seen to take >10 minutes to allocate an IP in
some circumstances (even more so in parallel testing). Also, due
to issues with Services and DNS, disable those tests so we can
get a green grid.
Automatic merge from submit-queue (batch tested with PRs 42237, 42297, 42279, 42436, 42551)
Cleanup federation_util.go in e2e/framework
The only function GetValidDNSSubdomainName in test/e2e/framework/federation_util.go is no longer used for some time now. so cleaning it up.
cc @kubernetes/sig-federation-pr-reviews @madhusudancs
Automatic merge from submit-queue (batch tested with PRs 42237, 42297, 42279, 42436, 42551)
Reword PVC polling message to log a more readable message.
**What this PR does / why we need it**:
Previous message used to report an error is misleading and poorly written. This PR changes the log to be more readable.
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 42237, 42297, 42279, 42436, 42551)
should replace errors.New(fmt.Sprintf(...)) with fmt.Errorf(...)
Signed-off-by: yupengzte <yu.peng36@zte.com.cn>
**What this PR does / why we need it**:
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
```
Automatic merge from submit-queue
Fix test for provisioning in unmanaged zone.
defer evaluates arguments of the deferred function immediately, so it actually
deleted a storage class and a claim before the test could do anything useful.
The test passed just accidentally, as the test is expected to time out. It
timed out from wrong reasons though.
@copejon @kubernetes/sig-storage-pr-reviews
```release-note
NONE
```
defer evaluates arguments of the deferred function immediately, so it actually
deleted a storage class and a claim before the test could do anything useful.
The test passed just accidentally, as the test is expected to time out. It
timed out from wrong reasons though.
Automatic merge from submit-queue
Bump CNI consumers to v0.5.1
**What this PR does / why we need it**:
- vendored CNI plugins properly handle `DEL` on missing resources
- update CNI version refs
**Which issue this PR fixes**
fixes#43488
**Release note**:
`bumps CNI to version v0.5.1 where plugins properly handle DEL on non existent resources`
Automatic merge from submit-queue
Increase delays between calling Stackdriver Logging API in e2e tests
Fix https://github.com/kubernetes/kubernetes/issues/43442
This is a temporary hack, proper solution will be implemented soon
Automatic merge from submit-queue (batch tested with PRs 43465, 43529, 43474, 43521)
Added retransmissions in service call by e2e resource consumer library.
Added retransmissions in service call by e2e resource consumer library.
Fixes#43187.
```release-note
NONE
```
Automatic merge from submit-queue
update influxdb dependency to v1.1.1 and change client to v2
**What this PR does / why we need it**:
1. it updates version of influxdb libraries used by tests to v1.1.1 to match version used by grafana
2. it switches influxdb client to v2 to address the fact that [v1 is being depricated](https://github.com/influxdata/influxdb/tree/v1.1.1/client#description)
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
**Special notes for your reviewer**:
cc @piosz
1. [vendor/BUILD](https://github.com/KarolKraskiewicz/kubernetes/blob/master/vendor/BUILD) didn't get regenerated after executing `./hack/godep-save.sh` so I left previous version.
Not sure how to trigger regeneration of this file.
2. `tests/e2e/monitoring.go` seem to be passing without changes, even after changing version of the client.
**Release note**:
```release-note
```
Automatic merge from submit-queue
e2e test for cluster-autoscaler draining node
**What this PR does / why we need it**:
Adds an e2e test for Cluster-Autoscaler removing a node with a pod running (by rescheduling the pod).
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
**Special notes for your reviewer**:
@mwielgus can you take a look?
**Release note**:
```release-note
```
Automatic merge from submit-queue
Unify test timeouts under a common name.
Some timeouts were too aggressive and since we've slowly been moving every controller to 5 minutes, consolidate everyone under ``federatedDefaultTestTimeout``. To aid in debugging some service-related issues, if a service cannot be deleted, we issue a kubectl describe on it prior to failing.
Automatic merge from submit-queue (batch tested with PRs 42452, 43399)
Fix faulty assumptions in summary API testing
**What this PR does / why we need it**:
1. on systemd, launch kubelet in dedicated part of cgroup hierarchy
1. bump allowable memory usage for busy box containers as my own local testing often showed values > 1mb which were valid per the memory limit settings we impose
1. there is a logic flaw today in how we report node.memory.stats that needs to be fixed in follow-on.
for the last issue, we look at `/sys/fs/cgroup/memory.stat[rss]` value which if you have global accounting enabled on systemd machines (as expected) will report 0 because nothing runs local to the root cgroup. we really want to be showing the total_rss value for non-leaf cgroups so we get the full hierarchy of usage.
bazel update
added new files to reflect that only one method has changed between arch types.
forgot to add changes to a commit.
changes made and gfmt run.
changed node_problem_detector to node_problem_detector_linux and made it linux only.
updated bazel
Automatic merge from submit-queue
Loosen requirements of cluster logging e2e tests, make them more stable
There should be an e2e test for cloud logging in the main test suite, because this is the important part of functionality and it can be broken by different components.
However, existing cluster logging e2e tests were too strict for the current solution, which may loose some log entries, which results in flakes. There's no way to fix this problem in 1.6, so this PR makes basic cluster logging e2e tests less strict.
Automatic merge from submit-queue (batch tested with PRs 43355, 42827)
[Federation] Rewrite ReplicaSet CRUD and Preferences tests.
I think `should create replicasets and rebalance them` test is still flaky. I still don't know the source of this flakiness. I will continue hunting. But it is a lot less flaky than before (or perhaps it even never passed before?). This PR could be merged now and flake hunting can happen in parallel.
```release-note
NONE
```
Automatic merge from submit-queue
Use storage.k8s.io/v1 in tests instead of v1beta1
This is trimmed version of #42477 and contains only tests of the new storage API. Together with #43285 it passes all dynamic provisioning tests on my GCE.
I did not change vsphere_utils.go and vsphere_volume_diskformat.go as @divyenpatel runs master vsphere tests with Kubernetes 1.5 - @divyenpatel, did I get it right?
@kubernetes/sig-storage-pr-reviews, @msau42, @ethernetdan
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 43313, 43257, 43271, 43307)
In DaemonSet e2e tests, use Patch instead of Update to avoid conflict
Fixes#43310
@marun @kargakis @lukaszo @kubernetes/sig-apps-bugs
Automatic merge from submit-queue
kubectl: Use v1.5-compatible ownership logic when listing dependents.
**What this PR does / why we need it**:
This restores compatibility between kubectl 1.6 and clusters running Kubernetes 1.5.x. It introduces transitional ownership logic in which the client considers ControllerRef when it exists, but does not require it to exist.
If we were to ignore ControllerRef altogether (pre-1.6 client behavior), we would introduce a new failure mode in v1.6 because controllers that used to get stuck due to selector overlap will now make progress. For example, that means when reaping ReplicaSets of an overlapping Deployment, we would risk deleting ReplicaSets belonging to a different Deployment that we aren't about to delete.
This transitional logic avoids such surprises in 1.6 clusters, and does no worse than kubectl 1.5 did in 1.5 clusters. To prevent this when kubectl 1.5 is used against 1.6 clusters, we can cherrypick this change.
**Which issue this PR fixes**:
Fixes#43159
**Special notes for your reviewer**:
**Release note**:
```release-note
```
Automatic merge from submit-queue (batch tested with PRs 42869, 43298, 43285)
Fix default storage class tests
Name of the default storage class is not "default", it must be discovered dynamically.
```release-note
NONE
```
This fixes flake `storageclasses.storage.k8s.io "default" not found` in #43261
Automatic merge from submit-queue (batch tested with PRs 42869, 43298, 43285)
Bumped Heapster to v1.3.0
``` release-note
Bumped Heapster to v1.3.0.
More details about the release https://github.com/kubernetes/heapster/releases/tag/v1.3.0
```
Automatic merge from submit-queue
Add retry to monitoring e2e
**What this PR does / why we need it**:
Add retry to monitoring e2e to prevent it from failing because heapster have not yet been started after cluster creation.
@piosz @jszczepkowski
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes#43024
**Special notes for your reviewer**:
**Release note**:
```release-note
```
The test blindly checked all "pause" processes on the node, assuming
they were all infra containers. This change takes a snapshot of all
existing "pause" processes on the node, and exclude them in the
validation. The test still relies on the fact that it runs exclusively
on the node. If that assumption changes, we will need other methods to
locate the PIDs of the infra containers.
In particular, we should not assume ControllerRefs are necessarily set.
However, we can still use ControllerRefs that do exist to avoid
interfering with controllers that do use it.
Automatic merge from submit-queue
Update npd to the official v0.3.0 release.
Update npd to the official release v0.3.0.
This also fixes a npd bug https://github.com/kubernetes/node-problem-detector/pull/98.
@dchen1107 @kubernetes/node-problem-detector-reviewers
Automatic merge from submit-queue
Add guards for StatefulSet and AppArmor upgrade testing
This PR adds automated upgrade infrastructure to allow test suites to know what versions and node images are going to be testing and whether or not they should be skipped. It also adds a guard to prevent StatefulSets from being tested with versions prior to 1.5.0, and a guard to prevent AppArmor from running on distros other than gci and ubuntu.
Automatic merge from submit-queue (batch tested with PRs 43180, 42928)
Fix waitForScheduler in scheduer predicates e2e tests
**What this PR does / why we need it**: Fixes waitForScheduler in e2e to resolve flaky tests in scheduler_predicates.go
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes#42691
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 43162, 43157)
Use beta default class annotation for default storageclass tests.
**What this PR does / why we need it**:
The default storageclasses are still installed with the beta annotation, so the test should explicitly use the beta annotation.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes#43150
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue
Add saad-ali and marun to test/OWNERS
/assign @saad-ali @marun
Also ensure that approvers are in the reviewer list, and sort both lists.
Automatic merge from submit-queue (batch tested with PRs 40964, 42967, 43091, 43115)
Add process debug information to summary test
Print out the processes in each system cgroup when the Summary API test fails, to help debug https://github.com/kubernetes/kubernetes/issues/40607
/cc @yujuhong @Random-Liu
Automatic merge from submit-queue (batch tested with PRs 40964, 42967, 43091, 43115)
fixes dswp flake
Sometimes a pod may not appear in desired state
of world immediately, we poll before failing.
It only adds additional 30s to tests in worst case.
Fixes#42990
cc @jingxu97
Automatic merge from submit-queue
Guarantee watch before action in e2e event observer helper function.
**What this PR does / why we need it**:
Adds a missing synchronization barrier to an e2e event observation helper function.
- This change should guarantee that in observeEventAfterAction,
the action is only executed after the informer begins watching
the event stream.
**Release note**:
```release-note
NONE
```
cc @kubernetes/sig-scheduling-pr-reviews @bsalamat
Automatic merge from submit-queue (batch tested with PRs 40404, 43134, 43117)
Fix ES cluster logging test
Fix#37324
Test was broken because fluentd-gcp now parses golang and fluentd-es doesn't
Automatic merge from submit-queue
Fix Deployment upgrade test.
**What this PR does / why we need it**:
When the upgrade test operates on Deployments in a pre-1.6 cluster (i.e. during the Setup phase), it needs to use the v1.5 deployment/util logic. In particular, the v1.5 logic does not filter children to only those with a matching ControllerRef.
**Which issue this PR fixes**:
Fixes#42738
**Special notes for your reviewer**:
**Release note**:
```release-note
```
cc @kubernetes/sig-apps-pr-reviews
Automatic merge from submit-queue (batch tested with PRs 43106, 43110)
Wait for garbagecollector to be synced in test
Fix#42952
Without the `cache.WaitForCacheSync` in the test, it's possible for the GC to get a merged event of RC's creation and its update (update to deletionTimestamp != 0), before GC gets the creation event of the pod, so it's possible the GC will handle the foreground deletion of the RC before it adds the Pod to the dependency graph, thus the race.
With the `cache.WaitForCacheSync` in the test, because GC runs a single thread to process graph changes, it's guaranteed the Pod will be added to the dependency graph before GC handles the foreground deletion of the RC.
Note that this pull fixes the race in the test. The race described in the first point of #26120 still exists.
Automatic merge from submit-queue
Retry calls to ReadFileViaContainer in PD tests
**What this PR does / why we need it**:
kubectl exec occasionally fails to return a valid output string. It seems to be an issue with docker #34256. This PR retries the 'kubectl exec' call to workaround the issue. This should fix the flaky PD test issues.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes#28283
**Release note**:
NONE
Automatic merge from submit-queue (batch tested with PRs 42854, 43105, 43090)
Add a timeout to allow replacement pod to become ready
Hopefully fixes https://github.com/kubernetes/kubernetes/issues/37259
```
I0314 04:26:02.562] Mar 14 04:26:02.562: INFO: Pod my-hostname-net-1bgrj still exists
I0314 04:26:22.491] Mar 14 04:26:22.491: INFO: Waiting for pod my-hostname-net-1bgrj to disappear
I0314 04:26:22.496] Mar 14 04:26:22.495: INFO: Pod my-hostname-net-1bgrj no longer exists
I0314 04:26:22.496] STEP: verifying whether the pod from the unreachable node is recreated
I0314 04:26:22.498] Mar 14 04:26:22.498: INFO: Pod name my-hostname-net: Found 3 pods out of 3
I0314 04:26:22.499] STEP: ensuring each pod is running
I0314 04:26:22.499] STEP: trying to dial each unique pod
I0314 04:26:22.579] Mar 14 04:26:22.579: INFO: Controller my-hostname-net: Got expected result from replica 1 [my-hostname-net-5jrdb]: "my-hostname-net-5jrdb", 1 of 3 required successes so far
I0314 04:26:22.642] Mar 14 04:26:22.642: INFO: Controller my-hostname-net: Got expected result from replica 2 [my-hostname-net-mjf3c]: "my-hostname-net-mjf3c", 2 of 3 required successes so far
I0314 04:31:22.645] Mar 14 04:31:22.644: INFO: Controller my-hostname-net: Failed to Get from replica 3 [my-hostname-net-rf46s]: Get https://35.184.87.178/api/v1/namespaces/e2e-tests-network-partition-s5gqt/pods/my-hostname-net-rf46s/proxy/: context deadline exceeded
```
The issue appears to be that we have a race between the pod being "running + ready" and being accessible via the APIServer proxy.
cc @kow3ns @bowei @davidopp
Automatic merge from submit-queue (batch tested with PRs 42854, 43105, 43090)
Move e2e sched event predicates to new file.
**What this PR does / why we need it**:
Small e2e test refactor for scheduler. Moves scheduler event predicates out of opaque_resource.go for reuse elsewhere.
**Release note**:
```release-note
NONE
```
cc @kubernetes/sig-scheduling-pr-reviews @timothysc @bsalamat
When the upgrade test operates on Deployments in a pre-1.6 cluster
(i.e. during the Setup phase), it needs to use the v1.5 deployment/util
logic. In particular, the v1.5 logic does not filter children to only
those with a matching ControllerRef.
Automatic merge from submit-queue (batch tested with PRs 43018, 42713)
Log instead of fail on GLBCs tendency to leak resources
**What this PR does / why we need it**:
Stops upgrade tests from flaking because the GLBC does not cleanup all resources due to a race condition.
**Which issue this PR fixes**: fixes#38569
**Special notes for your reviewer**:
To be reviewed by @mml
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 42775, 42991, 42968, 43029)
Add e2e test for Deployment controllerRef orphaning and adoption
Follow up #42908
@enisoc @kubernetes/sig-apps-bugs @kargakis
Automatic merge from submit-queue (batch tested with PRs 42775, 42991, 42968, 43029)
Initial breakout of scheduling e2es to help assist in assignment and refactoring
**What this PR does / why we need it**:
This PR segregates the scheduling specific e2es to isolate the library which will assist both in refactoring but also auto-assignment of issues.
**Which issue this PR fixes**
xref: https://github.com/kubernetes/kubernetes/issues/42691#issuecomment-285563265
**Special notes for your reviewer**:
All this change does is shuffle code around and quarantine. Behavioral, and other cleanup changes, will be in follow on PRs. As of today, the e2es are a monolith and there is massive symbol pollution, this 1st step allows us to segregate the e2es and tease apart the dependency mess.
**Release note**:
```
NONE
```
/cc @kubernetes/sig-scheduling-pr-reviews @kubernetes/sig-testing-pr-reviews @marun @skriss
/cc @gmarek - same trick for load + density, etc.
Automatic merge from submit-queue (batch tested with PRs 43034, 43066)
Allow StatefulSet controller to PATCH Pods.
**What this PR does / why we need it**:
StatefulSet now needs the PATCH permission on Pods since it calls into ControllerRefManager to adopt and release. This adds the permission and the missing e2e test that should have caught this.
**Which issue this PR fixes**:
**Special notes for your reviewer**:
This is based on #42925.
**Release note**:
```release-note
```
cc @kubernetes/sig-apps-pr-reviews
Automatic merge from submit-queue (batch tested with PRs 42942, 42935)
[Bug] Handle container restarts and avoid using runtime pod cache while allocating GPUs
Fixes#42412
**Background**
Support for multiple GPUs is an experimental feature in v1.6.
Container restarts were handled incorrectly which resulted in stranding of GPUs
Kubelet is incorrectly using runtime cache to track running pods which can result in race conditions (as it did in other parts of kubelet). This can result in same GPU being assigned to multiple pods.
**What does this PR do**
This PR tracks assignment of GPUs to containers and returns pre-allocated GPUs instead of (incorrectly) allocating new GPUs.
GPU manager is updated to consume a list of active pods derived from apiserver cache instead of runtime cache.
Node e2e has been extended to validate this failure scenario.
**Risk**
Minimal/None since support for GPUs is an experimental feature that is turned off by default. The code is also isolated to GPU manager in kubelet.
**Workarounds**
In the absence of this PR, users can mitigate the original issue by setting `RestartPolicyNever` in their pods.
There is no workaround for the race condition caused by using the runtime cache though.
Hence it is worth including this fix in v1.6.0.
cc @jianzhangbjz @seelam @kubernetes/sig-node-pr-reviews
Replaces #42560
Automatic merge from submit-queue
Allow DaemonSet controller to PATCH pods, and add more steps and logs in DaemonSet pods adoption e2e test
DaemonSet pods adoption failed because DS controller aren't allowed to patch pods when claiming pods.
[Edit] This PR fixes#42908 by modifying RBAC to allow DaemonSet controllers to patch pods, as well as adding more logs and steps to the original e2e test to make debugging easier.
Tested locally with a local cluster and GCE cluster.
@kargakis @lukaszo @kubernetes/sig-apps-pr-reviews
Automatic merge from submit-queue (batch tested with PRs 42940, 42906, 42970, 42848)
Move node and event observer helpers to e2e/common
**What this PR does / why we need it**:
Moves existing test helper functions in OIR e2e tests to `test/e2e/common`. These functions wrap informers to help test writers to observe events instead of long-polling for status updates.
For usage examples, see `test/e2e/opaque_resource.go`.
cc @kubernetes/sig-scheduling-misc
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue
Add fabianofranz as approver for test/e2e/kubectl.go
Adding myself as approver for `kubectl` end-to-end tests.
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 41794, 42349, 42755, 42901, 42933)
Fixes kubectl skew test failure when using kubectl.sh
Fixes leftovers from https://github.com/kubernetes/kubernetes/pull/42737.
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 41794, 42349, 42755, 42901, 42933)
Fix DefaultTolerationSeconds admission plugin
DefaultTolerationSeconds is not working as expected. It is supposed to add default tolerations (for unreachable and notready conditions). but no pod was getting these toleration. And api server was throwing this error:
```
Mar 08 13:43:57 fedora25 hyperkube[32070]: E0308 13:43:57.769212 32070 admission.go:71] expected pod but got Pod
Mar 08 13:43:57 fedora25 hyperkube[32070]: E0308 13:43:57.789055 32070 admission.go:71] expected pod but got Pod
Mar 08 13:44:02 fedora25 hyperkube[32070]: E0308 13:44:02.006784 32070 admission.go:71] expected pod but got Pod
Mar 08 13:45:39 fedora25 hyperkube[32070]: E0308 13:45:39.754669 32070 admission.go:71] expected pod but got Pod
Mar 08 14:48:16 fedora25 hyperkube[32070]: E0308 14:48:16.673181 32070 admission.go:71] expected pod but got Pod
```
The reason for this error is that the input to admission plugins is internal api objects not versioned objects so expecting versioned object is incorrect. Due to this, no pod got desired tolerations and it always showed:
```
Tolerations: <none>
```
After this fix, the correct tolerations are being assigned to pods as follows:
```
Tolerations: node.alpha.kubernetes.io/notReady=:Exists:NoExecute for 300s
node.alpha.kubernetes.io/unreachable=:Exists:NoExecute for 300s
```
@davidopp @kevin-wangzefeng @kubernetes/sig-scheduling-pr-reviews @kubernetes/sig-scheduling-bugs @derekwaynecarr
Fixes https://github.com/kubernetes/kubernetes/issues/42716
Automatic merge from submit-queue (batch tested with PRs 41794, 42349, 42755, 42901, 42933)
AppArmor cluster upgrade test
Add a cluster upgrade test for AppArmor. I still need to test this (having some trouble with the cluster-upgrade tests), but wanted to start the review process.
/cc @dchen1107 @roberthbailey
Automatic merge from submit-queue (batch tested with PRs 41794, 42349, 42755, 42901, 42933)
[Federation][e2e] Add framework for upgrade test in federation
Adding framework for federation upgrade tests. please refer to #41791
cc @madhusudancs @nikhiljindal @kubernetes/sig-federation-pr-reviews
The Deployment controller was not propagating ReadyReplicas to underlying clusters causing these errors:
```
Error syncing cluster controller: Deployment.apps "federation-deployment" is invalid: status.availableReplicas: Invalid value: 5: cannot be greater than readyReplicas
```
This was caught in e2e testing and is a 1.6 regression for support that was added in #37959. Without this fix, users will be unable to scale up their deployments.
Automatic merge from submit-queue (batch tested with PRs 42608, 42444)
Return nil when deleting non-exist GCE PD
When gce cloud tries to delete a disk, if the disk could not be found
from the zones, the function should return nil error. This modified behavior is also consistent with AWS
Automatic merge from submit-queue (batch tested with PRs 36704, 42719)
Extend timeouts in taints test to account for slow Pod deletions
Fix#42685
Before merging this we need a consensus on what to do with slow Pod deletions.
Automatic merge from submit-queue
e2e test: Log container output on TestContainerOutput error
When a pod started with TestContainerOutput or TestContainerOutputRegexp
fails from unknown reason, we should log all output of all its containers
so we can analyze what went wrong.
This would help us to see what wrong in https://github.com/kubernetes/kubernetes/issues/40811 - a container is running there for 3 minutes and dies and we want to see what it did for these 3 minutes.
```release-note
NONE
```
When a pod started with TestContainerOutput or TestContainerOutputRegexp
fails from unknown reason, we should log all output of all its containers
so we can analyze what went wrong.
Automatic merge from submit-queue (batch tested with PRs 42734, 42745, 42758, 42814, 42694)
Implement automated downgrade testing.
Node version cannot be higher than the master version, so we must
switch the node version first. Also, we must use the upgrade script
from the appropriate version for GCE.
Automatic merge from submit-queue (batch tested with PRs 42734, 42745, 42758, 42814, 42694)
Create DefaultPodDeletionTimeout for e2e tests
In our e2e and e2e_node tests, we had a number of different timeouts for deletion.
Recent changes to the way deletion works (#41644, #41456) have resulted in some timeouts in e2e tests. #42661 was the most recent fix for this.
Most of these tests are not meant to test pod deletion latency, but rather just to clean up pods after a test is finished.
For this reason, we should change all these tests to use a standard, fairly high timeout for deletion.
cc @vishh @Random-Liu
Automatic merge from submit-queue
Don't wait for the final deletion of pod
The final deletion of the pod depends on kubelet and other components operating correctly. The purpose of this e2e test is verifying the clientset can handle deleteOptions correctly, so waiting for the deletionTimestamp and deletionGraceperiod get set is good enough.
In the long run, we should move this set of e2e tests to integration tests.
Fix#42724#42646
cc @marun
Node version cannot be higher than the master version, so we must
switch the node version first. Also, we must use the upgrade script
from the appropriate version for GCE.
Automatic merge from submit-queue
add debugging to the client watch test
Adds debugging information for https://github.com/kubernetes/kubernetes/issues/42724. I suspect that the watch is closing early, but I'd like proof before I consider things like retrying the list and doing another watch to observe the delete. I'm not even sure that would satisfy the test
It seems like a flaky way to build the test. Why wouldn't we delete non-gracefully?
@kubernetes/sig-api-machinery-misc @caesarxuchao
@wojtek-t saw you just hit this if you wanted to take a quick look at the debugging I added.
Automatic merge from submit-queue (batch tested with PRs 42728, 42278)
[Federation] Create integration test fixture for api
This PR factors a reusable fixture for the federation api server out of the existing integration test.
Targets #40705
cc: @kubernetes/sig-federation-pr-reviews
Automatic merge from submit-queue (batch tested with PRs 42762, 42739, 42425, 42778)
kubeadm: update docker version for CE and EE
**What this PR does / why we need it**: Update regex for docker version to also capture new CE and EE versions.
**Which issue this PR fixes**: fixes #https://github.com/kubernetes/kubeadm/issues/189
**Special notes for your reviewer**: /cc @jbeda @luxas
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 42762, 42739, 42425, 42778)
[Federation][e2e] Use correct default dns name in e2e-testing
After some kubefed changes, the environment variable did not get propagated and we defaulted back to 'federation' instead of 'e2e-federation'. This fixes ongoing service test issues in e2e.
Automatic merge from submit-queue
Add default storageclass tests
**What this PR does / why we need it**:
Adds test cases for using and disabling the default storageclass.
**Release note**:
NONE
Automatic merge from submit-queue (batch tested with PRs 42211, 38691, 42737, 42757, 42754)
[Federation] Generate a random nodePort for each service object in e2e tests.
We now run e2e tests in parallel in the CI environment and nodeports are a single available range of numbers for all the service objects, so they have to be unique for each service object.
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 42211, 38691, 42737, 42757, 42754)
Add more e2e tests for DaemonSet templateGeneration and pod adoption
Depends on #42173
@erictune @kargakis @lukaszo @kubernetes/sig-apps-pr-reviews
After some kubefed changes, the environment variable did not get propagated
and we defaulted back to 'federation' instead of
'e2e-federation'. This fixes ongoing service test issues in e2e.
This change allows validators to pass warnings as well as errors. This
was needed because of how support for docker 1.13+ and the new EE and CE
versions is currently being handled.
We now run e2e tests in parallel in the CI environment and nodeports are
a single available range of numbers for all the service objects, so they
have to be unique for each service object.
Automatic merge from submit-queue (batch tested with PRs 42652, 42681, 42708, 42730)
e2e: fix restarting the apiserver
The string used to match the image name of the apiserver (e.g., `gcr.io/google_containers/kube-apiserver:3be...`),
but this no longer works. Change the test to locate the kube-apiserver container by name.
Automatic merge from submit-queue
Use namespace from context
Fixes#42653
Updates rbac_test.go to submit objects without namespaces set, which matches how actual objects are submitted to the API.
Automatic merge from submit-queue (batch tested with PRs 42705, 42647)
[federation][e2e] Increase timeout waiting for service shard to appear
Most of recent federation service tests are failing with timeouts. although some times they do pass, giving kind of flaky nature. So increasing the timeout waiting for service shard to appear in federated cluster from 1 min to 5 mins.
cc @madhusudancs @kubernetes/sig-federation-bugs
Automatic merge from submit-queue
[Federation] Use and return created replicaset instead of the passed object.
Passed replicaset object doesn't contain object name, but has a prefix set in `GenerateName`. However, we need to operate on the object name later to uniquely identified the created object. So we need the created object with the name set by the API server.
```release-note
NONE
```
Passed replicaset object doesn't contain object name, but has a prefix
set in `GenerateName`. However, we need to operate on the object name
later to uniquely identified the created object. So we need the created
object with the name set by the API server.
Automatic merge from submit-queue
New e2e node test suite with memcg turned on
The flag --experimental-kernal-memcg-notification was initially added to allow disabling an eviction feature which used memcg notifications to make memory evictions more reactive.
As documented in #37853, memcg notifications increased the likelihood of encountering soft lockups, especially on CVM.
This feature would valuable to turn on, at least for GCI, since soft lockup issues were less prevalent on GCI and appeared (at the time) to be unrelated to memcg notifications.
In the interest of caution, I would like to monitor serial tests on GCI with --experimental-kernal-memcg-notification=true.
cc @vishh @Random-Liu @dchen1107 @kubernetes/sig-node-pr-reviews
Automatic merge from submit-queue (batch tested with PRs 42664, 42687)
[Fix Flaky Tests] E2e Node Flaky test suite runs serially
The [e2e Node Flaky Test Suite](https://k8s-testgrid.appspot.com/google-node#kubelet-flaky-gce-e2e&width=20) has been failing with strange errors.
This is because the tests in that suite are meant to be run serially, but are running in parallel, since that was left out of the config. This PR fixes this by changing the Flaky test suite to serial
cc @Random-Liu
Automatic merge from submit-queue
[Bug Fix] Garbage Collect Node e2e Failing
This node e2e test uses its own deletion timeout (1 minute) instead of the default (3 minutes).
#41644 likely increased time for deletion. See that PR for analysis on that.
There may be other problems with this test, but those are difficult to pick apart from hitting this low timeout.
This PR changes the Garbage Collector test to use the default timeout. This should allow us to discern if there are any actual bugs to fix.
cc @kubernetes/sig-node-bugs @calebamiles @derekwaynecarr
Automatic merge from submit-queue
Create "framework" per upgrade test
There were already a few tests just using the default framework
namespace instead of creating a new one. Also there are several
testing libraries that use the default framework's default namespace
as well. It's just easier this way.
Automatic merge from submit-queue
deflake TestPatch by waiting for cache
Fixes#39471
Rather than retry on conflicts for an unknown number of times, we have the resource version after the previous patch and we can use that to wait for a GET response that is at least as current as that resourceVersion.
@kubernetes/sig-api-machinery-pr-reviews @liggitt @wojtek-t
Automatic merge from submit-queue (batch tested with PRs 41890, 42593, 42633, 42626, 42609)
Pods pending due to insufficient OIR should get scheduled once sufficient OIR becomes available (e2e disabled).
#41870 was reverted because it introduced an e2e test flake. This is the same code with the e2e for OIR disabled again.
We can attempt to enable the e2e test cases one-by-one in follow-up PRs, but it would be preferable to get the main fix merged in time for 1.6 since OIR is broken on master (see #41861).
cc @timothysc
Automatic merge from submit-queue (batch tested with PRs 41890, 42593, 42633, 42626, 42609)
Remove everything that is not new from batch/v2alpha1
Fixes#37166.
@lavalamp you've asked for it
@erictune this is a prereq for moving CronJobs to beta. I initially planned to put all in one PR, but after I did that I figured out it'll be easier to review separately. ptal
@kubernetes/api-approvers @kubernetes/sig-api-machinery-pr-reviews ptal
Automatic merge from submit-queue (batch tested with PRs 42506, 42585, 42596, 42584)
[Federation][e2e] Add wait for namespaces to appear in federated cluster
Some federation e2e tests are failing because the namespace (created for every test case) does not seem to appear in one (or more) of federated clusters.
This may be a timing issue, so introducing a wait until the namespace is created in federated clusters.
cc @madhusudancs, @nikhiljindal, @kubernetes/sig-federation-bugs
Automatic merge from submit-queue
cgroup names created by kubelet should be lowercased
**What this PR does / why we need it**:
This PR modifies the kubelet to create cgroupfs names that are lowercased. This better aligns us with the naming convention for cgroups v2 and other cgroup managers in ecosystem (docker, systemd, etc.)
See: https://www.kernel.org/doc/Documentation/cgroup-v2.txt
"2-6-2. Avoid Name Collisions"
**Special notes for your reviewer**:
none
**Release note**:
```release-note
kubelet created cgroups follow lowercase naming conventions
```
Automatic merge from submit-queue (batch tested with PRs 42080, 41653, 42598, 42555)
Fix resource cleanup in ingress_utils.go within e2e/framework
**What this PR does / why we need it**:
The GLBC is failing to delete resources during the etcd rollback tests and the e2e cleanup is leaking them. After a short while, tests are failing to create new resources.
This PR addresses the e2e/framework's ability to delete GLBC-created resources and adds more logging.
**Which issue this PR fixes**:
Helps #38569 but does not completely close this flake
**Special notes for your reviewer**:
Resources were not being deleted because resource names were being truncated and then their ability to be deleted was determined by the entire cluster id existing in the name. Truncated names also have an extra '0' append to the end of their name (unknown origin). This PR tries to match on a common prefix.
Minor changes were made to improve log readability.
**Testing this PR**:
This was tested by running a master upgrade test and by adding a second forwarding-rule mid-run. This forwarding rule referenced the same url-map used by the first forwarding-rule created by the GLBC. Therefore, the GLBC will be able to delete the forwarding-rule but not anymore L7 resources. This second forwarding rule's name was nearly identical to the first forwarding rule so that the cleanup code will find it.
As you can see from the test run below, the cleanup code deleted all the resources that the GLBC could not.
```log
...
Mar 5 18:35:53.112: INFO: Monitoring glbc's cleanup of gce resources:
k8s-fws-e2e-tests-ingress-upgrsde-0px85-static-ip--5f38ac0e2420 (forwarding rule)
k8s-tps-e2e-tests-ingress-upgrade-0px85-static-ip--5f38ac0e2420 (target-https-proxy)
k8s-um-e2e-tests-ingress-upgrade-0px85-static-ip--5f38ac0e24260 (url-map)
k8s-be-32331--5f38ac0e2426f796 (backend-service)
k8s-be-32613--5f38ac0e2426f796 (backend-service)
k8s-be-32331--5f38ac0e2426f796 (http-health-check)
k8s-be-32613--5f38ac0e2426f796 (http-health-check)
k8s-ig--5f38ac0e2426f796 (instance-group)
k8s-ssl-e2e-tests-ingress-upgrade-0px85-static-ip--5f38ac0e2420 (ssl-certificate)
STEP: Performing final delete of any remaining resources
Mar 5 18:35:54.055: INFO: Deleting forwarding-rules: k8s-fws-e2e-tests-ingress-upgrsde-0px85-static-ip--5f38ac0e2420
Mar 5 18:36:06.945: INFO: Deleting target-https-proxies: k8s-tps-e2e-tests-ingress-upgrade-0px85-static-ip--5f38ac0e2420
Mar 5 18:36:14.301: INFO: Deleting url-map: k8s-um-e2e-tests-ingress-upgrade-0px85-static-ip--5f38ac0e24260
Mar 5 18:36:18.309: INFO: Deleting backed-service: k8s-be-32331--5f38ac0e2426f796
Mar 5 18:36:22.112: INFO: Deleting backed-service: k8s-be-32613--5f38ac0e2426f796
Mar 5 18:36:26.192: INFO: Deleting http-health-check: k8s-be-32331--5f38ac0e2426f796
Mar 5 18:36:29.846: INFO: Deleting http-health-check: k8s-be-32613--5f38ac0e2426f796
Mar 5 18:36:33.722: INFO: Deleting instance-group: k8s-ig--5f38ac0e2426f796
Mar 5 18:36:37.762: INFO: Deleting ssl-certificate: k8s-ssl-e2e-tests-ingress-upgrade-0px85-static-ip--5f38ac0e2420
STEP: No resources leaked.
Mar 5 18:36:46.441: INFO: Deleting addresses: e2e-tests-ingress-upgrade-0px85-static-ip
Mar 5 18:36:53.902: INFO: L7 controller failed to delete all cloud resources on time. timed out waiting for the condition
...
```
Automatic merge from submit-queue (batch tested with PRs 42080, 41653, 42598, 42555)
Revert "Pods pending due to insufficient OIR should get scheduled once sufficient OIR becomes available."
Reverts kubernetes/kubernetes#41870 for stopping bleeding edge: #42597
cc/ @ConnorDoyle @kubernetes/release-team
Connor if there is a pending pr to fix the issue, please point it out to me. We can close this one, otherwise, I would like to revert the pr first. You can resubmit the fix. Thanks!
Automatic merge from submit-queue (batch tested with PRs 42080, 41653, 42598, 42555)
StatefulSet: Respect ControllerRef
**What this PR does / why we need it**:
This is part of the completion of the [ControllerRef](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/controller-ref.md) proposal. It brings StatefulSet into full compliance with ControllerRef. See the individual commit messages for details.
**Which issue this PR fixes**:
Fixes#36859
**Special notes for your reviewer**:
**Release note**:
```release-note
StatefulSet now respects ControllerRef to avoid fighting over Pods. At the time of upgrade, **you must not have StatefulSets with selectors that overlap** with any other controllers (such as ReplicaSets), or else [ownership of Pods may change](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/controller-ref.md#upgrading).
```
cc @erictune @kubernetes/sig-apps-pr-reviews
The list functions in deployment/util are used outside the Deployment
controller itself. Therefore, they don't do actual adoption/orphaning.
However, they still need to avoid listing things that don't belong.
Automatic merge from submit-queue (batch tested with PRs 41826, 42405)
Fixed too long name in HPA e2e upgrade test.
Fixed too long name in HPA e2e upgrade test.
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 41826, 42405)
Add stubDomains and upstreamNameservers configuration to kube-dns
```release-note
Updates the dnsmasq cache/mux layer to be managed by dnsmasq-nanny.
dnsmasq-nanny manages dnsmasq based on values from the
kube-system:kube-dns configmap:
"stubDomains": {
"acme.local": ["1.2.3.4"]
},
is a map of domain to list of nameservers for the domain. This is used
to inject private DNS domains into the kube-dns namespace. In the above
example, any DNS requests for *.acme.local will be served by the
nameserver 1.2.3.4.
"upstreamNameservers": ["8.8.8.8", "8.8.4.4"]
is a list of upstreamNameservers to use, overriding the configuration
specified in /etc/resolv.conf.
```
There were already a few tests just using the default framework
namespace instead of creating a new one. Also there are several
testing libraries that use the default framework's default namespace
as well. It's just easier this way.
Automatic merge from submit-queue (batch tested with PRs 31783, 41988, 42535, 42572, 41870)
Pods pending due to insufficient OIR should get scheduled once sufficient OIR becomes available.
This appears to be a regression since v1.5.0 in scheduler behavior for opaque integer resources, reported in https://github.com/kubernetes/kubernetes/issues/41861.
- [X] Add failing e2e test to trigger the regression
- [x] Restore previous behavior (pods pending due to insufficient OIR get scheduled once sufficient OIR becomes available.)
Automatic merge from submit-queue (batch tested with PRs 31783, 41988, 42535, 42572, 41870)
update names for kube plugin initializer to avoid conflicts
Fixes#42581
Other API servers are likely to create admission plugin initializers and so the names we choose for our interfaces matter (they may want to run multiple initializers in the chain). This updates the names for the plugin initializers to be more specific. No other changes.
@ncdc
Automatic merge from submit-queue
Remove the kube-discovery binary from the tree
**What this PR does / why we need it**:
kube-discovery was a temporary solution to implementing proposal: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/bootstrap-discovery.md
However, this functionality is now gonna be implemented in the core for v1.6 and will fully replace kube-discovery:
- https://github.com/kubernetes/kubernetes/pull/36101
- https://github.com/kubernetes/kubernetes/pull/41281
- https://github.com/kubernetes/kubernetes/pull/41417
So due to that `kube-discovery` isn't used in any v1.6 code, it should be removed.
The image `gcr.io/google_containers/kube-discovery-${ARCH}:1.0` should and will continue to exist so kubeadm <= v1.5 continues to work.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
Remove cmd/kube-discovery from the tree since it's not necessary anymore
```
@jbeda @dgoodwin @mikedanese @dmmcquay @lukemarsden @errordeveloper @pires
Automatic merge from submit-queue
Add ProviderUid support to Federated Ingress
This PR (along with GLBC support [here](https://github.com/kubernetes/ingress/pull/278)) is a proposed fix for #39989. The Ingress controller uses a configMap reconciliation process to ensure that all underlying ingresses agree on a unique UID. This works for all of GLBC's resources except firewalls which need their own cluster-unique UID. This PR introduces a ProviderUid which is maintained and synchronized cross-cluster much like the UID. We chose to derive the ProviderUid from the cluster name (via md5 hash).
Testing here is augmented to guarantee that configMaps are adequately propagated prior to Ingress creation.
```release-note
Federated Ingress over GCE no longer requires separate firewall rules to be created for each cluster to circumvent flapping firewall health checks.
```
cc @madhusudancs @quinton-hoole
Automatic merge from submit-queue (batch tested with PRs 42456, 42457, 42414, 42480, 42370)
In DaemonSet e2e test, don't check nodes with NoSchedule taints
Fixes#42345
For example, master node has a ismaster:NoSchedule taint. We don't expect pods to be created there without toleration.
cc @marun @lukaszo @kargakis @yujuhong @Random-Liu @davidopp @kubernetes/sig-apps-pr-reviews
Automatic merge from submit-queue (batch tested with PRs 42456, 42457, 42414, 42480, 42370)
node e2e: apparmor test should fail instead of panicking
This doesn't fix#42420, but at least stop the test from panicking.
Automatic merge from submit-queue (batch tested with PRs 42456, 42457, 42414, 42480, 42370)
Update npd in kubemark since #42201 is merged.
Revert https://github.com/kubernetes/kubernetes/pull/41716.
#42201 has been merged, and #41713 is fixed. Now we could retry update npd in kubemark.
/cc @shyamjvs @wojtek-t @dchen1107
Automatic merge from submit-queue (batch tested with PRs 42369, 42375, 42397, 42435, 42455)
Add alsologtostderr flag to hollow node
@yujuhong @wojtek-t that should solve the kubemark log issue.
Automatic merge from submit-queue (batch tested with PRs 42369, 42375, 42397, 42435, 42455)
[Bug Fix]: Avoid evicting more pods than necessary by adding Timestamps for fsstats and ignoring stale stats
Continuation of #33121. Credit for most of this goes to @sjenning. I added volume fs timestamps.
**why is this a bug**
This PR attempts to fix part of https://github.com/kubernetes/kubernetes/issues/31362 which results in multiple pods getting evicted unnecessarily whenever the node runs into resource pressure. This PR reduces the chances of such disruptions by avoiding reacting to old/stale metrics.
Without this PR, kubernetes nodes under resource pressure will cause unnecessary disruptions to user workloads.
This PR will also help deflake a node e2e test suite.
The eviction manager currently avoids evicting pods if metrics are old. However, timestamp data is not available for filesystem data, and this causes lots of extra evictions.
See the [inode eviction test flakes](https://k8s-testgrid.appspot.com/google-node#kubelet-flaky-gce-e2e) for examples.
This should probably be treated as a bugfix, as it should help mitigate extra evictions.
cc: @kubernetes/sig-storage-pr-reviews @kubernetes/sig-node-pr-reviews @vishh @derekwaynecarr @sjenning
Automatic merge from submit-queue
Eviction Manager Enforces Allocatable Thresholds
This PR modifies the eviction manager to enforce node allocatable thresholds for memory as described in kubernetes/community#348.
This PR should be merged after #41234.
cc @kubernetes/sig-node-pr-reviews @kubernetes/sig-node-feature-requests @vishh
** Why is this a bug/regression**
Kubelet uses `oom_score_adj` to enforce QoS policies. But the `oom_score_adj` is based on overall memory requested, which means that a Burstable pod that requested a lot of memory can lead to OOM kills for Guaranteed pods, which violates QoS. Even worse, we have observed system daemons like kubelet or kube-proxy being killed by the OOM killer.
Without this PR, v1.6 will have node stability issues and regressions in an existing GA feature `out of Resource` handling.
Automatic merge from submit-queue (batch tested with PRs 42443, 38924, 42367, 42391, 42310)
Fix StatefulSet e2e flake
**What this PR does / why we need it**:
Fixes StatefulSet e2e flake by ensuring that the StatefulSet controller has observed the unreadiness of Pods prior to attempting to exercise scale functionality.
**Which issue this PR fixes**
fixes#41889
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 42443, 38924, 42367, 42391, 42310)
Cast system uptime to time.Duration to fix cross build.
Fixes https://github.com/kubernetes/kubernetes/issues/42441.
Cast system uptime to `time.Duration` to avoid different behavior on different architectures.
@sjenning @ixdy @ncdc
When gce cloud tries to delete a disk, if the disk could not be found
from the zones, the function should return nil error. This modified behavior is also consistent with AWS
Automatic merge from submit-queue
Critial pod test uses allocatable instead of capacity
This solves #42239.
When this test was first introduced, pods could request up to the capacity of the node.
With the addition of allocatable introduced in #41234, this is no longer the case, and pods can only use up to allocatable.
This should be included in 1.6, as it is a bug related to a 1.6 feature.
cc @vish @yujuhong
Automatic merge from submit-queue (batch tested with PRs 41306, 42187, 41666, 42275, 42266)
Bump test timeouts to make secret tests work in large clusters
The previous Get/Update pattern with no retry on resource version mismatch
would flake with the following error:
"the object has been modified; please apply your changes to the latest
version and try again"
gives each ingress object a cluster-unique Uid that can be
leveraged by ingress providers.
In the process, supplement the testing of configMap updates to
ensure that the updates are propagated prior to any ingress
object being created. Configmap key/vals for Uid and ProviderUid
must exist at time of Ingress creation.
Automatic merge from submit-queue (batch tested with PRs 41984, 41682, 41924, 41928)
Move node problem detector test into node e2e.
Move current NPD e2e test into node e2e.
In fact, current NPD e2e test is only a functionality test for NPD. It creates test NPD pod, sets test configuration, generates test logs and verifies test result.
It doesn't actually test the NPD really deployed in the cluster.
So it doesn't actually need to run in cluster e2e. Running it in node e2e will:
1) Make it easier to run the test.
2) Make it more light weight to introduce this as a pre/post submit test in NPD repo in the future.
Except this, I'm working on a cluster e2e to run some basic functionality test and benchmark test against the real NPD deployed in the cluster. Will send the PR later.
/cc @dchen1107 @kubernetes/node-problem-detector-reviewers
Automatic merge from submit-queue (batch tested with PRs 41984, 41682, 41924, 41928)
Add options to kubefed telling it to generate HTTP Basic and/or token credentials for the Federated API server
fixes#41265.
**Release notes**:
```release-note
Adds two options to kubefed, `-apiserver-enable-basic-auth` and `-apiserver-enable-token-auth`, which generate an HTTP Basic username/password and a token respectively for the Federated API server.
```
Automatic merge from submit-queue (batch tested with PRs 41984, 41682, 41924, 41928)
RC/RS: Fully Respect ControllerRef
**What this PR does / why we need it**:
This is part of the completion of the [ControllerRef](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/controller-ref.md) proposal. It brings ReplicaSet and ReplicationController into full compliance with ControllerRef. See the individual commit messages for details.
**Which issue this PR fixes**:
Although RC/RS had partially implemented ControllerRef, they didn't use it to determine which controller to sync, or to update expectations. This could lead to instability or controllers getting stuck.
Ref: https://github.com/kubernetes/kubernetes/issues/24433
**Special notes for your reviewer**:
**Release note**:
```release-note
```
cc @erictune @kubernetes/sig-apps-pr-reviews
Automatic merge from submit-queue (batch tested with PRs 42128, 42064, 42253, 42309, 42322)
Add storage.k8s.io/v1 API
This is combined version of reverted #40088 (first 4 commits) and #41646. The difference is that all controllers and tests use old `storage.k8s.io/v1beta1` API so in theory all tests can pass on GKE.
Release note:
```release-note
StorageClassName attribute has been added to PersistentVolume and PersistentVolumeClaim objects and should be used instead of annotation `volume.beta.kubernetes.io/storage-class`. The beta annotation is still working in this release, however it will be removed in a future release.
```
Automatic merge from submit-queue (batch tested with PRs 41980, 42192, 42223, 41822, 42048)
Adjust parameters of GCL cluster logging load tests
This PR increases the amount of logs produced in load tests to match the number of nodes and provide the predictable load of 100 KB/sec on each node.
Also this PR reduces in half amount of time, given for ingesting logs.
Automatic merge from submit-queue (batch tested with PRs 41980, 42192, 42223, 41822, 42048)
Take into account number of restarts in cluster logging tests
Before, in cluster logging tests, we only measured e2e number of lines delivered to the backend.
Also, befure https://github.com/kubernetes/kubernetes/pull/41795 was merged, from the k8s perspective, fluentd was always working properly, even if it's crashlooping inside.
Now we can detect whether fluentd is truly working properly, experiencing no, or almost no OOMs duing its operation.
Automatic merge from submit-queue (batch tested with PRs 41980, 42192, 42223, 41822, 42048)
Modified kubemark startup scripts to restore master on reboot
Fixes#41735
As discussed in the issue, modified the scripts to satisfy the conditions of restoring master env, running non-idempotent operations only for the first time and persist important data like pki/auth files on a PD.
Also attached `start-kubemark-master.sh` as startup-script metadata to master instance (on GCE) so that it is called automatically on each boot.
cc @kubernetes/sig-scalability-misc @wojtek-t @gmarek
Automatic merge from submit-queue (batch tested with PRs 41931, 39821, 41841, 42197, 42195)
Admission Controller: Add Pod Preset
Based off the proposal in https://github.com/kubernetes/community/pull/254
cc @pmorie @pwittrock
TODO:
- [ ] tests
**What this PR does / why we need it**: Implements the Pod Injection Policy admission controller
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
Added new Api `PodPreset` to enable defining cross-cutting injection of Volumes and Environment into Pods.
```
Automatic merge from submit-queue (batch tested with PRs 41644, 42020, 41753, 42206, 42212)
Ingress-glbc upgrade tests
Basically #41676 but with some fixes and added comments. @bprashanth has been away this week and it's desirable to have this in before code freeze.
export functions from pkg/api/validation
add settings API
add settings to pkg/registry
add settings api to pkg/master/master.go
add admission control plugin for pod preset
add new admission control plugin to kube-apiserver
add settings to import_known_versions.go
add settings to codegen
add validation tests
add settings to client generation
add protobufs generation for settings api
update linted packages
add settings to testapi
add settings install to clientset
add start of e2e
add pod preset plugin to config-test.sh
Signed-off-by: Jess Frazelle <acidburn@google.com>
Automatic merge from submit-queue
Extensible Userspace Proxy
This PR refactors the userspace proxy to allow for custom proxy socket implementations.
It changes the the ProxySocket interface to ensure that other packages can properly implement it (making sure all arguments are publicly exposed types, etc), and adds in a mechanism for an implementation to create an instance of the userspace proxy with a non-standard ProxySocket.
Custom ProxySockets are useful to inject additional logic into the actual proxying. For example, our idling proxier uses a custom proxy socket to hold connections and notify the cluster that idled scalable resources need to be woken up.
Also-Authored-By: Ben Bennett bbennett@redhat.com
Automatic merge from submit-queue
Add apps/v1beta1 deployments with new defaults
This pull introduces deployments under `apps/v1beta1` and fixes#23597 and #23304.
TODO:
* [x] - create new type `apps/v1beta1.Deployment`
* [x] - update kubectl (stop, scale)
* [ ] - ~~new `kubectl run` generator~~ - this will only duplicate half of generator code, I suggest replacing current to use new endpoint
* [ ] - ~~create extended tests~~ - I've added integration and cmd tests verifying new endpoints
* [ ] - ~~create `hack/test-update-storage-objects.sh`~~ - see above
This is currently blocked by https://github.com/kubernetes/kubernetes/pull/38071, due to conflicting name `v1beta1.Deployment`.
```release-note
Introduce apps/v1beta1.Deployments resource with modified defaults compared to extensions/v1beta1.Deployments.
```
@kargakis @mfojtik @kubernetes/sig-apps-misc
Automatic merge from submit-queue (batch tested with PRs 42316, 41618, 42201, 42113, 42191)
[Kubemark] Add init container in hollow node for setting inotify limit of node to 200
Fixes#41713
Along with adding the init container, I also changed the manifest to a yaml as otherwise the entire init container annotation would have to be in a single line (with escaped characters), as json doesn't allow multi-line strings.
cc @kubernetes/sig-scalability-misc @wojtek-t @gmarek @Random-Liu
Automatic merge from submit-queue
Extend experimental support to multiple Nvidia GPUs
Extended from #28216
```release-note
`--experimental-nvidia-gpus` flag is **replaced** by `Accelerators` alpha feature gate along with support for multiple Nvidia GPUs.
To use GPUs, pass `Accelerators=true` as part of `--feature-gates` flag.
Works only with Docker runtime.
```
1. Automated testing for this PR is not possible since creation of clusters with GPUs isn't supported yet in GCP.
1. To test this PR locally, use the node e2e.
```shell
TEST_ARGS='--feature-gates=DynamicKubeletConfig=true' FOCUS=GPU SKIP="" make test-e2e-node
```
TODO:
- [x] Run manual tests
- [x] Add node e2e
- [x] Add unit tests for GPU manager (< 100% coverage)
- [ ] Add unit tests in kubelet package
Automatic merge from submit-queue (batch tested with PRs 41597, 42185, 42075, 42178, 41705)
auto discovery CA for extension API servers
This is what the smaller pulls were leading to. Only the last commit is unique and I expect I'll still tweak some pod definitions, but this is where I was going.
@sttts @liggitt
Automatic merge from submit-queue (batch tested with PRs 42216, 42136, 42183, 42149, 36828)
[Federation] Remove federat{ed,ion} prefixes from e2e file names since they are all now scoped under the e2e_federation package.
This is purely cosmetic.
```release-note
NONE
```
cc @kubernetes/sig-federation-pr-reviews
Automatic merge from submit-queue (batch tested with PRs 42200, 39535, 41708, 41487, 41335)
Add support for statefulset spreading to the scheduler
**What this PR does / why we need it**:
The scheduler SelectorSpread priority funtion didn't have the code to spread pods of StatefulSets. This PR adds StatefulSets to the list of controllers that SelectorSpread supports.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes#41513
**Special notes for your reviewer**:
**Release note**:
```release-note
Add the support to the scheduler for spreading pods of StatefulSets.
```
Automatic merge from submit-queue (batch tested with PRs 35094, 42095, 42059, 42143, 41944)
Use chroot for containerized mounts
This PR is to modify the containerized mounter script to use chroot
instead of rkt fly. This will avoid the problem of possible large number
of mounts caused by rkt containers if they are not cleaned up.
Automatic merge from submit-queue (batch tested with PRs 35094, 42095, 42059, 42143, 41944)
add aggregation integration test
Wires up an integration test which runs a full kube-apiserver, the wardle server, and the kube-aggregator and creates the APIservice object for the wardle server. Without services and DNS the aggregator doesn't proxy, but it does ensure we don't have an obvious panic or bring up failure.
@sttts @ncdc
Automatic merge from submit-queue (batch tested with PRs 40746, 41699, 42108, 42174, 42093)
Avoid fake node names in user info
Node usernames should follow the format `system:node:<node-name>`,
but if we don't know the node name, it's worse to put a fake one in.
In the future, we plan to have a dedicated node authorizer, which would
start rejecting requests from a user with a bogus node name like this.
The right approach is to either mint correct credentials per node, or use node bootstrapping so it requests a correct client certificate itself.
Automatic merge from submit-queue
clean up generic apiserver options
Clean up generic apiserver options before we tag any levels. This makes them more in-line with "normal" api servers running on the platform.
Also remove dead example code.
@sttts
Automatic merge from submit-queue (batch tested with PRs 41937, 41151, 42092, 40269, 42135)
[Federation] ReplicaSet e2es should let the API server to generate the names to avoid collision while running tests in parallel.
cc @kubernetes/sig-federation-pr-reviews
Automatic merge from submit-queue (batch tested with PRs 41937, 41151, 42092, 40269, 42135)
Add a unit test for idempotent applys to the TPR entries.
The test in apply_test follows the general pattern of other tests.
We load from a file in test/fixtures and mock the API server in the
function closure in the HttpClient call.
The apply operation expects a last-modified-configuration annotation.
That is written verbatim in the test/fixture file.
References #40841
**What this PR does / why we need it**:
Adds one unit test for TPR's using applies.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
References:
https://github.com/kubernetes/features/issues/95https://github.com/kubernetes/kubernetes/issues/40841#issue-204769102
**Special notes for your reviewer**:
I am not super proud of the tpr-entry name.
But I feel like we need to call the two objects differently.
The one which has Kind:ThirdPartyResource
and the one has Kind:Foo.
Is the name "ThirdPartyResource" used interchangeably for both ? I used tpr-entry for the Kind:Foo object.
Also I !assume! this is testing an idempotent apply because the last-applied-configuration annotation is the same as the object itself.
This is the state I see in the logs of kubectl if I do a proper idempotent apply of a third party resource entry.
I guess I will know more once I start playing around with apply command that change TPR objects.
**Release note**:
```release-note
```
Automatic merge from submit-queue (batch tested with PRs 41234, 42186, 41615, 42028, 41788)
Additional upgrade e2e tests
**What this PR does / why we need it**: Add basic upgrade tests for DaemonSet and Job, and add "during upgrade" testing to ConfigMap test. Add a simple harness for testing upgrade tests.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*:
**Special notes for your reviewer**: continuation of #41296 @krousey please review, thanks
**Release note**: `NONE`
Automatic merge from submit-queue
Enforce Node Allocatable via cgroups
This PR enforces node allocatable across all pods using a top level cgroup as described in https://github.com/kubernetes/community/pull/348
This PR also provides an option to enforce `kubeReserved` and `systemReserved` on user specified cgroups.
This PR will by default make kubelet create top level cgroups even if `kubeReserved` and `systemReserved` is not specified and hence `Allocatable = Capacity`.
```release-note
New Kubelet flag `--enforce-node-allocatable` with a default value of `pods` is added which will make kubelet create a top level cgroup for all pods to enforce Node Allocatable. Optionally, `system-reserved` & `kube-reserved` values can also be specified separated by comma to enforce node allocatable on cgroups specified via `--system-reserved-cgroup` & `--kube-reserved-cgroup` respectively. Note the default value of the latter flags are "".
This feature requires a **Node Drain** prior to upgrade failing which pods will be restarted if possible or terminated if they have a `RestartNever` policy.
```
cc @kubernetes/sig-node-pr-reviews @kubernetes/sig-node-feature-requests
TODO:
- [x] Adjust effective Node Allocatable to subtract hard eviction thresholds
- [x] Add unit tests
- [x] Complete pending e2e tests
- [x] Manual testing
- [x] Get the proposal merged
@dashpole is working on adding support for evictions for enforcing Node allocatable more gracefully. That work will show up in a subsequent PR for v1.6
Automatic merge from submit-queue (batch tested with PRs 41205, 42196, 42068, 41588, 41271)
Implements an upgrade test for Job
**What this PR does / why we need it**:
This PR implements a cluster upgrade test for Job. Some functionality for Job testing has been moved from the e2e package to the framework package to facilitate code reuse between the e2e package and the upgrade package without introducing cyclic dependencies.
We need this PR to help automate the testing of cluster upgrades between versions.
**Release note**
```release-note
NONE
```
This changes the userspace proxy so that it cleans up its conntrack
settings when a service is removed (as the iptables proxy already
does). This could theoretically cause problems when a UDP service
as deleted and recreated quickly (with the same IP address). As
long as packets from the same UDP source IP and port were going to
the same destination IP and port, the the conntrack would apply and
the packets would be sent to the old destination.
This is astronomically unlikely if you did not specify the IP address
to use in the service, and even then, only happens with an "established"
UDP connection. However, in cases where a service could be "switched"
between using the iptables proxy and the userspace proxy, this case
becomes much more frequent.
Updates the dnsmasq cache/mux layer to be managed by dnsmasq-nanny.
dnsmasq-nanny manages dnsmasq based on values from the
kube-system:kube-dns configmap:
"stubDomains": {
"acme.local": ["1.2.3.4"]
},
is a map of domain to list of nameservers for the domain. This is used
to inject private DNS domains into the kube-dns namespace. In the above
example, any DNS requests for *.acme.local will be served by the
nameserver 1.2.3.4.
"upstreamNameservers": ["8.8.8.8", "8.8.4.4"]
is a list of upstreamNameservers to use, overriding the configuration
specified in /etc/resolv.conf.
The tests in apply_test follows the general pattern of other tests.
We load from a file in test/fixtures and mock the API server in the
function closure in the HttpClient call.
In PATCH request rount-tripper we check that the kubectl apply
implementation worked as expected.
References #40841