Automatic merge from submit-queue
optimize conditions of ServiceReplenishmentUpdateFunc to replenish service
Originally, the replenishQuota method ignored its third parameter (the object) even when callers passed it in, so the function was not as efficient or precise as it could be. This change uses the third parameter to get MatchResources, which is more exact. For example, if the old pod was quota tracked and the new one was not, replenishQuota only considers the usage resources of the old pod. If the third parameter is nil, the process is the same as before.
Automatic merge from submit-queue
refactor quota calculation for re-use
Refactors quota calculation to allow reuse. This will allow us to do "punch through" calculation inside of admission if a particular quota needs usage stats and allows downstream re-use by moving calculation closer to the evaluators and separating "needs calculation" logic from "do calculation".
@derekwaynecarr
Automatic merge from submit-queue
Rewrite service controller to apply best controller pattern
This PR is a long term solution for #21625:
We apply the same pattern as the replication controller to the service controller to avoid potential processing-order messes in the service controller. The change includes (the pattern is sketched in the code after this list):
1. Introduce an informer controller to watch service changes from kube-apiserver, so that every change to the same service is kept in serviceStore as the only element.
2. Put the name of the service to be processed onto a work queue.
3. When processing a service, always get its info from serviceStore to ensure the info is up to date.
4. Keep the retry mechanism: sleep for a certain interval and add the service back to the queue.
5. Remove the logic of reading the last service info from kube-apiserver before processing the LB info, as we trust the info from serviceStore.
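A minimal, self-contained sketch of the store-plus-work-queue pattern described above; the names (serviceStore, reconcile) and the plain channel standing in for the rate-limited work queue are illustrative, not the actual controller code.
```go
// Illustrative only: a stripped-down store + work queue, not the real controller.
package main

import (
	"fmt"
	"sync"
	"time"
)

type controller struct {
	mu           sync.Mutex
	serviceStore map[string]string // latest observed spec per service name
	queue        chan string       // names of services waiting to be processed
}

// onServiceEvent is what the informer's event handlers would do: record the
// latest state and enqueue only the service name, never the object itself.
func (c *controller) onServiceEvent(name, spec string) {
	c.mu.Lock()
	c.serviceStore[name] = spec
	c.mu.Unlock()
	c.queue <- name
}

// worker always re-reads the store so it acts on up-to-date data, and requeues
// the name after a delay when processing fails (the retry mechanism).
func (c *controller) worker() {
	for name := range c.queue {
		c.mu.Lock()
		spec := c.serviceStore[name]
		c.mu.Unlock()
		if err := c.reconcile(name, spec); err != nil {
			go func(n string) {
				time.Sleep(5 * time.Second)
				c.queue <- n
			}(name)
		}
	}
}

func (c *controller) reconcile(name, spec string) error {
	fmt.Printf("syncing load balancer for service %q (%s)\n", name, spec)
	return nil
}

func main() {
	c := &controller{serviceStore: map[string]string{}, queue: make(chan string, 100)}
	go c.worker()
	c.onServiceEvent("default/frontend", "port 80")
	time.Sleep(100 * time.Millisecond) // give the worker a moment before exiting
}
```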
Unit tests pass, and a manual test passed after I hardcoded the cloud provider to FakeCloud; however, I am not able to boot a k8s cluster with any available cloud provider, so the e2e test is not done.
Submitting this PR first for review and to trigger an e2e test.
Automatic merge from submit-queue
Fix deployment e2e test: waitDeploymentStatus should error when entering an invalid state
Follow up #28162
1. We should check that max unavailable and max surge aren't violated at all times in e2e tests (didn't check this in the deployment scaled rollout yet, but we should wait for it to become valid and then continue doing the check until it finishes)
2. Fix some minor bugs in e2e tests
@kubernetes/deployment
Automatic merge from submit-queue
Stabilize volume unit tests by waiting for exact state
Wait for specific final state instead of waiting for specific number of
operations in controller unit tests. The tests are more readable and will survive
random goroutine ordering (the PV and PVC controllers each have their own
goroutine).
@kubernetes/sig-storage
Automatic merge from submit-queue
add union registry for quota
Adds the ability to combine multiple quota registries together. Kube needs this for other types.
@derekwaynecarr
Automatic merge from submit-queue
Error out when any RS has more available pods than its spec replicas
Fixes #29559 (hopefully, if not the bot will open new issues for us)
@kubernetes/deployment
Automatic merge from submit-queue
pkg/various: plug leaky time.New{Timer,Ticker}s
According to the documentation for Go package time, `time.Ticker` and
`time.Timer` are uncollectable by garbage collector finalizers. They
leak until otherwise stopped. This commit ensures that all remaining
instances are stopped upon departure from their respective scopes.
Similar efforts were incrementally done in #29439 and #29114.
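For illustration, the general shape of the fix as a sketch rather than the patched code: stop every ticker, and every timer that may never fire, when leaving the scope that created it.
```go
package main

import (
	"fmt"
	"time"
)

// pollUntil demonstrates the pattern: tickers (and timers that may never fire)
// are stopped when the function returns, instead of living until process exit.
func pollUntil(done <-chan struct{}) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop() // without this, the ticker is never reclaimed

	timeout := time.NewTimer(10 * time.Second)
	defer timeout.Stop() // harmless if it already fired, frees it if it did not

	for {
		select {
		case <-ticker.C:
			fmt.Println("tick")
		case <-timeout.C:
			return
		case <-done:
			return
		}
	}
}

func main() {
	done := make(chan struct{})
	go pollUntil(done)
	time.Sleep(2500 * time.Millisecond)
	close(done)
}
```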
```release-note
* pkg/various: plugged various time.Ticker and time.Timer leaks.
```
Automatic merge from submit-queue
pkg/controller/node/nodecontroller: simplify mutex
Similar to #29598, we can rely on the zero-value construction behavior
to embed `sync.Mutex` into parent structs.
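A minimal sketch of the zero-value embedding this relies on; the struct and field names are illustrative.
```go
package main

import "sync"

// nodeStatuses needs no constructor for its lock: the embedded Mutex's zero
// value is an unlocked, ready-to-use mutex.
type nodeStatuses struct {
	sync.Mutex
	statuses map[string]string
}

func (s *nodeStatuses) set(node, status string) {
	s.Lock()
	defer s.Unlock()
	s.statuses[node] = status
}

func main() {
	s := &nodeStatuses{statuses: map[string]string{}}
	s.set("node-1", "Ready")
}
```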
/CC: @saad-ali
Automatic merge from submit-queue
make the resource prefix in etcd configurable for cohabitation
This looks big, but it's not as bad as it seems.
When you have different resources cohabiting, the resource name used for the etcd directory needs to be configurable. HPA in two different groups worked fine before. Now we're looking at something like RC<->RS. They normally store into two different etcd directories. This code allows them to be configured to store into the same location.
To maintain consistency across all resources, I allowed the `StorageFactory` to indicate which `ResourcePrefix` should be used inside `RESTOptions` which already contains storage information.
@lavalamp affects cohabitation.
@smarterclayton @mfojtik prereq for our rc<->rs and d<->dc story.
For non-attachable volumes, do not call GetVolumeName on the plugin and instead
generate a unique name based on the identity of the pod and the name of the volume
within the pod.
Automatic merge from submit-queue
Add an Azure CloudProvider Implementation
This PR adds `Azure` as a cloudprovider provider for Kubernetes. It specifically adds support for native pod networking (via Azure User Defined Routes) and L4 Load Balancing (via Azure Load Balancers).
I did have to add `clusterName` as a parameter to the `LoadBalancers` methods. This is because Azure only allows one "LoadBalancer" object per set of backend machines. This means a single "LoadBalancer" object must be shared across the cluster. The "LoadBalancer" is named via the `cluster-name` parameter passed to `kube-controller-manager` so as to enable multiple clusters per resource group if the user desires such a configuration.
There are few things that I'm a bit unsure about:
1. The implementation of the `Instances` interface. It's not extensively documented, it's not really clear what the different functions are used for, and my questions on the ML didn't get an answer.
2. Counter to the comments on the `LoadBalancers` Interface, I modify the `api.Service` object in `EnsureLoadBalancerDeleted`, but not with the intention of affecting Kube's view of the Service. I simply do it so that I can remove the `Port`s on the `Service` object and then re-use my reconciliation logic that can handle removing stale/deleted Ports.
3. The logging is a bit verbose. I'm looking for guidance on the appropriate log level to use for the chattier bits.
Due to the (current) lack of Instance Metadata Service and lack of Virtual Machine Identity in Azure, the user is required to do a few things to opt-in to this provider. These things are called-out as they are in contrast to AWS/GCE:
1. The user must provision an Azure Active Directory ServicePrincipal with `Contributor` level access to the resource group that the cluster is deployed in. This creation process is documented [by Hashicorp](https://www.packer.io/docs/builders/azure-setup.html) or [on the MSDN Blog](https://blogs.msdn.microsoft.com/arsen/2016/05/11/how-to-create-and-test-azure-service-principal-using-azure-cli/).
2. The user must place a JSON file somewhere on each Node that conforms to the `AzureConfig` struct defined in `azure.go`. (This is automatically done in the Azure flavor of [Kubernetes-Anywhere](https://github.com/kubernetes/kubernetes-anywhere).)
3. The user must specify `--cloud-config=/path/to/azure.json` as an option to `kube-apiserver` and `kube-controller-manager` similarly to how the user would need to pass `--cloud-provider=azure`.
I've been running approximately this code for a month and a half. I only encountered one bug which has since been fixed and covered by a unit test. I've just deployed a new cluster (and a Type=LoadBalancer nginx Service) using this code (via `kubernetes-anywhere`) and have posted [the `kube-controller-manager` logs](https://gist.github.com/colemickens/1bf6a26e7ef9484a72a30b1fcf9fc3cb) for anyone who is interested in seeing the logs of the logic.
If you're interested in this PR, you can use the instructions in my [`azure-kubernetes-demo` repository](https://github.com/colemickens/azure-kubernetes-demo) to deploy a cluster with minimal effort via [`kubernetes-anywhere`](https://github.com/kubernetes/kubernetes-anywhere). (There is currently [a pending PR in `kubernetes-anywhere` that is needed](https://github.com/kubernetes/kubernetes-anywhere/pull/172) in conjunction with this PR). I also have a pre-built `hyperkube` image: `docker.io/colemickens/hyperkube-amd64:v1.4.0-alpha.0-azure`, which will be kept in sync with the branch this PR stems from.
I'm hoping this can land in the Kubernetes 1.4 timeframe.
CC (potential code reviewers from Azure): @ahmetalpbalkan @brendandixon @paulmey
CC (other interested Azure folk): @brendandburns @johngossman @anandramakrishna @jmspring @jimzim
CC (others who've expressed interest): @codefx9 @edevil @thockin @rootfs
According to the documentation for Go package time, `time.Ticker` and
`time.Timer` are uncollectable by garbage collector finalizers. They
leak until otherwise stopped. This commit ensures that all remaining
instances are stopped upon departure from their respective scopes.
Automatic merge from submit-queue
Break the loop when the object is found in removeOrphanFinalizer()
Break the loop when the object is found in removeOrphanFinalizer()
Automatic merge from submit-queue
Allow shareable resources for admission control plugins.
Changes allow admission control plugins to share resources. This is done via a new PluginInitialization structure. The structure can be extended with other resources; for now it holds a shared informer for the namespace plugins (NamespaceLifecycle, NamespaceAutoProvisioning, NamespaceExists).
If a plugin needs some kind of shared resource, e.g. a client, the client should be added to PluginInitializer and the Wants methods implemented for every plugin that will use it.
Automatic merge from submit-queue
controller/service: minor cleanup
1. always handle short case first for if statement
2. do not capitalize error message
3. put the mutex before the fields it protects
4. prefer switch over if elseif.
Automatic merge from submit-queue
use a separate queue for initial quota calculation
When the quota controller gets backed up on resyncs, it can take a long time to observe the first usage stats which are needed by the admission plugin. This creates a second queue to prioritize the initial calculation.
Automatic merge from submit-queue
Fix "PVC Volume not detached if pod deleted via namespace deletion" issue
Fixes #29051: "PVC Volume not detached if pod deleted via namespace deletion"
This PR:
* Fixes a bug in `desired_state_of_the_world_populator.go` to check the value of `exists` returned by the `podInformer` so that it can delete pods even if the delete event is missed (or fails).
* Reduces the desired state of the world populator's sleep period from 5 min to 1 min (reducing the amount of time a volume would remain attached if a volume delete event is missed or fails).
Automatic merge from submit-queue
Allow mounts to run in parallel for non-attachable volumes
This PR:
* Fixes https://github.com/kubernetes/kubernetes/issues/28616
* Enables mount volume operations to run in parallel for non-attachable volume plugins.
* Enables unmount volume operations to run in parallel for all volume plugins.
* Renames `GoRoutineMap` to `GoroutineMap`, resolving a long outstanding request from @thockin: `"Goroutine" is a noun`
When a new rollout with a different size than the previous size of the
deployment is initiated then only the new replica set will notice the
new size. Old replica sets are not updated by the rollout path.
Allow mount volume operations to run in parallel for non-attachable
volume plugins.
Allow unmount volume operations to run in parallel for all volume
plugins.
Automatic merge from submit-queue
Certificate signing controller for TLS bootstrap (alpha)
The controller handles generating and signing certificates when a CertificateSigningRequest has the "Approved" condition. Uses cfssl to support a wide set of possible keys and algorithms. Depends on PR #25562, only the last two commits are relevant to this PR.
cc @mikedanese
Automatic merge from submit-queue
Improve quota controller performance by eliminating unneeded list calls
Previously, when syncing quota usage, we asked each registered `Evaluator` to determine the usage it knows to track associated with a `GroupKind` even if that particular `GroupKind` had no associated resources under quota.
This fix makes it so that when we sync a quota that has only `Pod`-related compute resources, we do not also calculate the usage stats for things like `ConfigMap`, `Secret`, etc. per quota.
This should be a significant performance gain when running large numbers of `Namespace`s, each with a `ResourceQuota` that tracks a subset of resources.
/cc @deads2k @kubernetes/rh-cluster-infra
Automatic merge from submit-queue
[GarbageCollector] Let the RC manager set/remove ControllerRef
What's done:
* RC manager sets Controller Ref when creating new pods
* RC manager sets Controller Ref when adopting pods with matching labels but having no controller
* RC manager clears Controller Ref when pod labels change
* RC manager clears pods' Controller Ref when rc's selector changes
* RC manager stops adoption/creating/deleting pods when rc's DeletionTimestamp is set
* RC manager bumps up ObservedGeneration: The [original code](https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/replication/replication_controller_utils.go#L36) will do this.
* Integration tests:
* verifies that changing RC's selector or Pod's Labels triggers adoption/abandoning
* e2e tests (separated to #27151):
* verifies GC deletes the pods created by RC if DeleteOptions.OrphanDependents=false, and orphans the pods if DeleteOptions.OrphanDependents=true.
TODO:
- [x] we need to be able to select Pods that have a specific ControllerRef. Then each time we sync the RC, we will iterate through all the Pods that have a controllerRef pointing to the RC, even if the labels of the Pod no longer match the selector of the RC. This will prevent a Pod from getting stuck with a stale controllerRef, which could be caused by the race between the abandoner (the goroutine that removes the controllerRef) and the worker (the goroutine that adds the controllerRef to pods).
- [ ] use controllerRef instead of calling `getPodController`. This might be carried out by the control-plane team.
- [ ] according to the controllerRef proposal (#25256): "For debugging purposes we want to add an adoptionTime annotation prefixed with kubernetes.io/ which will keep the time of last controller ownership transfer." This might be carried out by the control-plane team.
cc @lavalamp @gmarek
Automatic merge from submit-queue
[garbage collector] add e2e test
This PR also includes some changes to plumb controller-manager's `--enable_garbage_collector` from the environment variable.
The e2e test will not be run by the core suite because it's marked `[Feature:GarbageCollector]`.
The corresponding jenkins job configuration PR is https://github.com/kubernetes/test-infra/pull/132.
Automatic merge from submit-queue
controller: wait for synced old replica sets on Recreate
Partially fixes https://github.com/kubernetes/kubernetes/issues/27362
Any other work on it should be handled in the replica set level (and/or kubelet if it's required)
@kubernetes/deployment PTAL
Automatic merge from submit-queue
Controllers don't take any actions when being deleted.
I started doing it for other controllers but it's not always clear to me how it should work. I'll be adding other ones as separate commits to this PR.
cc @caesarxuchao @lavalamp
Automatic merge from submit-queue
Add hooks for cluster health detection
Separates out the function that decides whether a zone is healthy. First real commit toward preventing massive pod eviction.
Ref. #28832
cc @davidopp
Automatic merge from submit-queue
Change storeToNodeConditionLister to return []*api.Node instead of api.NodeList for performance
Currently, the copies made while copying/creating api.NodeList are a significant part of the scheduler profile, and a bunch of them are made in places that are not parallelizable.
Ref #28590
Automatic merge from submit-queue
controller/volume: simplify sync logic in syncUnboundClaim
Remove all unnecessary branching logic. No actual logic changes. Code is more readable now.
Wait for specific final state instead of waiting for specific number of
operations in volume unit tests. The tests are more readable and will survive
random goroutine ordering (the PV and PVC controllers each have their own
goroutine).
Automatic merge from submit-queue
allow lock acquisition injection for quota admission
Allows for custom lock acquisition when composing the quota admission controller.
@derekwaynecarr I'm still experimenting to make sure this satisfies the need downstream, but looking for agreement in principle
Changes:
* moved waiting for synced caches before starting any work
* refactored worker() to really quit on quit
* changed queue to a ratelimiting queue and added retries on errors
* deep-copy deployments before mutating - we still need to deep-copy
replica sets and pods
Automatic merge from submit-queue
Move deployment functions to deployment/util.go not widely used
If the function is not used in multiple areas, move it to deployment/util.go
fixes #26750
Automatic merge from submit-queue
Some scheduler optimizations
Ref #28590
This PR doesn't do anything fancy - it is just reducing amount of memory allocations in scheduler, which in turn significantly speeds up scheduler.
Automatic merge from submit-queue
allow handler to join after the informer has started
This allows an event handler to join after a SharedInformer has started. It can't add any indexes, but it can add its reaction functions.
This works by
1. stopping the flow of events from the reflector (thus stopping updates to our store)
1. registering the new handler
1. sending synthetic "add" events to the new handler only
1. unblocking the flow of events
It would be possible to
1. block
1. list
1. add recorder
1. unblock
1. play list to as-yet unregistered handler
1. block
1. remove recorder
1. play recording
1. add new handler
1. unblock
But that is considerably more complicated. I'd rather not start there since this ought to be the exception rather than the rule.
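A minimal, self-contained sketch of the implemented approach (pause deliveries, replay the store as synthetic adds to only the new handler, resume); the types here are illustrative stand-ins, not the real SharedInformer internals.
```go
package main

import (
	"fmt"
	"sync"
)

type handler func(event string)

type sharedInformer struct {
	mu       sync.Mutex // blocks event distribution while a handler joins
	store    []string   // objects already observed by the informer
	handlers []handler
}

// distribute is called for every event flowing from the reflector.
func (s *sharedInformer) distribute(event string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.store = append(s.store, event)
	for _, h := range s.handlers {
		h(event)
	}
}

// addHandlerAfterStart pauses distribution, replays the current store as
// synthetic "add" events to the new handler only, then resumes.
func (s *sharedInformer) addHandlerAfterStart(h handler) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, obj := range s.store {
		h(obj) // synthetic add; existing handlers do not see these again
	}
	s.handlers = append(s.handlers, h)
}

func main() {
	inf := &sharedInformer{}
	inf.distribute("add pod-a")
	inf.addHandlerAfterStart(func(e string) { fmt.Println("late handler saw:", e) })
	inf.distribute("add pod-b")
}
```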
@wojtek-t who requested this power in the initial review
@smarterclayton @liggitt I think this resolves our all-in-one ordering problem.
@hongchaodeng since this came up on the call
* rolling.go (has all the logic for rolling deployments)
* recreate.go (has all the logic for recreate deployments)
* sync.go (has all the logic for getting and scaling replica sets)
* rollback.go (has all the logic for rolling back a deployment)
* util.go (contains all the utilities used throughout the controller)
Leave back at deployment_controller.go all the necessary bits for
creating, setting up, and running the controller loop.
Also add package documentation.
Automatic merge from submit-queue
Use `CreatedByAnnotation` constant
A nit but didn't want the strings to get out of sync.
Signed-off-by: Doug Davis <dug@us.ibm.com>
Automatic merge from submit-queue
Track object modifications in fake clientset
Fake clientset is used by unit tests extensively but it has some
shortcomings:
- no filtering on namespace and name: tests that want to test objects in
multiple namespaces end up getting all objects from this clientset,
as it doesn't perform any filtering based on name and namespace;
- updates and deletes don't modify the clientset state, so some tests
can get unexpected results if they modify/delete objects using the
clientset;
- it's possible to insert multiple objects with the same
kind/name/namespace; this leads to confusing behavior, as retrieval is
based on the insertion order but anchors on the last added object as
long as no more objects are added.
This change modifies the core.ObjectRetriever implementation to track object
adds, updates and deletes.
Some unit tests were depending on the previous (and somewhat incorrect)
behavior. These are fixed in the following few commits.
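A short sketch of the behavior this enables. It is written against a present-day client-go fake package purely for illustration; the import paths and method signatures are assumptions about current client-go, not the code touched by this change.
```go
package main

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

func main() {
	// Illustrative use of a tracking fake clientset (modern client-go API assumed).
	cs := fake.NewSimpleClientset()
	ctx := context.Background()

	pod := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "web", Namespace: "team-a"}}
	if _, err := cs.CoreV1().Pods("team-a").Create(ctx, pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}

	// Deletes are tracked: the object is really gone afterwards.
	_ = cs.CoreV1().Pods("team-a").Delete(ctx, "web", metav1.DeleteOptions{})

	// Namespace filtering: listing another namespace returns nothing.
	pods, _ := cs.CoreV1().Pods("team-b").List(ctx, metav1.ListOptions{})
	fmt.Println("pods in team-b:", len(pods.Items))
}
```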
Automatic merge from submit-queue
Kubelet should mark VolumeInUse before checking if it is Attached
Kubelet should mark VolumeInUse before checking if it is Attached.
Controller should fetch fresh copy of node object before detach instead of relying on node informer cache.
Fixes #27836
Ensure that the kubelet marks VolumeInUse before checking if it is Attached.
Also ensure that the attach/detach controller always fetches a fresh
copy of the node object before detach (instead of relying on the node
informer cache).
Using the new fake clientset registry exposes the actual flow of pending
namespace finalization: get the namespace, create the finalizer, list pods and
delete the namespace if there are no pods.
Since fake clientset now correctly tracks objects created by deployment
controller, it triggers different controller behavior: controller only
creates replica set once and updates deployment once.
The fake clientset no longer needs to be prepopulated with records: keeping
them in leads to name conflicts on create. Also, since the fake
clientset now respects namespaces, we need to populate them correctly.
Automatic merge from submit-queue
Convert service account token controller to use a work queue
Converts the service account token controller to use a work queue. This allows parallelization of token generation (useful when there are several simultaneous namespaces or service accounts being created). It also lets us requeue failures to be retried sooner than the next sync period (which can be very long).
Fixes an issue seen when a namespace is created with secrets quotaed, and the token controller tries to create a token secret prior to the quota status having been initialized. In that case, the secret is rejected at admission, and the token controller wasn't retrying until the resync period.
Automatic merge from submit-queue
Fix initialization of volume controller caches.
Fix `PersistentVolumeController.initializeCaches()` to pass pointers to volume or claim to `storeObjectUpdate()` and add extra functions to enforce that the right types are checked in the future.
Fixes #28076
Fix PersistentVolumeController.initializeCaches() to pass pointers to volume
or claim to storeObjectUpdate() and add extra functions to enforce that the
right types are checked in the future.
Fixes #28076
Automatic merge from submit-queue
[client-gen]Add Patch to clientset
* add the Patch() method to the clientset.
* I have to rename the existing Patch() method of `Event` to PatchWithEventNamespace() to avoid overriding.
* some minor changes to the fake Patch action.
cc @Random-Liu since he asked for the method
@kubernetes/sig-api-machinery
ref #26580
```release-note
Add the Patch method to the generated clientset.
```
Automatic merge from submit-queue
Fix startup type error in initializeCaches
The following error was getting logged:
PersistentVolumeController can't initialize caches, expected list of volumes, got:
&{TypeMeta:{Kind: APIVersion:} ListMeta:{SelfLink:/api/v1/persistentvolumes ResourceVersion:11} Items:[]}
The tests make extensive use of NewFakeControllerSource which uses api.List
instead of api.PersistentVolumeList. So use reflect to help iterate over the
items then assert the item type.
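A self-contained sketch of the reflect-based iteration described above, with simplified stand-in types instead of the real api.List / api.PersistentVolumeList:
```go
package main

import (
	"fmt"
	"reflect"
)

type PersistentVolume struct{ Name string }

// List is a stand-in for api.List: Items holds arbitrary objects.
type List struct{ Items []interface{} }

// forEachVolume walks obj.Items via reflection and asserts each item's type,
// so it works for any list-shaped object that carries an Items field.
func forEachVolume(obj interface{}, fn func(*PersistentVolume)) error {
	v := reflect.ValueOf(obj)
	if v.Kind() != reflect.Ptr || v.Elem().Kind() != reflect.Struct {
		return fmt.Errorf("expected pointer to a list struct, got %T", obj)
	}
	items := v.Elem().FieldByName("Items")
	if !items.IsValid() {
		return fmt.Errorf("expected a list with an Items field, got %T", obj)
	}
	for i := 0; i < items.Len(); i++ {
		pv, ok := items.Index(i).Interface().(*PersistentVolume)
		if !ok {
			return fmt.Errorf("expected *PersistentVolume, got %T", items.Index(i).Interface())
		}
		fn(pv)
	}
	return nil
}

func main() {
	list := &List{Items: []interface{}{&PersistentVolume{Name: "pv-1"}}}
	_ = forEachVolume(list, func(pv *PersistentVolume) { fmt.Println("found", pv.Name) })
}
```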
fixes #27757
Automatic merge from submit-queue
daemon/controller.go: refactor worker
1. function names are better as a verb or verb+noun
2. remove unnecessary func call
Automatic merge from submit-queue
Proportionally scale paused and rolling deployments
Enable paused and rolling deployments to be proportionally scaled.
Also have cleanup policy work for paused deployments.
Fixes #20853 Fixes #20966 Fixes #20754
@bgrant0607 @janetkuo @ironcladlou @nikhiljindal
The following error was getting logged:
PersistentVolumeController can't initialize caches, expected list of volumes, got:
&{TypeMeta:{Kind: APIVersion:} ListMeta:{SelfLink:/api/v1/persistentvolumes ResourceVersion:11} Items:[]}
Automatic merge from submit-queue
let dynamic client handle non-registered ListOptions
And register v1.ListOptions in the policy group.
Fix #27622
@lavalamp @smarterclayton @krousey
Automatic merge from submit-queue
Prevent detach before node status update
The PR prevents the attach/detach controller from starting a detach operation before updating the node status (to remove the volume from the list of attached volumes).
Fixes https://github.com/kubernetes/kubernetes/issues/27836
Automatic merge from submit-queue
AWS/GCE: Spread PetSet volume creation across zones, create GCE volumes in non-master zones
Long term we plan on integrating this into the scheduler, but in the
short term we use the volume name to place it onto a zone.
We hash the volume name so we don't bias to the first few zones.
If the volume name "looks like" a PetSet volume name (ending with
-<number>) then we use the number as an offset. In that case we hash
the base name.
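A sketch of the zone-selection idea under stated assumptions: the hash function, name parsing, and zone-list handling are illustrative, not the cloud provider's actual code.
```go
package main

import (
	"fmt"
	"hash/fnv"
	"regexp"
	"sort"
	"strconv"
)

var petSetSuffix = regexp.MustCompile(`^(.*)-(\d+)$`)

// chooseZone hashes the volume name so the first zones are not favored.
// For PetSet-style names ("data-db-3"), the trailing ordinal becomes an
// offset over the hash of the base name, spreading members across zones.
func chooseZone(volumeName string, zones []string) string {
	sort.Strings(zones) // stable ordering regardless of input order

	name, offset := volumeName, uint32(0)
	if m := petSetSuffix.FindStringSubmatch(volumeName); m != nil {
		name = m[1]
		n, _ := strconv.Atoi(m[2])
		offset = uint32(n)
	}

	h := fnv.New32a()
	h.Write([]byte(name))
	return zones[(h.Sum32()+offset)%uint32(len(zones))]
}

func main() {
	zones := []string{"us-central1-a", "us-central1-b", "us-central1-c"}
	for _, v := range []string{"data-db-0", "data-db-1", "data-db-2"} {
		fmt.Println(v, "->", chooseZone(v, zones))
	}
}
```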
This change implements the desiredStateOfWorld populator to sync up with
the pod informer. It periodically checks each pod in the
desiredStateOfWorld and verifies whether it is still in the pod informer
cache. If it is not, it removes it from the desiredStateOfWorld.
Automatic merge from submit-queue
Deployment controller's cleanupUnhealthyReplicas should respect minReadySeconds
```release-note
Fixed an issue that Deployment may be scaled down further than allowed by maxUnavailable when minReadySeconds is set.
```
Fixes #26834
Detected by a flake in deployment rollover e2e test (the only test that specifies `minReadySeconds`).
cc @kubernetes/deployment @pwittrock
cc @mqliang who first added `cleanupUnhealthyReplicas` in deployment controller
Automatic merge from submit-queue
Kubelet Volume Manager Wait For Attach Detach Controller and Backoff on Error
* Closes https://github.com/kubernetes/kubernetes/issues/27483
* Modified Attach/Detach controller to report `Node.Status.AttachedVolumes` on successful attach (unique volume name along with device path).
* Modified the Kubelet Volume Manager to wait for the Attach/Detach controller to report success before proceeding with attach.
* Closes https://github.com/kubernetes/kubernetes/issues/27492
* Implemented an exponential backoff mechanism for the volume manager and attach/detach controller to prevent operations (attach/detach/mount/unmount/wait for controller attach/etc.) from executing back to back unchecked.
* Closes https://github.com/kubernetes/kubernetes/issues/26679
* Modified the volume `Attacher.WaitForAttach()` methods to use the device path reported by the Attach/Detach controller in `Node.Status.AttachedVolumes` instead of calling out to cloud providers.
Modify attach/detach controller to keep track of volumes to report
attached in Node VolumeToAttach status.
Modify kubelet volume manager to wait for volume to show up in Node
VolumeToAttach status.
Implement exponential backoff for errors in volume manager and attach
detach controller
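A minimal sketch of the per-operation exponential backoff idea; the durations, names, and structure are illustrative.
```go
package main

import (
	"fmt"
	"time"
)

// backoff tracks, per operation name, when it last failed and how long to
// wait before allowing the next attempt.
type backoff struct {
	lastError time.Time
	delay     time.Duration
}

type backoffMap struct {
	entries  map[string]*backoff
	initial  time.Duration
	maxDelay time.Duration
}

func newBackoffMap() *backoffMap {
	return &backoffMap{entries: map[string]*backoff{}, initial: 500 * time.Millisecond, maxDelay: 2 * time.Minute}
}

// allowed reports whether the operation may be retried now.
func (m *backoffMap) allowed(op string) bool {
	b, ok := m.entries[op]
	return !ok || time.Since(b.lastError) >= b.delay
}

// recordError doubles the delay (capped) so repeated failures back off.
func (m *backoffMap) recordError(op string) {
	b, ok := m.entries[op]
	if !ok {
		b = &backoff{delay: m.initial}
		m.entries[op] = b
	} else {
		b.delay *= 2
		if b.delay > m.maxDelay {
			b.delay = m.maxDelay
		}
	}
	b.lastError = time.Now()
}

func main() {
	m := newBackoffMap()
	fmt.Println(m.allowed("attach volume vol-1 to node-a")) // true
	m.recordError("attach volume vol-1 to node-a")
	fmt.Println(m.allowed("attach volume vol-1 to node-a")) // false until the delay passes
}
```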
Automatic merge from submit-queue
Allow disabling of dynamic provisioning
Allow administrators to opt-out of dynamic provisioning. Provisioning is still on by default, which is the current behavior.
Per a conversation with @jsafrane, a boolean toggle was added and plumbed through into the controller. Deliberate disabling will simply return nil from `provisionClaim` whereas a misconfigured provisioner will continue on and generate error events for the PVC.
@kubernetes/rh-storage @saad-ali @thockin @abhgupta
Automatic merge from submit-queue
Fill PV.Status.Message with deleter/recycler errors.
Instead of an empty `Message`, `kubectl describe pv` now shows:
```
Name: nfs
Labels: <none>
Status: Failed
Claim: default/nfs
Reclaim Policy: Recycle
Access Modes: RWX
Capacity: 1Mi
Message: Recycler failed: Pod was active on the node longer than specified deadline
Source:
Type: NFS (an NFS mount that lasts the lifetime of a pod)
Server: 10.999.999.999
Path: /
ReadOnly: false
```
This is actually a regression since 1.2
@kubernetes/sig-storage
Long term we plan on integrating this into the scheduler, but in the
short term we use the volume name to place it onto a zone.
We hash the volume name so we don't bias to the first few zones.
If the volume name "looks like" a PetSet volume name (ending with
-<number>) then we use the number as an offset. In that case we hash
the base name.
Fixes #27256
Automatic merge from submit-queue
fix updatePod() of RS and RC controllers
Fix updatePod of replication controller manager and replica set controller to handle pod label updates that match no RC or RS.
Fix #27405
This commit adds a new volume manager in kubelet that synchronizes
volume mount/unmount (and attach/detach, if attach/detach controller
is not enabled).
This eliminates the race conditions between the pod creation loop
and the orphaned volumes loops. It also removes the unmount/detach
from the `syncPod()` path so volume clean up never blocks the
`syncPod` loop.
Automatic merge from submit-queue
processor listener: fix locking in pop()
Currently the lock in processorListener is used to guard pendingNotifications, but pop() also holds it around a select on a channel. This blocks the goroutine while the lock is acquired.
This PR changes the lock to guard the correct section only.
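A self-contained sketch of the shape of the fix (not the real processorListener code): hold the lock only around the shared pendingNotifications buffer, never around the channel operation.
```go
package main

import (
	"fmt"
	"sync"
)

type listener struct {
	mu                   sync.Mutex
	pendingNotifications []string
	next                 chan string
}

// popOne takes one pending notification and sends it on the channel. The lock
// guards only pendingNotifications; the potentially blocking channel send
// happens after the lock is released, so other goroutines can keep appending.
func (l *listener) popOne() bool {
	l.mu.Lock()
	if len(l.pendingNotifications) == 0 {
		l.mu.Unlock()
		return false
	}
	n := l.pendingNotifications[0]
	l.pendingNotifications = l.pendingNotifications[1:]
	l.mu.Unlock()

	l.next <- n // may block, but no lock is held here
	return true
}

func (l *listener) add(n string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.pendingNotifications = append(l.pendingNotifications, n)
}

func main() {
	l := &listener{next: make(chan string, 1)}
	l.add("update pod-a")
	l.popOne()
	fmt.Println(<-l.next)
}
```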
Automatic merge from submit-queue
volume controller: Convert PersistentVolumes from Kubernetes 1.2
In Kubernetes 1.2 we used template PersistentVolume for provisioning. When a claim for dynamic volume was detected, Kubernetes did:
- create template PV for the claim with dummy pointer to storage asset
- allocate storage asset such as AWS EBS
- fill real pointer to the created storage asset to the template PV
In refactored volume provisioner, Kubernetes allocates the storage asset first and then creates a Kubernetes PV instance already with the correct pointer to the storage asset.
To support seamless upgrade from 1.2 to 1.3 we need to remove these unprovisioned template PVs. The new controller does not use them; it will see the PVC for dynamic provisioning and create a real PV instead.
See https://github.com/pmorie/pv-haxxz/pull/3 for pseudocode.
Automatic merge from submit-queue
Wait for all volumes/claims to get synced in unit test.
Controller.HasSynced() returns true when all initial claims/volumes were sent
to appropriate goroutines, not when the goroutine has actually processed them.
Fixes #26712
In Kubernetes 1.2 we used template PersistentVolume for provisioning. When a
claim for dynamic volume was detected, Kubernetes did:
- create template PV for the claim with dummy pointer to storage asset
- allocate storage asset such as AWS EBS
- fill real pointer to the created storage asset to the template PV
In refactored volume provisioner, Kubernetes allocates the storage asset first
and then creates a Kubernetes PV instance already with the correct pointer
to the storage asset.
To support seamless upgrade from 1.2 to 1.3 we need to remove these
unprovisioned template PVs. The new controller does not use them; it will see
the PVC for dynamic provisioning and create a real PV instead.
Automatic merge from submit-queue
Fix typo and linewrap comments in PV controller
Fix some typos and linewrap long comments that I found while going over this code investigating something.
Controller.HasSynced() returns true when all initial claims/volumes were sent
to appropriate goroutines, not when the goroutine has actually processed them.
Automatic merge from submit-queue
Attach/Detach Controller Kubelet Changes
This PR contains changes to enable attach/detach controller proposed in #20262.
Specifically it:
* Introduces a new `enable-controller-attach-detach` kubelet flag to enable control by attach/detach controller. Default enabled.
* Removes all references to the `SafeToDetach` annotation from the controller.
* Adds the new `VolumesInUse` field to the Node Status API object.
* Modifies the controller to use `VolumesInUse` instead of `SafeToDetach` annotation to gate detachment.
* Modifies kubelet to set `VolumesInUse` before Mount and after Unmount.
* There is a bug in the `node-problem-detector` binary that causes `VolumesInUse` to get reset to nil every 30 seconds. Issue https://github.com/kubernetes/node-problem-detector/issues/9#issuecomment-221770924 opened to fix that.
* There is a bug in the mount/unmount code that prevents resetting `VolumesInUse` in some cases; this will be fixed by the mount/unmount refactor.
* Have controller process detaches before attaches so that volumes referenced by pods that are rescheduled to a different node are detached first.
* Fix misc bugs in controller.
* Modify GCE attacher to: remove retries, remove mutex, and not fail if volume is already attached or already detached.
Fixes #14642, #19953
```release-note
Kubernetes v1.3 introduces a new Attach/Detach Controller. This controller manages attaching and detaching volumes on behalf of nodes that have the "volumes.kubernetes.io/controller-managed-attach-detach" annotation.
A kubelet flag, "enable-controller-attach-detach" (default true), controls whether a node sets the "controller-managed-attach-detach" annotation or not.
```
Automatic merge from submit-queue
Fill controller caches on startup
The controller needs to fill its caches before it starts binding/recycling/deleting or provisioning volumes and claims. This was done by blocking initial 'xxx added' events from going through syncClaim/syncVolume. However, when the caches were full, the controller waited for the next sync period to do the actual binding/recycling etc.
In this patch, the controller fills its caches directly from etcd and then processes the initial 'xxx added' events to reconcile the world and bind/recycle/delete/provision stuff, resulting in faster binding after startup.
Fixes #25967 (properly)
This PR contains Kubelet changes to enable attach/detach controller control.
* It introduces a new "enable-controller-attach-detach" kubelet flag to
enable control by controller. Default enabled.
* It removes all references to the "SafeToDetach" annotation from the controller.
* It adds the new VolumesInUse field to the Node Status API object.
* It modifies the controller to use VolumesInUse instead of SafeToDetach
annotation to gate detachment.
* There is a bug in node-problem-detector that causes VolumesInUse to
get reset every 30 seconds. Issue https://github.com/kubernetes/node-problem-detector/issues/9
opened to fix that.
Automatic merge from submit-queue
Fix data race in volume controller unit test.
Reactor must be locked when fiddling with reactor.volumes and reactor.claims. Therefore add new functions to add/delete volume/claim with sending an event.
Fixes #26345
The event recorder should wait for some time to get all expected events; an event
may be written by another goroutine that has just finished.
This should not slow down the test in most cases, only when there is a bug and
an expected event is not sent.
Reactor must be locked when fiddling with reactor.volumes and reactor.claims.
Therefore add new functions to add/delete volume/claim with sending an event.
Automatic merge from submit-queue
Stabilize controller unit tests.
Remove test "5-1", it's flaky as it depends on the order of execution of goroutines. When the controller starts, the existing claim is enqueued as an "initial sync event" and a new volume is enqueued to a separate goroutine. It is not deterministic which goroutine processes its events first, and there is no way to tell that the claim event was processed.
Also, force resync of the controllers after the test to make sure all events are processed.
Fixes unit test flakes.
@kubernetes/sig-storage
Remove test "5-1", it's flaky as it depends on the order of execution of
goroutines. When the controller starts, the existing claim is enqueued as
an "initial sync event" and a new volume is enqueued to a separate goroutine.
It is not deterministic which goroutine processes its events first and
there is no way to tell that the claim event was processed.
Also, force resync of the controllers after the test to make sure all
events are processed.
Automatic merge from submit-queue
daemonset handle DeletedFinalStateUnknown
During an e2e run in OpenShift we ran into the DS controller panic when handling `DeletedFinalStateUnknown`. This PR checks for `DeletedFinalStateUnknown` and queues the embedded object if it is a `DaemonSet`.
@mikedanese - would you mind taking a look?
@deads2k
```
panic: interface conversion: interface is cache.DeletedFinalStateUnknown, not *extensions.DaemonSet
goroutine 4369 [running]:
k8s.io/kubernetes/pkg/controller/daemon.func·005(0x2f8a0c0, 0xc20b559680)
/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/pkg/controller/daemon/controller.go:160 +0x50
k8s.io/kubernetes/pkg/controller/framework.ResourceEventHandlerFuncs.OnDelete(0xc20a0ae090, 0xc20a0ae0a0, 0xc20a0ae0b0, 0x2f8a0c0, 0xc20b559680)
/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/pkg/controller/framework/controller.go:178 +0x41
k8s.io/kubernetes/pkg/controller/framework.(*ResourceEventHandlerFuncs).OnDelete(0xc20b8ebf20, 0x2f8a0c0, 0xc20b559680)
<autogenerated>:25 +0xb5
k8s.io/kubernetes/pkg/controller/framework.func·001(0x2f8a280, 0xc20b5522e0, 0x0, 0x0)
/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/pkg/controller/framework/controller.go:248 +0x4be
k8s.io/kubernetes/pkg/controller/framework.(*Controller).processLoop(0xc20bb727e0)
/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/pkg/controller/framework/controller.go:122 +0x6f
k8s.io/kubernetes/pkg/controller/framework.*Controller.(k8s.io/kubernetes/pkg/controller/framework.processLoop)·fm()
/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/pkg/controller/framework/controller.go:97 +0x27
k8s.io/kubernetes/pkg/util/wait.func·001()
/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/pkg/util/wait/wait.go:66 +0x61
k8s.io/kubernetes/pkg/util/wait.JitterUntil(0xc209f8cfb8, 0x3b9aca00, 0x0, 0xc2080543c0)
/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/pkg/util/wait/wait.go:67 +0x8f
k8s.io/kubernetes/pkg/util/wait.Until(0xc209f8cfb8, 0x3b9aca00, 0xc2080543c0)
/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/pkg/util/wait/wait.go:47 +0x4a
k8s.io/kubernetes/pkg/controller/framework.(*Controller).Run(0xc20bb727e0, 0xc2080543c0)
/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/pkg/controller/framework/controller.go:97 +0x1fb
created by k8s.io/kubernetes/pkg/controller/daemon.(*DaemonSetsController).Run
/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/pkg/controller/daemon/controller.go:212 +0xae
```
https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_check/1002/artifact/origin/artifacts/test-cmd/logs/openshift.log
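A sketch of the tombstone-handling pattern the fix applies; the types below are simplified stand-ins for cache.DeletedFinalStateUnknown and the DaemonSet object, not the controller's actual code.
```go
package main

import "fmt"

// Stand-ins for cache.DeletedFinalStateUnknown and extensions.DaemonSet.
type DeletedFinalStateUnknown struct {
	Key string
	Obj interface{}
}
type DaemonSet struct{ Name string }

// onDelete never assumes the object is a *DaemonSet: when the watch missed the
// delete, the informer hands us a tombstone whose Obj carries the last known
// state, and we unwrap that instead of panicking on a bad type assertion.
func onDelete(obj interface{}, enqueue func(*DaemonSet)) {
	ds, ok := obj.(*DaemonSet)
	if !ok {
		tombstone, ok := obj.(DeletedFinalStateUnknown)
		if !ok {
			fmt.Printf("couldn't get object from tombstone %+v\n", obj)
			return
		}
		ds, ok = tombstone.Obj.(*DaemonSet)
		if !ok {
			fmt.Printf("tombstone contained object that is not a DaemonSet %+v\n", obj)
			return
		}
	}
	enqueue(ds)
}

func main() {
	enqueue := func(ds *DaemonSet) { fmt.Println("enqueued", ds.Name) }
	onDelete(&DaemonSet{Name: "fluentd"}, enqueue)
	onDelete(DeletedFinalStateUnknown{Key: "kube-system/fluentd", Obj: &DaemonSet{Name: "fluentd"}}, enqueue)
}
```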
Automatic merge from submit-queue
Reduce volume controller sync period
Fixes #24236 and most probably also fixes #25294.
Needs #25881! With the cache, binder is not affected by sync period. Without the cache, binding of 1000 PVCs takes more than 5 minutes (instead of ~70 seconds).
15 seconds were chosen by fair 2d10 roll :-)
The controller needs to fill its caches before it starts binding/recycling/deleting
or provisioning volumes and claims. This was done by blocking initial 'xxx added'
events from going through syncClaim/syncVolume. However, when the caches were full,
the controller waited for the next sync period to do the actual binding/recycling etc.
In this patch, the controller fills its caches directly from etcd and then
processes the initial 'xxx added' events to reconcile the world and
bind/recycle/delete/provision stuff, resulting in faster binding after startup.
Fixes #25967 (properly)
Automatic merge from submit-queue
volume controller: Add cache with the latest version of PVs and PVCs
When the controller binds a PV to PVC, it saves both objects to etcd. However, there is still an old version of these objects in the controller Informer cache. So, when a new PVC comes, the PV is still seen as available and may get bound to the new PVC. This will be blocked by etcd, still, it creates unnecessary traffic that slows everything down.
To make everything worse, when periodic sync with the old PVC is performed, this PVC is seen by the controller as Pending (while it's already Bound on etcd) and will be bound to a different PV. Writing to this PV won't be blocked by etcd, only subsequent write of the PVC fails. So, the controller will need to roll back the PV in another transaction(s). The controller can keep itself pretty busy this way.
Also, we save bound PVs (and PVCs) as two transactions - we save say PV.Spec first and then .Status. The controller gets "PV.Spec updated" event from etcd and tries to fix the Status, as it seems to the controller it's outdated. This write again fails - there already is a correct version in etcd.
As we can't influence the Informer cache (it is read-only to the controller), this patch introduces a second cache in the controller, which holds the latest and greatest version of PVs and PVCs, to prevent these useless writes to etcd. It gets updated with events from etcd *and* after etcd confirms a successful save of a PV/PVC modified by the controller.
The cache stores only *pointers* to PVs/PVCs, so in ideal case it shares the actual object data with the informer cache. They will diverge only for a short time when the controller modifies something and the informer cache did not get update events yet.
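A simplified sketch of such a cache; the types and the numeric resource-version comparison are illustrative, not the controller's actual storeObjectUpdate code.
```go
package main

import (
	"fmt"
	"strconv"
	"sync"
)

type object struct {
	Name            string
	ResourceVersion string
	Phase           string
}

// latestCache keeps pointers to the newest known version of each object,
// updated both from informer events and from the controller's own successful
// writes, so stale informer data never triggers another write to etcd.
type latestCache struct {
	mu      sync.Mutex
	objects map[string]*object
}

// storeUpdate records obj only if it is at least as new as what we have.
func (c *latestCache) storeUpdate(obj *object) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if old, ok := c.objects[obj.Name]; ok {
		oldRV, _ := strconv.ParseInt(old.ResourceVersion, 10, 64)
		newRV, _ := strconv.ParseInt(obj.ResourceVersion, 10, 64)
		if newRV < oldRV {
			return false // stale event from the informer; ignore it
		}
	}
	c.objects[obj.Name] = obj
	return true
}

func main() {
	c := &latestCache{objects: map[string]*object{}}
	c.storeUpdate(&object{Name: "pv-1", ResourceVersion: "12", Phase: "Bound"})
	fmt.Println(c.storeUpdate(&object{Name: "pv-1", ResourceVersion: "11", Phase: "Available"})) // false
}
```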
@kubernetes/sig-storage
Automatic merge from submit-queue
Add a NodeCondition "NetworkUnavailable" to prevent scheduling onto a node until the routes have been created
This is a new version of #26267 (based on top of that one).
The new workflow is:
- we have a "NetworkNotReady" condition
- Kubelet, when it creates a node, sets it to "true"
- RouteController will set it to "false" when the route is created
- Scheduler schedules only on nodes that don't have the "NetworkNotReady == true" condition
@gmarek @bgrant0607 @zmerlynn @cjcullen @derekwaynecarr @danwinship @dcbw @lavalamp @vishh
Automatic merge from submit-queue
prevent namespace cleanup hotloop
Found chasing a sentry report. Looks like we hot-loop on namespace deletion failures.
@derekwaynecarr ptal
Automatic merge from submit-queue
Attach Detach Controller Business Logic
This PR adds the meat of the attach/detach controller proposed in #20262.
The PR splits the in-memory cache into a desired and actual state of the world.
Automatic merge from submit-queue
volume controller: use better operation names
Using volume/claim.UID in the operation name is not really useful, as UIDs are not logged by the rest of the controller. On the other hand, volume.Name and claim.Namespace/Name are logged pretty often, and it would help to log these in the operation name as well. Still, I'd prefer the operation name to be really unique, to protect against users deleting a volume and quickly creating another one with the same name, so the UID is still part of the operation name.
This has been already proven to be very useful in controller debugging.
Automatic merge from submit-queue
Add some extra checking in the tests to prevent flakes.
Attempts to fix https://github.com/kubernetes/kubernetes/issues/25967
The hypothesis is that somehow waitTest() catches an idle that occurs before all changes have been applied. This will block until the expected number of changes have arrived.
Automatic merge from submit-queue
use monotonic now in TestDelNode
Fixes https://github.com/kubernetes/kubernetes/issues/24971.
Briefly, the rate_limited_queue uses a `container/heap` to store values, and uses this data structure to ensure we can always fetch the value with the minimum `processAt`. However, in some extreme conditions, consecutive calls to `time.Now()` can return the same value, which causes unpredictable ordering in the queue; this fix uses a monotonic `now()` to avoid that.
@smarterclayton please take a look.
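A minimal sketch of a test-only monotonic clock; the helper name and granularity are illustrative.
```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// monotonicNow returns a strictly increasing time even when called in a tight
// loop, so heap entries keyed by "processAt" never compare equal and the pop
// order stays deterministic in tests.
var monotonicNow = func() func() time.Time {
	var mu sync.Mutex
	now := time.Now()
	return func() time.Time {
		mu.Lock()
		defer mu.Unlock()
		now = now.Add(time.Nanosecond)
		return now
	}
}()

func main() {
	a, b := monotonicNow(), monotonicNow()
	fmt.Println(b.After(a)) // always true, unlike two rapid time.Now() calls
}
```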
Split controller cache into actual and desired state of world.
Controller will only operate on volumes scheduled to nodes that
have the "volumes.kubernetes.io/controller-managed-attach" annotation.
Automatic merge from submit-queue
ResourceQuota controller uses rate limiter to prevent hot-loops in error situations
Have resource quota controller use a rate limited queue to prevent hot-looping in error situations.
Automatic merge from submit-queue
volume recycler: Don't start a new recycler pod if one already exists.
Recycling is a long duration process and when the recycler controller is restarted in the meantime, it should not start a new recycler pod if there is one already running.
This means that the recycler pod must have deterministic name based on name of the recycled PV, we then get name conflicts when creating the pod.
Two things need to be changed:
- the recycler controller and recycler plugins must pass the PV.Name to the place where the pod is created. This is most of the patch and it should be pretty straightforward.
- create the recycler pod with a deterministic name and check for the "already exists" error.
While at it, remove the useless 'resourceVersion' argument and make log messages start with lowercase.
There is a unit test to check the behavior, plus an e2e test that checks that regular recycling is not broken (it does not try to run two recycler pods in parallel as the recycler is single-threaded now).
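A self-contained sketch of the deterministic-name-plus-already-exists idea; the pod naming scheme and the sentinel error below are illustrative stand-ins (the real code checks the API server's "already exists" error).
```go
package main

import (
	"errors"
	"fmt"
)

var errAlreadyExists = errors.New("pods \"recycler-for-nfs\" already exists")

// createRecyclerPod derives the pod name from the PV, so a second controller
// instance (e.g. after a restart) asking to recycle the same volume collides
// on the name instead of silently starting a duplicate recycler pod.
func createRecyclerPod(pvName string, create func(podName string) error) error {
	podName := "recycler-for-" + pvName
	err := create(podName)
	if err != nil && errors.Is(err, errAlreadyExists) {
		// Someone is already recycling this volume; treat it as success and
		// keep watching the existing pod instead of starting another one.
		return nil
	}
	return err
}

func main() {
	calls := 0
	create := func(name string) error {
		calls++
		if calls > 1 {
			return errAlreadyExists
		}
		fmt.Println("created pod", name)
		return nil
	}
	fmt.Println(createRecyclerPod("nfs", create)) // <nil>
	fmt.Println(createRecyclerPod("nfs", create)) // <nil> — already running
}
```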
Automatic merge from submit-queue
NodeController doesn't evict Pods if no Nodes are Ready
Fix #13412 #24597
When the NodeController doesn't see any Ready Node, it goes into "network segmentation mode". In this mode it cancels all evictions and doesn't evict any Pods.
It leaves network segmentation mode when it sees at least one Ready Node. When leaving, it resets all timers, so each Node has the full grace period to reconnect to the cluster.
cc @lavalamp @davidopp @mml @wojtek-t @fgrzadkowski
The binder sorts all available volumes first, then it filters out volumes
that cannot be bound by processing each volume in a loop and then finds
the smallest matching volume by binary search.
So, if we process every available volume in a loop, we can also remember the
smallest matching one and save ourselves the potentially long sort (and the quick
binary search).
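A self-contained sketch of the single-pass selection, with one capacity dimension standing in for the full matching logic:
```go
package main

import "fmt"

type volume struct {
	Name     string
	Capacity int // simplified: one dimension instead of full resource matching
	Bound    bool
}

// findSmallestMatch walks the volumes once, keeping the smallest one that can
// satisfy the request, instead of sorting everything and binary-searching.
func findSmallestMatch(volumes []volume, requested int) *volume {
	var best *volume
	for i := range volumes {
		v := &volumes[i]
		if v.Bound || v.Capacity < requested {
			continue // cannot be bound to this claim
		}
		if best == nil || v.Capacity < best.Capacity {
			best = v
		}
	}
	return best
}

func main() {
	vols := []volume{{"pv-a", 10, false}, {"pv-b", 5, false}, {"pv-c", 8, true}}
	if v := findSmallestMatch(vols, 4); v != nil {
		fmt.Println("binding to", v.Name)
	}
}
```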
When the controller binds a PV to PVC, it saves both objects to etcd.
However, there is still an old version of these objects in the controller
Informer cache. So, when a new PVC comes, the PV is still seen as available
and may get bound to the new PVC. This will be blocked by etcd, still, it
creates unnecessary traffic that slows everything down.
Also, we save bound PV/PVC as two transactions - we save PV/PVC.Spec first
and then .Status. The controller gets "PV/PVC.Spec updated" event from etcd
and tries to fix the Status, as it seems to the controller it's outdated.
This write again fails - there already is a correct version in etcd.
We can't influence the Informer cache, it is read-only to the controller.
To prevent these useless writes to etcd, this patch introduces a second cache
in the controller, which holds the latest and greatest version of PVs and PVCs.
It gets updated with events from etcd *and* after etcd confirms successful
save of PV/PVC modified by the controller.
The cache stores only *pointers* to PVs/PVCs, so in ideal case it shares the
actual object data with the informer cache. They will diverge only when
the controller modifies something and the informer cache did not get update
events yet.
Using volume/claim.UID in the operation name is not really useful, as UIDs
are not logged by the rest of the controller. On the other hand, volume.Name and
claim.Namespace/Name are logged pretty often and it would help to log these
in the operation name as well.
This has been already proven to be very useful in controller debugging.
Recycling is a long duration process and when the recycler controller is
restarted in the meantime, it should not start a new recycler pod if there is
one already running.
This means that the recycler pod must have deterministic name based on name
of the recycled PV, we then get name conflicts when creating the pod.
Two things need to be changed:
- the recycler controller and recycler plugins must pass the PV.Name to the place
where the pod is created.
- create the recycler pod with a deterministic name and check for the "already exists" error.
While at it, remove the useless 'resourceVersion' argument and make log messages
start with lowercase.
Automatic merge from submit-queue
Refactor persistent volume controller
Here is complete persistent controller as designed in https://github.com/pmorie/pv-haxxz/blob/master/controller.go
It's feature complete and compatible with current binder/recycler/provisioner. No new features, it *should* be much more stable and predictable.
Testing
--
The unit test framework is quite complicated, still it was necessary to reach reasonable coverage (78% in `persistentvolume_controller.go`). The untested parts are error cases, which are quite hard to test in a reasonable way - sure, I can inject a VersionConflictError on any object update and check that the error bubbles up to the appropriate places, but the real test would be to run `syncClaim`/`syncVolume` again and check that it recovers appropriately from the error in the next periodic sync. That's the hard part.
Organization
---
The PR starts with `rm -rf kubernetes/pkg/controller/persistentvolume`. I find it easier to read when I see only the new controller without old pieces scattered around.
[`types.go` from the old controller is reused to speed up matching a bit, the code looks solid and has 95% unit test coverage].
I tried to split the PR into smaller patches, let me know what you think.
~~TODO~~
--
* ~~Missing: provisioning, recycling~~.
* ~~Fix integration tests~~
* ~~Fix e2e tests~~
@kubernetes/sig-storage
Fixes #15632
Automatic merge from submit-queue
Make name validators return string slices
Part of the larger validation PR, broken out for easier review and merge. Builds on previous PRs in the series.
- remove persistentvolume_ prefix from all files
- split controller.go into controller.go and controller_base.go (to have them
under 1500 lines for github)
This fixes e2e test for provisioning - it expects that provisioned volumes
are bound quickly.
The majority of this patch is an update of the test framework, which needs to initialize the
controller appropriately.
- Add reclaim policy to newVolume() call.
- Implement reactor Volumes().Get().
- Implement mock volume plugin.
- Add recycler tests.
- Add a synchronization condition to controller.scheduleOperation
- we need to pause the controller here, let the test do some bad things
to the controller, and test error cases in recycleVolumeOperation.
Test framework gets more and more complicated... But this is the last piece,
I promise.
We need to keep a list of running recyclers, deleters and provisioners in
memory in order not to start a new recycling/deleting/provisioning operation twice
for the same volume/claim.
This will be eventually replaced by GoRoutineMap from PR #24838.
Automatic merge from submit-queue
prevent nil pointer when starting controllers before running the shar…
Fixes https://github.com/kubernetes/kubernetes/issues/25643.
https://github.com/kubernetes/kubernetes/pull/23795 changed initialization order, so the controller isn't guaranteed to be present at startup.
@mqliang @wojtek-t I'm pretty sure that we're not guaranteed to get back the correct `cache.Indexer` or `cache.Store` either. I'll look at re-plumbing the `AddIndexer` path to use the same instance so that its safe to use again.
Automatic merge from submit-queue
Horizontal autoscaler tolerance breaching verifier
@piosz @jszczepkowski This is the original HPA test without the other modifications, it makes explicit the mathematics of the logic in case changes ever occur.