Split controller cache into actual and desired state of world.
Controller will only operate on volumes scheduled to nodes that
have the "volumes.kubernetes.io/controller-managed-attach" annotation.
Automatic merge from submit-queue
ResourceQuota controller uses rate limiter to prevent hot-loops in error situations
Have resource quota controller use a rate limited queue to prevent hot-looping in error situations.
Automatic merge from submit-queue
volume recycler: Don't start a new recycler pod if one already exists.
Recycling is a long duration process and when the recycler controller is restarted in the meantime, it should not start a new recycler pod if there is one already running.
This means that the recycler pod must have deterministic name based on name of the recycled PV, we then get name conflicts when creating the pod.
Two things need to be changed:
- recycler controller and recycler plugins must pass the PV.Name to place, where the pod is created. This is most of the patch and it should be pretty straightforward.
- create recycler pod with deterministic name and check "already exists" error.
When at it, remove useless 'resourceVersion' argument and make log messages starting with lowercase.
There is an unit test to check the behavior + there is an e2e test that checks that regular recycling is not broken (it does not try to run two recycler pods in parallel as the recycler is single-threaded now).
Automatic merge from submit-queue
NodeController doesn't evict Pods if no Nodes are Ready
Fix#13412#24597
When NodeControllers don't see any Ready Node it goes into "network segmentation mode". In this mode it cancels all evictions and don't evict any Pods.
It leaves network segmentation mode when it sees at least one Ready Node. When leaving it resets all timers, so each Node has full grace period to reconnect to the cluster.
cc @lavalamp @davidopp @mml @wojtek-t @fgrzadkowski
The binder sorts all available volumes first, then it filters out volumes
that cannot be bound by processing each volume in a loop and then finds
the smallest matching volume by binary search.
So, if we process every available volume in a loop, we can also remember the
smallest matching one and save us potentially long sorting (and quick binary
search).
When the controller binds a PV to PVC, it saves both objects to etcd.
However, there is still an old version of these objects in the controller
Informer cache. So, when a new PVC comes, the PV is still seen as available
and may get bound to the new PVC. This will be blocked by etcd, still, it
creates unnecessary traffic that slows everything down.
Also, we save bound PV/PVC as two transactions - we save PV/PVC.Spec first
and then .Status. The controller gets "PV/PVC.Spec updated" event from etcd
and tries to fix the Status, as it seems to the controller it's outdated.
This write again fails - there already is a correct version in etcd.
We can't influence the Informer cache, it is read-only to the controller.
To prevent these useless writes to etcd, this patch introduces second cache
in the controller, which holds latest and greatest version on PVs and PVCs.
It gets updated with events from etcd *and* after etcd confirms successful
save of PV/PVC modified by the controller.
The cache stores only *pointers* to PVs/PVCs, so in ideal case it shares the
actual object data with the informer cache. They will diverge only when
the controller modifies something and the informer cache did not get update
events yet.
Using volume/claim.UID in the operation name is not really useful, as UIDs
are not logged by rest of the controller. On the other hand, volume.Name and
claim.Namespace/Name is logged pretty often and it would help to log these
also in operation name.
This has been already proven to be very useful in controller debugging.
Recycling is a long duration process and when the recycler controller is
restarted in the meantime, it should not start a new recycler pod if there is
one already running.
This means that the recycler pod must have deterministic name based on name
of the recycled PV, we then get name conflicts when creating the pod.
Two things need to be changed:
- recycler controller and recycler plugins must pass the PV.Name to place,
where the pod is created.
- create recycler pod with deterministic name and check "already exists" error.
When at it, remove useless 'resourceVersion' argument and make log messages
starting with lowercase.
Automatic merge from submit-queue
Refactor persistent volume controller
Here is complete persistent controller as designed in https://github.com/pmorie/pv-haxxz/blob/master/controller.go
It's feature complete and compatible with current binder/recycler/provisioner. No new features, it *should* be much more stable and predictable.
Testing
--
The unit test framework is quite complicated, still it was necessary to reach reasonable coverage (78% in `persistentvolume_controller.go`). The untested part are error cases, which are quite hard to test in reasonable way - sure, I can inject a VersionConflictError on any object update and check the error bubbles up to appropriate places, but the real test would be to run `syncClaim`/`syncVolume` again and check it recovers appropriately from the error in the next periodic sync. That's the hard part.
Organization
---
The PR starts with `rm -rf kubernetes/pkg/controller/persistentvolume`. I find it easier to read when I see only the new controller without old pieces scattered around.
[`types.go` from the old controller is reused to speed up matching a bit, the code looks solid and has 95% unit test coverage].
I tried to split the PR into smaller patches, let me know what you think.
~~TODO~~
--
* ~~Missing: provisioning, recycling~~.
* ~~Fix integration tests~~
* ~~Fix e2e tests~~
@kubernetes/sig-storage
<!-- Reviewable:start -->
---
This change is [<img src="http://reviewable.k8s.io/review_button.svg" height="35" align="absmiddle" alt="Reviewable"/>](http://reviewable.k8s.io/reviews/kubernetes/kubernetes/24331)
<!-- Reviewable:end -->
Fixes#15632
Automatic merge from submit-queue
Make name validators return string slices
Part of the larger validation PR, broken out for easier review and merge. Builds on previous PRs in the series.
- remove persistentvolume_ prefix from all files
- split controller.go into controller.go and controller_base.go (to have them
under 1500 lines for github)
This fixes e2e test for provisioning - it expects that provisioned volumes
are bound quickly.
Majority of this patch is update of test framework needs to initialize the
controller appropriately.
- Add reclaim policy to newVolume() call.
- Implement reactor Volumes().Get().
- Implement mock volume plugin.
- Add recycler tests.
- Add a synchronization condition to controller.scheduleOperation
- we need to pause the controller here, let the test to do some bad things
to the controller and test error cases in recycleVolumeOperation.
Test framework gets more and more complicated... But this is the last piece,
I promise.
We need to keep list of running recyclers, deleters and provisioners in
memory in order not to start a new recycling/deleting/provisioning twice
for the same volume/claim.
This will be eventually replaced by GoRoutineMap from PR #24838.
Automatic merge from submit-queue
prevent nil pointer when starting controllers before running the shar…
Fixes https://github.com/kubernetes/kubernetes/issues/25643.
https://github.com/kubernetes/kubernetes/pull/23795 changed initialization order, so the controller isn't guaranteed to be present at startup.
@mqliang @wojtek-t I'm pretty sure that we're not guaranteed to get back the correct `cache.Indexer` or `cache.Store` either. I'll look at re-plumbing the `AddIndexer` path to use the same instance so that its safe to use again.
Automatic merge from submit-queue
Horizontal autoscalaer tolerance breaching verifier
@piosz @jszczepkowski This is the original HPA test without the other modifications, it makes explicit the mathematics of the logic in case changes ever occur.
<!-- Reviewable:start -->
---
This change is [<img src="http://reviewable.k8s.io/review_button.svg" height="35" align="absmiddle" alt="Reviewable"/>](http://reviewable.k8s.io/reviews/kubernetes/kubernetes/24736)
<!-- Reviewable:end -->
Automatic merge from submit-queue
WIP v0 NVIDIA GPU support
```release-note
* Alpha support for scheduling pods on machines with NVIDIA GPUs whose kubelets use the `--experimental-nvidia-gpus` flag, using the alpha.kubernetes.io/nvidia-gpu resource
```
Implements part of #24071 for #23587
I am not familiar with the scheduler enough to know what to do with the scores. Mostly punting for now.
Missing items from the implementation plan: limitranger, rkt support, kubectl
support and docs
cc @erictune @davidopp @dchen1107 @vishh @Hui-Zhi @gopinatht
Automatic merge from submit-queue
Remove myself from a bunch of OWNERS files
For the time being I am too overloaded to do non scheduler/admission related reviews that aren't explicitly assigned to me.
cc/ @brendandburns
Automatic merge from submit-queue
Move internal types of hpa from pkg/apis/extensions to pkg/apis/autoscaling
ref #21577
@lavalamp could you please review or delegate to someone from CSI team?
@janetkuo could you please take a look into the kubelet changes?
cc @fgrzadkowski @jszczepkowski @mwielgus @kubernetes/autoscaling
Automatic merge from submit-queue
Let the dynamic client take runtime.Object instead of v1.ListOptions
so that I can pass whatever version of ListOptions to the List/Watch/DeleteCollection methods.
cc @krousey
Implements part of #24071
I am not familiar with the scheduler enough to know what to do with the scores. Punting for now.
Missing items from the implementation plan: limitranger, rkt support, kubectl
support and user docs
Automatic merge from submit-queue
Add data structure for storing attach detach controller state.
This PR introduces the data structure for maintaining the in-memory state for the new attach/detach controller (#20262).
Automatic merge from submit-queue
add namespace index for cache
@wojtek-t
Implement in this approach make the change of lister.go small, but we should replace all `NewInformer()` to `NewIndexInformer()`, even when someone not want to filter by namespace(eg. gc_controller and scheduler). Any suggestion?
Automatic merge from submit-queue
Petset controller
Took longer than I expected. Main parts of this pr are:
1. Identity generation based on petset spec (volumes are mapped per discussion in #18016)
2. Ensure that we create/delete pets in sequence
3. Ensuring that we create, wait for healthy, create; or delete, wait for terminationGrace, delete
4. Controller that watches apiserver and drives actual -> desired
PVCs are not deleted, yet.
Add accessor methods that implement pkg/api/unversioned.ObjectKind,
pkg/api/meta.Object, pkg/api/meta.Type and pkg/api/meta.List.
Removed the convenience fields since writing to them was not reflected
in serialized JSON.
Automatic merge from submit-queue
Promote Pod Hostname & Subdomain to fields (were annotations)
Deprecating the podHostName, subdomain and PodHostnames annotations and created corresponding new fields for them on PodSpec and Endpoints types.
Annotation doc: #22564
Annotation code: #20688
Automatic merge from submit-queue
Generated clients can return their RESTClients, RESTClient can return its RateLimiter
cc @lavalamp @krousey @wojtek-t @smarterclayton @timothysc
Ref. #22421
Automatic merge from submit-queue
RateLimitedQueue TestTryOrdering could fail under load
Remove the possibility of contention in the test by providing a
synthetic Now() function.
Fixes#24125
Automatic merge from submit-queue
update controllers watching all pods to share an informer
This plumbs the shared pod informer through the various controllers to avoid duplicated watches.
Automatic merge from submit-queue
Additional go vet fixes
Mostly:
- pass lock by value
- bad syntax for struct tag value
- example functions not formatted properly
Automatic merge from submit-queue
Move typed clients into clientset folder
Move typed clients from `pkg/client/typed/` to `pkg/client/clientset_generated/${clientset_name}/typed`.
The first commit changes the client-gen, the last commit updates the doc, other commits are just moving things around.
@lavalamp @krousey
Automatic merge from submit-queue
Check claimRef UID when processing a recycled PV, take 2
Reorder code a bit so it doesn't allow a case where you get some error other than "not found"
combined with a non-nil Claim.
Add test case.
cc @kubernetes/rh-cluster-infra @kubernetes/rh-storage @liggitt
This is a better abstraction than passing in specific pieces of the
Service that each of the cloudproviders may or may not need. For
instance, many of the providers don't need a region, yet this is passed
in. Similarly many of the providers want a string IP for the load
balancer, but it passes in a converted net ip. Affinity is unused by
AWS. A provider change may also require adding a new parameter which has
an effect on all other cloud provider implementations.
Further, this will simplify adding provider specific load balancer
options, such as with labels or some other metadata. For example, we
could add labels for configuring the details of an AWS elastic load
balancer, such as idle timeout on connections, whether it is
internal or external, cross-zone load balancing, and so on.
Authors: @chbatey, @jsravn
Here are a list of changes along with an explanation of how they work:
1. Add a new string field called TargetSelector to the external version of
extensions Scale type (extensions/v1beta1.Scale). This is a serialized
version of either the map-based selector (in case of ReplicationControllers)
or the unversioned.LabelSelector struct (in case of Deployments and
ReplicaSets).
2. Change the selector field in the internal Scale type (extensions.Scale) to
unversioned.LabelSelector.
3. Add conversion functions to convert from two external selector fields to a
single internal selector field. The rules for conversion are as follows:
i. If the target resource that this scale targets supports LabelSelector
(Deployments and ReplicaSets), then serialize the LabelSelector and
store the string in the TargetSelector field in the external version
and leave the map-based Selector field as nil.
ii. If the target resource only supports a map-based selector
(ReplicationControllers), then still serialize that selector and
store the serialized string in the TargetSelector field. Also,
set the the Selector map field in the external Scale type.
iii. When converting from external to internal version, parse the
TargetSelector string into LabelSelector struct if the string isn't
empty. If it is empty, then check if the Selector map is set and just
assign that map to the MatchLabels component of the LabelSelector.
iv. When converting from internal to external version, serialize the
LabelSelector and store it in the TargetSelector field. If only
the MatchLabel component is set, then also copy that value to
the Selector map field in the external version.
4. HPA now just converts the LabelSelector field to a Selector interface
type to list the pods.
5. Scale Get and Update etcd methods for Deployments and ReplicaSets now
return extensions.Scale instead of autoscaling.Scale.
6. Consequently, SubresourceGroupVersion override and is "autoscaling"
enabled check is now removed from pkg/master/master.go
7. Other small changes to labels package, fuzzer and LabelSelector
helpers to piece this all together.
8. Add unit tests to HPA targeting Deployments and ReplicaSets.
9. Add an e2e test to HPA targeting ReplicaSets.
Due to rounding down for maxUnavailable, we may end up with deployments
that have zero surge and unavailable pods something that 1) is not allowed
as per validation, 2) blocks deployments. If we end up in such a situation
set maxUnavailable to 1 on the theory that surge might not work due to
quota.