Add a live list of Pods to the PVC protection controller, in addition
to the cache-based list done through the Informer. Both lists are
performed while processing a PVC that has a deletionTimestamp set, to
check whether any Pods using the PVC still exist; if none do, the
finalizer is removed so the PVC can be deleted. Prior to this commit
only the cache-based list was done, but that is unreliable because a
Pod using the PVC might exist without being in the cache yet. The live
list, on the other hand, is fully reliable.
Note that it would be enough to do only the live list. Instead, this
commit adds it after the cache-based list and performs it only if the
latter finds no Pod blocking deletion of the PVC being processed. The
rationale is that live lists are expensive, so it is desirable to
minimize them. The drawback is that if, at the time of the cache-based
list, the cache has not yet been notified of the deletion of a Pod
using the PVC, the PVC is kept. Correctness is not compromised because
the finalizer will still be removed when the Pod deletion notification
is received, but PVC deletion is delayed. Reducing live lists was
valued more than deleting PVCs slightly faster.
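A minimal sketch of that two-step check, with illustrative names
(pvcIsInUse, podUsesPVC) rather than the controller's real identifiers:

```go
package pvcprotection

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/kubernetes"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// controller is reduced to the two fields the sketch needs.
type controller struct {
	client    kubernetes.Interface
	podLister corelisters.PodLister
}

// pvcIsInUse sketches the two-step check: consult the informer cache
// first and fall back to a live list only when the cache shows no user.
func (c *controller) pvcIsInUse(ctx context.Context, pvc *v1.PersistentVolumeClaim) (bool, error) {
	// Fast path: cache-based list through the Informer.
	cached, err := c.podLister.Pods(pvc.Namespace).List(labels.Everything())
	if err != nil {
		return false, err
	}
	for _, pod := range cached {
		if podUsesPVC(pod, pvc.Name) {
			return true, nil // the cache already proves the PVC is in use
		}
	}
	// Slow path: the cache may not have seen a recently created Pod yet,
	// so confirm with a live list before the finalizer may be removed.
	live, err := c.client.CoreV1().Pods(pvc.Namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	for i := range live.Items {
		if podUsesPVC(&live.Items[i], pvc.Name) {
			return true, nil
		}
	}
	return false, nil // safe to remove the finalizer
}

func podUsesPVC(pod *v1.Pod, claimName string) bool {
	for _, vol := range pod.Spec.Volumes {
		if vol.PersistentVolumeClaim != nil && vol.PersistentVolumeClaim.ClaimName == claimName {
			return true
		}
	}
	return false
}
```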
Also, add a unit test that fails without the change introduced by this
commit, and revamp the old unit tests. The latter is needed because the
expected behavior is described in terms of the API calls the controller
makes, and this commit introduces new API calls (the live lists).
Before this change, an object would be added to the `attemptToDelete` queue only if the GC detected the deletion transition, simply by checking whether `deletionTimestamp` was set. After the change, the GC also checks that the `foregroundDeletion` finalizer is set before adding the item to the queue.
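A rough sketch of the new condition, with illustrative helper names
(not the GC's actual code):

```go
package garbagecollector

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// hasFinalizer reports whether the object carries the given finalizer.
func hasFinalizer(obj *metav1.ObjectMeta, finalizer string) bool {
	for _, f := range obj.Finalizers {
		if f == finalizer {
			return true
		}
	}
	return false
}

// shouldEnqueueForDeletion sketches the new check: the object is added to
// the attemptToDelete queue only when it is being deleted in the
// foreground, i.e. deletionTimestamp is set and the foregroundDeletion
// finalizer is present.
func shouldEnqueueForDeletion(obj *metav1.ObjectMeta) bool {
	beingDeleted := obj.DeletionTimestamp != nil
	return beingDeleted && hasFinalizer(obj, metav1.FinalizerDeleteDependents)
}
```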
During a Deployment update there may be more Pods in the scale target
ref's status than in its spec. This test verifies that we do not scale
to the status value; instead we stay at the spec value.
The test fails before #79035 and passes after.
This is used when the cloudprovider layer does not implement the
loadBalancer service. The implementation will live in a separate
controller running on the master.
Make the PVC protection controller robust to cases where a Pod X is
deleted and then a Pod Y with the same namespaced name is created, with
the two events delivered via a single update notification. Both Pods
should be processed, because X might be blocking deletion of a PVC that
is not referenced by Y. Prior to this commit only the newer Pod was
processed, which made it possible to leak PVCs.
Also, add unit tests to reflect the change.
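A sketch of the idea behind the fixed update handler; the names,
fields, and queue keying are illustrative, not the controller's real
code:

```go
package pvcprotection

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/util/workqueue"
)

type controller struct {
	queue workqueue.Interface
}

// podUpdated handles a single update notification that may actually carry
// a deletion of Pod X followed by the creation of Pod Y under the same
// namespaced name. Both Pods are processed, so a PVC that only X was
// blocking is not leaked.
func (c *controller) podUpdated(old, new *v1.Pod) {
	if old != nil && old.UID != new.UID {
		// The old Pod was replaced, not merely updated: enqueue its PVCs too.
		c.enqueuePVCsOf(old)
	}
	c.enqueuePVCsOf(new)
}

func (c *controller) enqueuePVCsOf(pod *v1.Pod) {
	for _, vol := range pod.Spec.Volumes {
		if vol.PersistentVolumeClaim != nil {
			c.queue.Add(pod.Namespace + "/" + vol.PersistentVolumeClaim.ClaimName)
		}
	}
}
```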
Add support for scaling to zero pods
minReplicas is allowed to be zero
condition is set once
Based on https://github.com/kubernetes/kubernetes/pull/61423
set original valid condition
add scale to/from zero and invalid metric tests
Scaling up from zero pods ignores tolerance
validate metrics when minReplicas is 0
Document HPA behaviour when minReplicas is 0
Documented minReplicas field in autoscaling APIs
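For illustration only, an HPA object with minReplicas set to zero could
look like the following sketch (autoscaling/v2beta2 types; the target
and metric names are made up):

```go
package example

import (
	autoscalingv2 "k8s.io/api/autoscaling/v2beta2"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }

// scaleToZeroHPA sketches an HPA that is allowed to scale its target down
// to zero replicas. An external metric drives scaling from zero, since
// per-pod metrics disappear together with the pods.
func scaleToZeroHPA() *autoscalingv2.HorizontalPodAutoscaler {
	return &autoscalingv2.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "worker", Namespace: "default"},
		Spec: autoscalingv2.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscalingv2.CrossVersionObjectReference{
				APIVersion: "apps/v1",
				Kind:       "Deployment",
				Name:       "worker",
			},
			MinReplicas: int32Ptr(0), // zero is now a valid minimum
			MaxReplicas: 10,
			Metrics: []autoscalingv2.MetricSpec{{
				Type: autoscalingv2.ExternalMetricSourceType,
				External: &autoscalingv2.ExternalMetricSource{
					Metric: autoscalingv2.MetricIdentifier{Name: "queue_length"},
					Target: autoscalingv2.MetricTarget{
						Type:         autoscalingv2.AverageValueMetricType,
						AverageValue: resource.NewQuantity(30, resource.DecimalSI),
					},
				},
			}},
		},
	}
}
```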
All controllers in controller-manager that deal with objects
generically can work with those objects without needing the full
object. Update the GC and quota controllers to use
PartialObjectMetadata input objects, which is faster and more
efficient.
The metadata client uses protobuf and returns only a subset of object
data (the metadata), which allows operations that act on objects
generically to run much faster. Use the metadata client in the
namespace controller to reduce the amount of work it has to do in
large namespaces.
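A minimal sketch of listing only object metadata through the metadata
client (the GroupVersionResource and namespace are just examples):

```go
package example

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/metadata"
	"k8s.io/client-go/rest"
)

func listMetadataOnly(ctx context.Context, cfg *rest.Config) error {
	// The metadata client negotiates protobuf and returns
	// PartialObjectMetadata objects: just TypeMeta plus ObjectMeta.
	client, err := metadata.NewForConfig(cfg)
	if err != nil {
		return err
	}
	gvr := schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "deployments"}
	list, err := client.Resource(gvr).Namespace("default").List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for i := range list.Items {
		// Only metadata is available: names, labels, ownerRefs, finalizers...
		fmt.Println(list.Items[i].Name, list.Items[i].DeletionTimestamp)
	}
	return nil
}
```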
This change adds pending pods to the ignored set first, before
selecting pods that are missing metrics. Pending pods are always
ignored when calculating scale.
When the HPA decides which pods and metric values to take into account
when scaling, it divides the pods into three disjoint subsets: 1)
ready, 2) missing metrics, and 3) ignored. First the HPA selects pods
that are missing metrics. Then it selects pods that should be ignored
because they are not ready yet or are still consuming CPU during
initialization. All the remaining pods go into the ready set. After
the HPA has decided which direction it wants to scale based on the
ready pods, it considers what might have happened if it had had the
missing metrics. It makes a conservative guess about what the missing
metrics might have been: 0% if it wants to scale up, 100% if it wants
to scale down. This is a good thing when scaling up, because newly
added pods will likely help reduce the usage ratio even though their
metrics are missing at the moment; the HPA should wait to see the
results of its previous scale decision before it makes another one.
However, when scaling down it means that many missing metrics can pin
the HPA at a high scale even when load is completely removed. In
particular, when there are many unschedulable pods due to insufficient
cluster capacity, the many missing metrics (assumed to be 100%) can
prevent the HPA from scaling down indefinitely.
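A simplified sketch of the grouping and of the conservative guess for
missing metrics; the function names and the plain metrics map are
illustrative, not the real replica calculator:

```go
package example

import v1 "k8s.io/api/core/v1"

// groupPods splits pods into the three disjoint sets described above.
// Pending pods go into the ignored set first, so they are never counted
// as missing metrics.
func groupPods(pods []*v1.Pod, metrics map[string]int64) (ready, ignored, missing []string) {
	for _, pod := range pods {
		if pod.Status.Phase == v1.PodPending {
			ignored = append(ignored, pod.Name)
			continue
		}
		if _, ok := metrics[pod.Name]; !ok {
			missing = append(missing, pod.Name)
			continue
		}
		ready = append(ready, pod.Name)
	}
	return ready, ignored, missing
}

// assumeMissing fills in the conservative guess for pods missing metrics:
// 0% of the target when scaling up, 100% of the target when scaling down.
func assumeMissing(metrics map[string]int64, missing []string, scalingUp bool, target int64) {
	for _, name := range missing {
		if scalingUp {
			metrics[name] = 0
		} else {
			metrics[name] = target
		}
	}
}
```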
As the benchmark shows, this speeds up the method roughly 4x and
reduces memory consumption roughly 20x.
```
benchmark                            old ns/op     new ns/op     delta
BenchmarkGetPodMapForDeployment-12   276121        72591         -73.71%

benchmark                            old allocs    new allocs    delta
BenchmarkGetPodMapForDeployment-12   241           238           -1.24%

benchmark                            old bytes     new bytes     delta
BenchmarkGetPodMapForDeployment-12   554025        28956         -94.77%
```
There are several cases where the ReplicaCalculator decides not to
change the current scale. Two important ones are when missing metrics
might change the direction of scaling, and when the recommended scale
is within tolerance of the current scale.
The way the ReplicaCalculator signals its desire not to change the
current scale is by returning the current scale. However, the current
scale comes from scale.Status.Replicas, which can be larger than
scale.Spec.Replicas (e.g. during a Deployment rollout with a configured
surge). This causes a positive feedback loop, because
scale.Status.Replicas is written back into scale.Spec.Replicas, further
increasing the current scale.
This PR fixes the feedback loop by plumbing the replica count from
spec through horizontal.go and replica_calculator.go so the calculator
can punt with the right value.
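A sketch of the idea with a simplified signature (not the real
ReplicaCalculator API): the replica count from spec is plumbed in so
that a "no change" decision returns the spec value, not the status
value.

```go
package example

import "math"

// desiredReplicas sketches the fix: when the usage ratio is within
// tolerance, punt by returning the replica count from scale.Spec.Replicas
// rather than scale.Status.Replicas, so writing the result back into spec
// cannot inflate it.
func desiredReplicas(specReplicas, measuredReplicas int32, usageRatio, tolerance float64) int32 {
	if math.Abs(1.0-usageRatio) <= tolerance {
		return specReplicas // within tolerance: keep the *specified* scale
	}
	// Otherwise scale proportionally to the pods the ratio was measured over.
	return int32(math.Ceil(usageRatio * float64(measuredReplicas)))
}
```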