Signed-off-by: wangyysde <net_use@bzhy.com>
Generation swagger.json.
Use v2 path for hpa_cpu_field.
run update-codegen.sh
Signed-off-by: wangyysde <net_use@bzhy.com>
When a pod is deleted, it is given a deletion timestamp. However the
pod might still run for some time during graceful shutdown. During
this time it might still produce CPU utilization metrics and be in a
Running phase.
Currently the HPA replica calculator attempts to ignore deleted pods
by skipping over them. However by not adding them to the ignoredPods
set, their metrics are not removed from the average utilization
calculation. This allows pods in the process of shutting down to drag
down the recommmended number of replicas by producing near 0%
utilization metrics.
In fact the ignoredPods set is misnomer. Those pods are not fully
ignored. When the replica calculator recommends to scale up, 0%
utilization metrics are filled in for those pods to limit the scale
up. This prevents overscaling when pods take some time to startup. In
fact, there should be 4 sets considered (readyPods, unreadyPods,
missingPods, ignoredPods) not just 3.
This change renames ignoredPods as unreadyPods and leaves the scaleup
limiting semantics. Another set (actually) ignoredPods is added to
which delete pods are added instead of being skipped during
grouping. Both ignoredPods and unreadyPods have their metrics removed
from consideration. But only unreadyPods have 0% utilization metrics
filled in upon scaleup.
Add support for scaling to zero pods
minReplicas is allowed to be zero
condition is set once
Based on https://github.com/kubernetes/kubernetes/pull/61423
set original valid condition
add scale to/from zero and invalid metric tests
Scaling up from zero pods ignores tolerance
validate metrics when minReplicas is 0
Document HPA behaviour when minReplicas is 0
Documented minReplicas field in autoscaling APIs
This change adds pending pods to the ignored set first before
selecting pods missing metrics. Pending pods are always ignored when
calculating scale.
When the HPA decides which pods and metric values to take into account
when scaling, it divides the pods into three disjoint subsets: 1)
ready 2) missing metrics and 3) ignored. First the HPA selects pods
which are missing metrics. Then it selects pods should be ignored
because they are not ready yet, or are still consuming CPU during
initialization. All the remaining pods go into the ready set. After
the HPA has decided what direction it wants to scale based on the
ready pods, it considers what might have happened if it had the
missing metrics. It makes a conservative guess about what the missing
metrics might have been, 0% if it wants to scale up--100% if it wants
to scale down. This is a good thing when scaling up, because newly
added pods will likely help reduce the usage ratio, even though their
metrics are missing at the moment. The HPA should wait to see the
results of its previous scale decision before it makes another
one. However when scaling down, it means that many missing metrics can
pin the HPA at high scale, even when load is completely removed. In
particular, when there are many unschedulable pods due to insufficient
cluster capacity, the many missing metrics (assumed to be 100%) can
cause the HPA to avoid scaling down indefinitely.
current scale. Two important ones are when missing metrics might
change the direction of scaling, and when the recommended scale is
within tolerance of the current scale.
The way that ReplicaCalculator signals it's desire to not change the
current scale is by returning the current scale. However the current
scale is from scale.Status.Replicas and can be larger than
scale.Spec.Replicas (e.g. during Deployment rollout with configured
surge). This causes a positive feedback loop because
scale.Status.Replicas is written back into scale.Spec.Replicas,
further increasing the current scale.
This PR fixes the feedback loop by plumbing the replica count from
spec through horizontal.go and replica_calculator.go so the calculator
can punt with the right value.
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.
Change CPU sample sanitization in HPA.
**What this PR does / why we need it**:
Change CPU sample sanitization in HPA.
Ignore samples if:
- Pod is beeing initalized - 5 minutes from start defined by flag
- pod is unready
- pod is ready but full window of metric hasn't been colected since
transition
- Pod is initialized - 5 minutes from start defined by flag:
- Pod has never been ready after initial readiness period.
**Release notes:**
```release-note
Improve CPU sample sanitization in HPA by taking metric's freshness into account.
```
Ignore samples if:
- Pod is beeing initalized - 5 minutes from start defined by flag
- pod is unready
- pod is ready but full window of metric hasn't been colected since
transition
- Pod is initialized - 5 minutes from start defined by flag:
- Pod has never been ready after initial readiness period.
Automatic merge from submit-queue (batch tested with PRs 67067, 67947). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Do not count soft-deleted pods for scaling purposes in HPA controller
**What this PR does / why we need it**:
The metrics of "soft-deleted" pods in general to be deleted should probably not matter for scaling purposes, since they'll be gone "soon", whether they're nodelost or just normally delete.
As long as soft-deleted pods still exist, they prevent normal scale up.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes https://github.com/kubernetes/kubernetes/issues/62845
**Special notes for your reviewer**:
**Release note**:
```release-note
Stop counting soft-deleted pods for scaling purposes in HPA controller to avoid soft-deleted pods incorrectly affecting scale up replica count calculation.
```
Duration of initialization taint on CPU and window of initial readiness
setting controlled by flags.
Adding API violation exceptions following example of e50340ee23
After my previous changes HPA wasn't behaving correctly in the following
situation:
- Pods use a lot of CPU during initilization, become ready right after they initialize,
- Scale up triggers,
- When new pods become ready HPA counts their usage (even though it's not related to any work that needs doing),
- Another scale up, even though existing pods can handle work, no problem.
Instead discard metric values for pods that are unready and have never
been ready (they may report misleading values, the original reason for
introducing scale up forbidden window).
Use per pod metric when pod is:
- Ready, or
- Not ready but creation timestamp and last readiness change are more
than 10s apart.
In the latter case we asume the pod was ready but later became unready.
We want to use metrics for such pods because sometimes such pods are
unready because they were getting too much load.
Similar to the change we made for `GetObjectMetricReplicas` in the
previous commit. Ensure that `GetExternalMetricReplicas` does not
include unready pods when its determining how many replica it desires.
Including unready pods can lead to over-scaling.
We did not change the behavior of `GetExternalPerPodMetricReplicas`, as
it is slightly less clear what is the desired behavior. We did make some
small naming refactorings to this method, which will make it easier to
ignore unready pods if we decide we want to.
Previously, when `GetObjectMetricReplicas` calculated the desired
replica count, it multiplied the usage ratio by the current number of replicas.
This method caused over-scaling when there were pods that were not ready
for a long period of time. For example, if there were pods A, B, and C,
and only pod A was ready, and the usage ratio was 500%, we would
previously specify 15 pods as the desired replicas (even though really
only one pod was handling the load).
After this change, we now multiple the usage
ratio by the number of ready pods for `GetObjectMetricReplicas`.
In the example above, we'd only desire 5 replica pods.
This change gives `GetObjectMetricReplicas` the same behavior as the
other replica calculator methods. Only `GetExternalMetricReplicas` and
`GetExternalPerPodMetricRepliacs` still allow unready pods to impact the
number of desired replicas. I will fix this issue in the following
commit.
Currently, when performing a scale up, any failed pods (which can be present for example in case of evictions performed by kubelet) will be treated as unready. Unready pods are treated as if they had 0% utilization which will slow down or even block scale up.
After this change, failed pods are ignored in all calculations. This way they do not influence neither scale up nor scale down replica calculations.
Fix#18155
Make HPA tolerance configurable as a flag. This change allows us to use
different tolerance values in production/testing.
Signed-off-by: mattjmcnaughton <mattjmcnaughton@gmail.com>