Previously, when `GetObjectMetricReplicas` calculated the desired
replica count, it multiplied the usage ratio by the current number of replicas.
This method caused over-scaling when there were pods that were not ready
for a long period of time. For example, if there were pods A, B, and C,
and only pod A was ready, and the usage ratio was 500%, we would
previously specify 15 pods as the desired replicas (even though really
only one pod was handling the load).
After this change, we now multiple the usage
ratio by the number of ready pods for `GetObjectMetricReplicas`.
In the example above, we'd only desire 5 replica pods.
This change gives `GetObjectMetricReplicas` the same behavior as the
other replica calculator methods. Only `GetExternalMetricReplicas` and
`GetExternalPerPodMetricRepliacs` still allow unready pods to impact the
number of desired replicas. I will fix this issue in the following
commit.
Currently, when performing a scale up, any failed pods (which can be present for example in case of evictions performed by kubelet) will be treated as unready. Unready pods are treated as if they had 0% utilization which will slow down or even block scale up.
After this change, failed pods are ignored in all calculations. This way they do not influence neither scale up nor scale down replica calculations.
There have been a couple of recent bugs in the "normalizing" part of the
`reconcileAutoscaler` method. This part of the code base is responsible
for, among other things, taking the suggested desired replicas based on
the metrics, ensuring it conforms to certain conditions, and updating it
if it does not. Isolate the part that converts the desired replicas
based on a given set of rules into its own function.
We are refactoring this part of the code base to make the logic simpler
and to make it easier to write unit tests.
This updates the HPA controller to use the polymorphic scale client from
client-go. This should enable HPAs to work with arbitrary scalable
resources, instead of just those in the extensions API group (meaning we
can deprecate the copy of ReplicationController in extensions/v1beta1).
It also means that the HPA controller now pays attention to the
APIVersion field in `scaleTargetRef` (more specifically, the group part
of it).
Note that currently, discovery information on which resources are
available where is only fetched once (the first time that it's
requested). In the future, we may want a refreshing discovery REST
mapper.
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Make HPA tolerance a flag
**What this PR does / why we need it**:
Make HPA tolerance configurable as a flag. This change allows us to use
different tolerance values in production/testing.
**Which issue this PR fixes**:
Fixes#18155
**Release note:**
```release-note
Control HPA tolerance through the `horizontal-pod-autoscaler-tolerance` flag.
```
Signed-off-by: mattjmcnaughton <mattjmcnaughton@gmail.com>
Fix#53670
Fix a bug where `desiredReplicas` could be greater than `maxReplicas`
if the original value for `desiredReplicas > scaleUpLimit` and
`scaleUpLimit > maxReplicas`. Previously, when that happened, we would
scale up to `scaleUpLimit`, and then in the next auto-scaling run, scale
down to `maxReplicas`. Address this issue and introduce a regression
test.
Previously
`pkg.controller.podautoscaler.UnsafeConvertToVersion` was
exported. However, it was never used outside of the `podautoscaler`
package. Make it private to prevent confusion.
Additionally, move the two private functions in `horizontal.go` to be
with the other private functions at the bottom of the file - imho its
more readable than having them directly at the top of the file, before
the public type and function definitions.
Fix#18155
Make HPA tolerance configurable as a flag. This change allows us to use
different tolerance values in production/testing.
Signed-off-by: mattjmcnaughton <mattjmcnaughton@gmail.com>
Address `golint` errors in `pkg/controller/podautoscaler`. Note,
I did not address issues around exported types/functions missing
comments, because I'm not sure what the convention within the k8s project is.
Signed-off-by: mattjmcnaughton <mattjmcnaughton@gmail.com>
Automatic merge from submit-queue (batch tested with PRs 51956, 50708)
Move autoscaling/v2 from alpha1 to beta1
This graduates autoscaling/v2alpha1 to autoscaling/v2beta1. The move is more-or-less just a straightforward rename.
Part of kubernetes/features#117
```release-note
v2 of the autoscaling API group, including improvements to the HorizontalPodAutoscaler, has moved from alpha1 to beta1.
```
This commit only sends updates if the status has actually changed.
Since the HPA runs at a regular interval, this should reduce the volume
of writes, especially on short HPA intervals with relatively constant
metrics.
This commit causes the HPA controller to set a variety of status
conditions using the new `Status.Conditions` field of
autoscaling/v2alpha1. These provide insight into the current state
of the HPA, and generally correspond to similar events being emitted.
The new fake client properly represents the resource of `PodMetrics` as
"pods" and the resource of `NodeMetrics` as "nodes". Previously, it
used "podmetricses" and "nodemetrics", respectively.
This fixes up `horizontal_test.go` and `replica_calc_test.go` to use the
new names.
Since the HPA controller pulls information from an external source that
makes no guarantees about consistency, it's possible for the HPA
to get into an infinite update loop -- if the metrics change with
every query, the HPA controller will run it's normal reconcilation,
post a status update, see that status update itself, fetch new metrics,
and if those metrics are different, post another status update, and
repeat. This can lead to continuously updating a single HPA.
By rate-limiting each HPA to once per sync interval, we prevent this
from happening.
This commit switches over the HPA controller to use the custom metrics
API. It also converts the HPA controller to use the generated client
in k8s.io/metrics for the resource metrics API.
In order to enable support, you must enable
`--horizontal-pod-autoscaler-use-rest-clients` on the
controller-manager, which will switch the HPA controller's MetricsClient
implementation over to use the standard rest clients for both custom
metrics and resource metrics. This requires that at the least resource
metrics API is registered with kube-aggregator, and that the controller
manager is pointed at kube-aggregator. For this to work, Heapster
must be serving the new-style API server (`--api-server=true`).
This commit converts the HPA controller over to using the new version of
the HorizontalPodAutoscaler object found in autoscaling/v2alpha1. Note
that while the autoscaler will accept requests for object metrics, the
scale client will return an error on attempts to get object metrics
(since that requires the new custom metrics API, which is not yet
implemented).
This also enables the HPA object in v2alpha1 as a retrievable API
version by default.
Automatic merge from submit-queue
rescale immediately if the basic constraints are not satisfied
refactor reconcileAutoscaler.
If the basic constraints are not satisfied, we should rescale the target ref immediately.
Currently, the HPA considers unready pods the same as ready pods when
looking at their CPU and custom metric usage. However, pods frequently
use extra CPU during initialization, so we want to consider them
separately.
This commit causes the HPA to consider unready pods as having 0 CPU
usage when scaling up, and ignores them when scaling down. If, when
scaling up, factoring the unready pods as having 0 CPU would cause a
downscale instead, we simply choose not to scale. Otherwise, we simply
scale up at the reduced amount caculated by factoring the pods in at
zero CPU usage.
The effect is that unready pods cause the autoscaler to be a bit more
conservative -- large increases in CPU usage can still cause scales,
even with unready pods in the mix, but will not cause the scale factors
to be as large, in anticipation of the new pods later becoming ready and
handling load.
Similarly, if there are pods for which no metrics have been retrieved,
these pods are treated as having 100% of the requested metric when
scaling down, and 0% when scaling up. As above, this cannot change the
direction of the scale.
This commit also changes the HPA to ignore superfluous metrics -- as
long as metrics for all ready pods are present, the HPA we make scaling
decisions. Currently, this only works for CPU. For custom metrics, we
cannot identify which metrics go to which pods if we get superfluous
metrics, so we abort the scale.
Automatic merge from submit-queue
Additional go vet fixes
Mostly:
- pass lock by value
- bad syntax for struct tag value
- example functions not formatted properly
Here are a list of changes along with an explanation of how they work:
1. Add a new string field called TargetSelector to the external version of
extensions Scale type (extensions/v1beta1.Scale). This is a serialized
version of either the map-based selector (in case of ReplicationControllers)
or the unversioned.LabelSelector struct (in case of Deployments and
ReplicaSets).
2. Change the selector field in the internal Scale type (extensions.Scale) to
unversioned.LabelSelector.
3. Add conversion functions to convert from two external selector fields to a
single internal selector field. The rules for conversion are as follows:
i. If the target resource that this scale targets supports LabelSelector
(Deployments and ReplicaSets), then serialize the LabelSelector and
store the string in the TargetSelector field in the external version
and leave the map-based Selector field as nil.
ii. If the target resource only supports a map-based selector
(ReplicationControllers), then still serialize that selector and
store the serialized string in the TargetSelector field. Also,
set the the Selector map field in the external Scale type.
iii. When converting from external to internal version, parse the
TargetSelector string into LabelSelector struct if the string isn't
empty. If it is empty, then check if the Selector map is set and just
assign that map to the MatchLabels component of the LabelSelector.
iv. When converting from internal to external version, serialize the
LabelSelector and store it in the TargetSelector field. If only
the MatchLabel component is set, then also copy that value to
the Selector map field in the external version.
4. HPA now just converts the LabelSelector field to a Selector interface
type to list the pods.
5. Scale Get and Update etcd methods for Deployments and ReplicaSets now
return extensions.Scale instead of autoscaling.Scale.
6. Consequently, SubresourceGroupVersion override and is "autoscaling"
enabled check is now removed from pkg/master/master.go
7. Other small changes to labels package, fuzzer and LabelSelector
helpers to piece this all together.
8. Add unit tests to HPA targeting Deployments and ReplicaSets.
9. Add an e2e test to HPA targeting ReplicaSets.
The HPA controller had previously used a single Client
object to act as three different Namespacers. To improve
ease of extensibility and to make it clearer what the HPA
controller actually needs to use from the client, it should
use separate Namespacers for each of its needs (Scales, HPAs,
and Events).
This commit makes the HPA metrics client configurable in where
it looks for heapster instead of hard coding it to
"kube-system/heapster". The values of "kube-system/heapster"
are still recorded as constants in the metrics client package
for use as default values.