During EndpointSlice reconciliation, EndpointSliceTracker is supposed to
track expected EndpointSlice resource versions so that external changes
to them can be detected. It actually tracked the stale resource version,
however, so every Service was handled twice: the controller always
received an EndpointSlice update with a different resource version, even
though that update had been made by the controller itself during the
first pass.
Using t.Run() has some advantages:
- individual sub-tests can be selected with -run
- test failures are automatically associated with
the scenario, therefore the test name no longer
needs to be passed around
A "run" function is used primarily to avoid large indention changes,
but also because this can be used to extend the test with additional
parameters later on.
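A minimal sketch of the pattern, using a hypothetical table-driven test (the `testCase` type and `run` helper here are illustrative, not this commit's actual code):

```go
package controller

import "testing"

type testCase struct {
	name string
	in   int
	want int
}

func TestDouble(t *testing.T) {
	for _, tc := range []testCase{
		{name: "zero", in: 0, want: 0},
		{name: "positive", in: 2, want: 4},
	} {
		// Each scenario becomes a named sub-test.
		t.Run(tc.name, func(t *testing.T) {
			run(t, tc)
		})
	}
}

// run holds the test body at top level, avoiding a large indentation
// change and leaving room for additional parameters later on.
func run(t *testing.T, tc testCase) {
	if got := tc.in * 2; got != tc.want {
		// t.Run already scopes the failure, so the scenario name
		// no longer needs to be passed around.
		t.Errorf("got %d, want %d", got, tc.want)
	}
}
```

Individual scenarios can then be selected with, e.g., `go test -run 'TestDouble/positive'`.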
Add logic in service_controller to skip create/update
if a finalizer from a different controller is found.
The newly added finalizer will be checked by other controllers
implementing ILB services to determine whether a given service is
already being managed by service_controller.
Moved the finalizer check into cloudprovider code.
Added a unit test to verify the new finalizer.
Modified an existing unit test to create a fake service so that
the attach/remove finalizer step can be tested.
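A minimal sketch of the skip check, with assumed names (the finalizer prefix and helper are illustrative, not this commit's exact code):

```go
import (
	"strings"

	v1 "k8s.io/api/core/v1"
)

// hasOtherLoadBalancerFinalizer reports whether the service carries a
// load-balancer finalizer owned by a different controller, in which case
// service_controller should skip create/update for it.
func hasOtherLoadBalancerFinalizer(svc *v1.Service, own string) bool {
	for _, f := range svc.Finalizers {
		if strings.HasPrefix(f, "service.kubernetes.io/") && f != own {
			return true
		}
	}
	return false
}
```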
A transient issue might occur that causes an error to be returned by
InstanceID(). When this is ignored, the external cloud provider taint
will be removed and neither AddCloudNode() nor UpdateCloudNode() will
try to set a provider ID in the future.
By returning the error we can ensure that the external cloud provider
taint is not removed prematurely, allowing the operation to be retried
(until the provider ID can be set).
Preserve support for external cloud providers that do not use IDs by
continuing if a NotImplemented error is returned, making a distinction
between lack of support for provider IDs and an actual error.
Introduce pair of unit tests that show a provider ID will eventually
be set if an error is returned, unless that error is a NotImplemented,
in which case the external cloud provider taint will be removed.
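A sketch of the resulting control flow (the helper name is illustrative; the NotImplemented sentinel is how k8s.io/cloud-provider exposes the "not supported" case):

```go
import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	cloudprovider "k8s.io/cloud-provider"
)

// getProviderID returns transient errors to the caller so the cloud taint
// stays in place and the operation is retried, while NotImplemented means
// the provider has no instance IDs and processing may continue without one.
func getProviderID(ctx context.Context, instances cloudprovider.Instances, name types.NodeName) (string, error) {
	id, err := instances.InstanceID(ctx, name)
	if err == cloudprovider.NotImplemented {
		// No provider ID support: distinct from a real error, so the
		// taint may still be removed.
		return "", nil
	}
	if err != nil {
		// Possibly transient: keep the taint and retry later.
		return "", err
	}
	return id, nil
}
```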
The `recorder.PastEventf` method wasn't actually working as advertised.
It was supposed to accept a timestamp, which would be used when
generating the event. However, as the
[source code](https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/record/event.go#L316)
shows, this `timestamp` was never actually used.
In other words, `PastEventf` is identical to `Eventf`.
We have two options: one would be to fix `PastEventf` so that it works
as advertised. The other would be to delete `PastEventf` and only
support `Eventf`.
Ultimately, I could only find one use of `PastEventf` in the code base,
so I propose we just delete `PastEventf` and convert all uses to
`Eventf`.
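The conversion is mechanical; a sketch with an illustrative object and message:

```go
import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

// recordStart demonstrates the change: PastEventf's timestamp argument was
// ignored, so dropping it does not change behavior.
func recordStart(recorder record.EventRecorder, node *v1.Node) {
	// Before: recorder.PastEventf(node, metav1.Now(), v1.EventTypeNormal, "Starting", "Starting kubelet.")
	recorder.Eventf(node, v1.EventTypeNormal, "Starting", "Starting kubelet.")
}
```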
This adds a new EndpointSlice tracker to keep track of the expected resource versions of EndpointSlices associated with each Service managed by the EndpointSlice controller. This should prevent a potential race where a syncService call could happen with an incomplete view of EndpointSlices if additions or deletions hadn't fully propagated to the cache yet. Additionally, this ensures that external changes to EndpointSlices will be handled by the EndpointSlice controller.
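A minimal sketch of such a tracker (not the upstream implementation): per Service it remembers the resource version last written for each EndpointSlice, so a cached slice with a different version reveals either an external change or a not-yet-propagated cache.

```go
import (
	"sync"

	discovery "k8s.io/api/discovery/v1beta1"
	"k8s.io/apimachinery/pkg/types"
)

type sliceTracker struct {
	mu sync.Mutex
	// service key -> slice name -> expected resource version
	expected map[types.NamespacedName]map[string]string
}

func newSliceTracker() *sliceTracker {
	return &sliceTracker{expected: map[types.NamespacedName]map[string]string{}}
}

// update records the resource version the controller just wrote.
func (t *sliceTracker) update(svc types.NamespacedName, slice *discovery.EndpointSlice) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.expected[svc] == nil {
		t.expected[svc] = map[string]string{}
	}
	t.expected[svc][slice.Name] = slice.ResourceVersion
}

// has reports whether the cached slice matches what the controller expects.
func (t *sliceTracker) has(svc types.NamespacedName, slice *discovery.EndpointSlice) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.expected[svc][slice.Name] == slice.ResourceVersion
}
```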
This ended up causing far more problems than it was worth, especially
given that it just attempted to provide backwards compatibility with
the alpha release.
This patch removes pkg/util/mount completely, and replaces it with the
mount package now located at k8s.io/utils/mount. The code found at
k8s.io/utils/mount was moved there from pkg/util/mount, so the code is
identical, just no longer in-tree to k/k.
update tests
add comment
amend var name
update comment
add check for empty slice
fix tests
fix mask size in test
review feedback
add ipv4 and ipv6 flag for mask sizes
add to violation exception list
remove import alias
run update-openapi-spec
review feedback
run update-bazel
review feedback
review feedback
This patch removes mount.Exec entirely and instead uses the common
utility from k8s.io/utils/exec.
The fake exec implementation found in k8s.io/utils/exec differs a bit
from mount.Exec, with the ability to pre-script expected calls to
Command.CombinedOutput(), so tests that previously relied on a callback
mechanism to produce specific output have been updated to use that
mechanism, as sketched below.
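A sketch of that pre-scripting, using the testing helpers from k8s.io/utils/exec (the canned output value is illustrative):

```go
import (
	"k8s.io/utils/exec"
	fakeexec "k8s.io/utils/exec/testing"
)

// newFakeExec scripts one expected command whose CombinedOutput() returns
// a canned result, replacing the old callback-based fake.
func newFakeExec() *fakeexec.FakeExec {
	fcmd := fakeexec.FakeCmd{
		CombinedOutputScript: []fakeexec.FakeAction{
			// The first CombinedOutput() call returns this output.
			func() ([]byte, []byte, error) { return []byte("/dev/sdb on /mnt"), nil, nil },
		},
	}
	return &fakeexec.FakeExec{
		CommandScript: []fakeexec.FakeCommandAction{
			func(cmd string, args ...string) exec.Cmd {
				return fakeexec.InitFakeCmd(&fcmd, cmd, args...)
			},
		},
	}
}
```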
This adds a new Label to EndpointSlices that will ensure that multiple
controllers or entities can manage subsets of EndpointSlices. This
label provides a way to indicate the controller or entity responsible
for managing an EndpointSlice.
To provide a seamless upgrade from the alpha release of EndpointSlices
that did not support this label, a temporary annotation has been added
on Services to indicate that this label has been initially set on
EndpointSlices. That annotation will be set automatically by the
EndpointSlice controller with this commit once appropriate Labels have
been added on the corresponding EndpointSlices.
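For illustration, setting the label might look like the following (the key and value are recalled from the EndpointSlice KEP and should be treated as assumptions, not verified constants):

```go
import discovery "k8s.io/api/discovery/v1beta1"

// labelSlice marks an EndpointSlice as managed by the EndpointSlice
// controller so other entities know to leave it alone.
func labelSlice(slice *discovery.EndpointSlice) {
	if slice.Labels == nil {
		slice.Labels = map[string]string{}
	}
	slice.Labels["endpointslice.kubernetes.io/managed-by"] = "endpointslice-controller.k8s.io"
}
```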
This was an oversight in the initial EndpointSlice release. This update
will ensure that Endpoints and EndpointSlices use the same logic to set
the Hostname attribute.
The Service spec includes a PublishNotReadyAddresses field which has
been used by Endpoints to report all matching resources ready. This may
or may not have been the initial purpose of the field, but given the
desire to provide backwards compatibility with the Endpoints API here,
it seems to make sense to continue to provide the same functionality.
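The resulting readiness rule reduces to a sketch like this (pod readiness simplified to a bool):

```go
import v1 "k8s.io/api/core/v1"

// isEndpointReady reports every endpoint as ready when the Service opts
// into PublishNotReadyAddresses, mirroring the Endpoints behavior.
func isEndpointReady(svc *v1.Service, podReady bool) bool {
	return svc.Spec.PublishNotReadyAddresses || podReady
}
```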
Instead of reporting an event or displaying an error, simply exit
when the namespace is being terminated. This reduces the amount of
controller churn on namespace shutdown. Unlike other controllers, we
drop the replica set create error very late (in the queue handleErr)
in order to avoid changing the structure of the controller
substantially.
Instead of reporting an event or displaying an error, simply exit
when the namespace is being terminated. This reduces the amount of
controller churn on namespace shutdown. While we could technically
exit the entire processing loop early for very large jobs,
we should wait for more evidence that this is an issue before changing
that logic substantially.
Instead of reporting an event or displaying an error, simply exit
when the namespace is being terminated. This reduces the amount of
controller churn on namespace shutdown. While we could technically
exit the entire processing loop early for very large daemon sets,
we should wait for more evidence that this is an issue before changing
that logic substantially.
Instead of reporting an event or displaying an error, simply exit
when the namespace is being terminated. This reduces the amount of
controller churn on namespace shutdown. While we could technically
exit the entire processing loop early for very large replica sets,
we should wait for more evidence that this is an issue before changing
that logic substantially.
In some scenarios the service account and token controllers can
race with namespace deletion, causing a burst of errors as they
attempt to recreate secrets being deleted.
Instead, detect these errors and do not retry.
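The detection common to this and the preceding commits can be sketched as a single predicate:

```go
import (
	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// isNamespaceTerminating reports whether a create was rejected because the
// namespace is going away; such errors are dropped rather than retried.
func isNamespaceTerminating(err error) bool {
	return apierrors.HasStatusCause(err, v1.NamespaceTerminatingCause)
}
```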
When scaling down a ReplicaSet, delete doubled up replicas first, where a
"doubled up replica" is defined as one that is on the same node as an
active replica belonging to a related ReplicaSet. ReplicaSets are
considered "related" if they have a common controller (typically a
Deployment).
The intention of this change is to make a rolling update of a Deployment
scale down the old ReplicaSet as it scales up the new ReplicaSet by
deleting pods from the old ReplicaSet that are colocated with ready pods of
the new ReplicaSet. This change in the behavior of rolling updates can be
combined with pod affinity rules to preserve the locality of a Deployment's
pods over rollout.
A specific scenario that benefits from this change is when a Deployment's
pods are exposed by a Service that has type "LoadBalancer" and external
traffic policy "Local". In this scenario, the load balancer uses health
checks to determine whether it should forward traffic for the Service to a
particular node. If the node has no local endpoints for the Service, the
health check will fail for that node. Eventually, the load balancer will
stop forwarding traffic to that node. In the meantime, the service proxy
drops traffic for that Service. Thus, in order to reduce risk of dropping
traffic during a rolling update, it is desirable to preserve node locality of
endpoints.
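The core of the ranking can be sketched as follows (names are illustrative, and "active" is simplified here to running and not deleted):

```go
import v1 "k8s.io/api/core/v1"

// rankByColocatedRelated gives each pod a rank equal to the number of
// active related pods on its node; higher ranks sort earlier for deletion,
// so doubled-up replicas are removed first.
func rankByColocatedRelated(pods, relatedPods []*v1.Pod) []int {
	activeOnNode := map[string]int{}
	for _, p := range relatedPods {
		if p.Status.Phase == v1.PodRunning && p.DeletionTimestamp == nil {
			activeOnNode[p.Spec.NodeName]++
		}
	}
	ranks := make([]int, len(pods))
	for i, p := range pods {
		ranks[i] = activeOnNode[p.Spec.NodeName]
	}
	return ranks
}
```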
* pkg/controller/controller_utils.go (ActivePodsWithRanks): New type to
sort pods using a given ranking.
* pkg/controller/controller_utils_test.go (TestSortingActivePodsWithRanks):
New test for ActivePodsWithRanks.
* pkg/controller/replicaset/replica_set.go
(getReplicaSetsWithSameController): New method. Given a ReplicaSet, return
all ReplicaSets that have the same owner.
(manageReplicas): Call getIndirectlyRelatedPods, and pass its result to
getPodsToDelete.
(getIndirectlyRelatedPods): New method. Given a ReplicaSet, return all
pods that are owned by any ReplicaSet with the same owner.
(getPodsToDelete): Add an argument for related pods. Use related pods and
the new getPodsRankedByRelatedPodsOnSameNode function to take into account
whether a pod is doubled up when sorting pods for deletion.
(getPodsRankedByRelatedPodsOnSameNode): New function. Return an
ActivePodsWithRanks value that wraps the given slice of pods and computes
ranks where each pod's rank is equal to the number of active related pods
that are colocated on the same node.
* pkg/controller/replicaset/replica_set_test.go (newReplicaSet): Set
OwnerReferences on the ReplicaSet.
(newPod): Set a unique UID on the pod.
(byName): New type to sort pods by name.
(TestGetReplicaSetsWithSameController): New test for
getReplicaSetsWithSameController.
(TestRelatedPodsLookup): New test for getIndirectlyRelatedPods.
(TestGetPodsToDelete): Augment the "various pod phases and conditions, diff
= len(pods)" test case to ensure that scale-down still selects doubled-up
pods if there are not enough other pods to scale down. Add a "various pod
phases and conditions, diff = len(pods), relatedPods empty" test case to
verify that getPodsToDelete works even if related pods could not be
determined. Add a "ready and colocated with another ready pod vs not
colocated, diff < len(pods)" test case to verify that a doubled-up pod gets
preferred for deletion. Augment the "various pod phases and conditions,
diff < len(pods)" test case to ensure that not-ready pods are preferred
over ready but doubled-up pods.
* pkg/controller/replicaset/BUILD: Regenerate.
* test/e2e/apps/deployment.go
(testRollingUpdateDeploymentWithLocalTrafficLoadBalancer): New end-to-end
test. Create a deployment with a rolling update strategy and affinity
rules and a load balancer with "Local" external traffic policy, and verify
that the set of nodes with local endpoints for the service remains unchanged
during rollouts.
(setAffinity): New helper, used by
testRollingUpdateDeploymentWithLocalTrafficLoadBalancer.
* test/e2e/framework/service/jig.go (GetEndpointNodes): Factor building the
set of node names out...
(GetEndpointNodeNames): ...into this new method.
In addition, create a similar method that doesn't copy objects.
Benchmark for the new no-copy method vs the old one:
```
benchmark old ns/op new ns/op delta
BenchmarkGetControllerOf-12 214 14.8 -93.08%
benchmark old allocs new allocs delta
BenchmarkGetControllerOf-12 1 0 -100.00%
benchmark old bytes new bytes delta
BenchmarkGetControllerOf-12 80 0 -100.00%
```
Benchmark for the new (copy) method vs the old one:
```
benchmark old ns/op new ns/op delta
BenchmarkGetControllerOf-12 128 114 -10.94%
benchmark old allocs new allocs delta
BenchmarkGetControllerOf-12 1 1 +0.00%
benchmark old bytes new bytes delta
BenchmarkGetControllerOf-12 80 80 +0.00%
```
Overall there is a 10% improvement for the new (copy) method over the old
one, and a huge improvement (over 10x) for the new (no-copy) method.
I changed IsControlledBy and a few other methods to use the new (no-copy) method.
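The no-copy variant amounts to returning a pointer into the existing OwnerReferences slice rather than allocating a copy, which is where the allocation in the benchmark disappears; a sketch:

```go
import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// getControllerOfNoCopy returns the controlling OwnerReference without
// copying it. Callers must not mutate the result, since it aliases the
// object's own slice.
func getControllerOfNoCopy(obj metav1.Object) *metav1.OwnerReference {
	refs := obj.GetOwnerReferences()
	for i := range refs {
		if refs[i].Controller != nil && *refs[i].Controller {
			return &refs[i]
		}
	}
	return nil
}
```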
syncService shouldn't return an error if the service doesn't exist, which
means the sync was triggered by service deletion; otherwise the service
would be enqueued repeatedly even though its cleanup has already been
executed successfully.
This patch makes syncService return nil if the error from getting the
service is NotFound, like the other controllers do.
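A sketch of the fixed flow, with stand-in names for the lister and cleanup helper:

```go
import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// syncService treats NotFound as deletion: run cleanup and return nil so
// the key is not requeued forever.
func syncService(lister corelisters.ServiceLister, namespace, name string, cleanup func() error) error {
	_, err := lister.Services(namespace).Get(name)
	if apierrors.IsNotFound(err) {
		return cleanup()
	}
	if err != nil {
		return err
	}
	// ... normal service processing ...
	return nil
}
```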
This should fix a bug that could break masters when the EndpointSlice
feature gate was enabled. This was all tied to how the apiserver creates
and manages its own services and endpoints (or in this case endpoint
slices). Consumers of endpoint slices also need to know about the
corresponding service. Previously we were trying to set an owner
reference here for this purpose, but that came with potential downsides
and increased complexity. This commit changes behavior of the apiserver
endpointslice integration to set the service name label instead of owner
references, and simplifies consumer logic to reference that (both are
set by the EndpointSlice controller).
Additionally, this should fix a bug with the EndpointSlice GenerateName
value that had previously been set with a "." as a suffix.
This tests the PDB status update path in DisruptionController and
asserts that conflicting writes (with the eviction handler) are handled
gracefully.
This adds the client-go fake.Clientset into our tests, because that is
the layer required for injecting update failures.
This also adds a TestMain so that DisruptionController logs can be
enabled during test. e.g.,
go test ./pkg/controller/disruption -v -args -v=4
This changes the retry logic in DisruptionController so that it
reconciles update conflicts. In the old behavior, any pdb status update
failure was retried with the same status, regardless of error.
Now there is no retry logic with the status update. The error is passed
up the stack where the PDB can be requeued for processing.
If the PDB status update error is a conflict error, there are some new
special cases:
- failSafe is not triggered, since this is considered a retryable error
- the PDB is requeued immediately (ignoring the rate limiter) because we
assume that conflict can be resolved by getting the latest version
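A sketch of that dispatch, with the workqueue standing in for the controller's actual queue handling:

```go
import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/util/workqueue"
)

// handleStatusUpdateErr requeues conflicts immediately (the next sync will
// fetch the latest PDB) and rate-limits everything else.
func handleStatusUpdateErr(queue workqueue.RateLimitingInterface, key string, err error) {
	if err == nil {
		queue.Forget(key)
		return
	}
	if apierrors.IsConflict(err) {
		// Retryable: do not trigger failSafe, skip the rate limiter.
		queue.Add(key)
		return
	}
	queue.AddRateLimited(key)
}
```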
The current mechanism for excluding "master" nodes based on node names
is fragile and should be fixed by using a label exclusion similar to
service load balancers. The legacy code path is preserved behind a
defaulted-on gate and will be removed in the future.
The service load balancer controller should honor the
LegacyNodeRoleBehavior feature gate for checks that use node-roles,
switch to using a non-alpha annotation behind the gate, and prepare
to graduate to beta.
Add live list of pods to PVC protection controller, as opposed to doing
only a cache-based list through the Informer. Both lists are performed
while processing a PVC with deletionTimestamp set to check whether Pods
using the PVC exist and remove the finalizer to enable deletion of the
PVC if that's not the case. Prior to this commit only the cache-based
list was done but that's unreliable because a pod using the PVC might
exist but not be in the cache just yet. On the other hand, the live
list is 100% reliable.
Note that it would be enough to do only the live list. Instead, this
commit adds it after the cache-based list and performs it only if the
latter finds no Pod blocking deletion of the PVC being processed. The
rationale is that live lists are expensive and it's desirable to
minimize them. The drawback is that if, at the time of the cache-based
list, the cache has not yet been notified of the deletion of a Pod using
the PVC, the PVC is kept. Correctness is not compromised because the
finalizer will be removed when the Pod deletion notification is
received, but PVC deletion is delayed. Reducing live lists
was valued more than deleting PVCs slightly faster.
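A sketch of the two-phase check (both list functions are stand-ins for the informer cache and a live API call):

```go
import v1 "k8s.io/api/core/v1"

// pvcInUse runs the cheap cache-based list first and only falls back to
// the expensive live list to confirm a negative result, since a pod
// missing from the cache may still exist on the server.
func pvcInUse(cacheList, liveList func() ([]*v1.Pod, error), pvc string) (bool, error) {
	usesPVC := func(pods []*v1.Pod) bool {
		for _, pod := range pods {
			for _, vol := range pod.Spec.Volumes {
				if vol.PersistentVolumeClaim != nil && vol.PersistentVolumeClaim.ClaimName == pvc {
					return true
				}
			}
		}
		return false
	}
	pods, err := cacheList()
	if err != nil {
		return false, err
	}
	if usesPVC(pods) {
		return true, nil
	}
	pods, err = liveList()
	if err != nil {
		return false, err
	}
	return usesPVC(pods), nil
}
```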
Also, add a unit test that fails without the change introduced by this
commit and revamp old unit tests. The latter is needed because expected
behavior is described in terms of API calls the controller makes, and
this commit introduces new API calls (the live lists).
Before this change, an object would be added to the `attemptToDelete` queue as soon as `gc` detected the deletion transition, simply by checking whether `deletionTimestamp` was set. After the change, `gc` also checks whether the `foregroundDeletion` finalizer has been set before adding the item to the queue.
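A sketch of the new predicate (the helper name is illustrative; the finalizer constant is apimachinery's):

```go
import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// beingForegroundDeleted requires both the deletionTimestamp and the
// foregroundDeletion finalizer before the item is queued to attemptToDelete.
func beingForegroundDeleted(obj metav1.Object) bool {
	if obj.GetDeletionTimestamp() == nil {
		return false
	}
	for _, f := range obj.GetFinalizers() {
		if f == metav1.FinalizerDeleteDependents { // "foregroundDeletion"
			return true
		}
	}
	return false
}
```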
During a Deployment update there may be more Pods in the scale target
ref status than in the spec. This test verifies that we do not scale
to the status value. Instead we should stay at the spec value.
Fails before #79035 and passes after.
This is used when the cloud provider layer does not implement load balancer services.
The implementation will be in a different controller running on the master.
Make the PVC protection controller robust to cases where a Pod X is deleted,
then a Pod Y with the same namespaced name is created and the two events are
delivered via a single update notification. Both pods should be processed,
because X might be blocking deletion of a PVC which is not referenced by Y.
Prior to this commit only the newer pod is processed, which means that it
is possible to leak PVCs.
Also, add unit tests to reflect the change.
Add support for scaling to zero pods
minReplicas is allowed to be zero
condition is set once
Based on https://github.com/kubernetes/kubernetes/pull/61423
set original valid condition
add scale to/from zero and invalid metric tests
Scaling up from zero pods ignores tolerance
validate metrics when minReplicas is 0
Document HPA behaviour when minReplicas is 0
Documented minReplicas field in autoscaling APIs
All controllers in controller-manager that deal with objects generically
work with those objects without needing the full object. Update the GC
and quota controller to use PartialObjectMetadata input objects which
is faster and more efficient.
The metadata client uses protobuf and returns only a subset of object
data (the metadata) which allows operations that act only on objects
generically to work much faster. Use the metadata client in the
namespace controller to reduce the amount of work the namespace controller
has to do in large namespaces.
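A sketch of metadata-client usage with a recent client-go (the resource and namespace are illustrative):

```go
import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/metadata"
	"k8s.io/client-go/rest"
)

// listPodMetadata lists pods but receives only PartialObjectMetadata over
// protobuf, so just names, labels, and other metadata cross the wire.
func listPodMetadata(cfg *rest.Config, namespace string) error {
	client, err := metadata.NewForConfig(cfg)
	if err != nil {
		return err
	}
	gvr := schema.GroupVersionResource{Version: "v1", Resource: "pods"}
	list, err := client.Resource(gvr).Namespace(namespace).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, item := range list.Items {
		fmt.Println(item.Name) // only metadata is available here
	}
	return nil
}
```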
This change adds pending pods to the ignored set first before
selecting pods missing metrics. Pending pods are always ignored when
calculating scale.
When the HPA decides which pods and metric values to take into account
when scaling, it divides the pods into three disjoint subsets: 1)
ready 2) missing metrics and 3) ignored. First the HPA selects pods
which are missing metrics. Then it selects pods that should be ignored
because they are not ready yet or are still consuming CPU during
initialization. All the remaining pods go into the ready set. After
the HPA has decided what direction it wants to scale based on the
ready pods, it considers what might have happened if it had the
missing metrics. It makes a conservative guess about what the missing
metrics might have been: 0% if it wants to scale up, 100% if it wants
to scale down. This is a good thing when scaling up, because newly
added pods will likely help reduce the usage ratio, even though their
metrics are missing at the moment. The HPA should wait to see the
results of its previous scale decision before it makes another
one. However when scaling down, it means that many missing metrics can
pin the HPA at high scale, even when load is completely removed. In
particular, when there are many unschedulable pods due to insufficient
cluster capacity, the many missing metrics (assumed to be 100%) can
cause the HPA to avoid scaling down indefinitely.
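The conservative guess reduces to a sketch like this (utilization expressed as a fraction of the target; names are illustrative):

```go
// fillMissing assumes the worst for missing pods: 0% when the ready pods
// point toward scaling up, 100% when they point toward scaling down, so a
// scale decision is never amplified by pods we know nothing about.
func fillMissing(usageRatio float64, metrics map[string]float64, missing []string) {
	for _, pod := range missing {
		if usageRatio > 1.0 { // scale up indicated by ready pods
			metrics[pod] = 0.0
		} else { // scale down indicated
			metrics[pod] = 1.0
		}
	}
}
```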
As the benchmark shows, this speeds up the method ~4x and reduces memory
consumption ~20x.
```
benchmark old ns/op new ns/op delta
BenchmarkGetPodMapForDeployment-12 276121 72591 -73.71%
benchmark old allocs new allocs delta
BenchmarkGetPodMapForDeployment-12 241 238 -1.24%
benchmark old bytes new bytes delta
BenchmarkGetPodMapForDeployment-12 554025 28956 -94.77%
```
There are several cases in which ReplicaCalculator decides not to change
the current scale. Two important ones are when missing metrics might
change the direction of scaling, and when the recommended scale is
within tolerance of the current scale.
The way that ReplicaCalculator signals its desire to not change the
current scale is by returning the current scale. However the current
scale is from scale.Status.Replicas and can be larger than
scale.Spec.Replicas (e.g. during Deployment rollout with configured
surge). This causes a positive feedback loop because
scale.Status.Replicas is written back into scale.Spec.Replicas,
further increasing the current scale.
This PR fixes the feedback loop by plumbing the replica count from
spec through horizontal.go and replica_calculator.go so the calculator
can punt with the right value.
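A deliberately simplified sketch of the idea (the real calculator involves more inputs; this only shows which replica count is used for the punt):

```go
import "math"

// desiredReplicas punts with the spec's replica count when within
// tolerance, so a temporarily surged status.Replicas is never echoed back
// into spec.Replicas.
func desiredReplicas(specReplicas, readyPods int32, usageRatio, tolerance float64) int32 {
	if math.Abs(1.0-usageRatio) <= tolerance {
		return specReplicas // not scale.Status.Replicas
	}
	return int32(math.Ceil(usageRatio * float64(readyPods)))
}
```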