Instead of reporting an event or displaying an error, simply exit
when the namespace is being terminated. This reduces the amount of
controller churn on namespace shutdown. Unlike other controllers, we
drop the replica set create error very late (in the queue handleErr)
in order to avoid changing the structure of the controller
substantially.
Instead of reporting an event or displaying an error, simply exit
when the namespace is being terminated. This reduces the amount of
controller churn on namespace shutdown. While we could technically
exit the entire processing loop early for very large jobs,
we should wait for more evidence that is an issue before changing
that logic substantially.
Instead of reporting an event or displaying an error, simply exit
when the namespace is being terminated. This reduces the amount of
controller churn on namespace shutdown. While we could technically
exit the entire processing loop early for very large daemon sets,
we should wait for more evidence that is an issue before changing
that logic substantially.
Instead of reporting an event or displaying an error, simply exit
when the namespace is being terminated. This reduces the amount of
controller churn on namespace shutdown. While we could technically
exit the entire processing loop early for very large replica sets,
we should wait for more evidence that is an issue before changing
that logic substantially.
In some scenarios the service account and token controllers can
race with namespace deletion, causing a burst of errors as they
attempt to recreate secrets being deleted.
Instead, detect these errors and do not retry.
When scaling down a ReplicaSet, delete doubled up replicas first, where a
"doubled up replica" is defined as one that is on the same node as an
active replica belonging to a related ReplicaSet. ReplicaSets are
considered "related" if they have a common controller (typically a
Deployment).
The intention of this change is to make a rolling update of a Deployment
scale down the old ReplicaSet as it scales up the new ReplicaSet by
deleting pods from the old ReplicaSet that are colocated with ready pods of
the new ReplicaSet. This change in the behavior of rolling updates can be
combined with pod affinity rules to preserve the locality of a Deployment's
pods over rollout.
A specific scenario that benefits from this change is when a Deployment's
pods are exposed by a Service that has type "LoadBalancer" and external
traffic policy "Local". In this scenario, the load balancer uses health
checks to determine whether it should forward traffic for the Service to a
particular node. If the node has no local endpoints for the Service, the
health check will fail for that node. Eventually, the load balancer will
stop forwarding traffic to that node. In the meantime, the service proxy
drops traffic for that Service. Thus, in order to reduce risk of dropping
traffic during a rolling update, it is desirable preserve node locality of
endpoints.
* pkg/controller/controller_utils.go (ActivePodsWithRanks): New type to
sort pods using a given ranking.
* pkg/controller/controller_utils_test.go (TestSortingActivePodsWithRanks):
New test for ActivePodsWithRanks.
* pkg/controller/replicaset/replica_set.go
(getReplicaSetsWithSameController): New method. Given a ReplicaSet, return
all ReplicaSets that have the same owner.
(manageReplicas): Call getIndirectlyRelatedPods, and pass its result to
getPodsToDelete.
(getIndirectlyRelatedPods): New method. Given a ReplicaSet, return all
pods that are owned by any ReplicaSet with the same owner.
(getPodsToDelete): Add an argument for related pods. Use related pods and
the new getPodsRankedByRelatedPodsOnSameNode function to take into account
whether a pod is doubled up when sorting pods for deletion.
(getPodsRankedByRelatedPodsOnSameNode): New function. Return an
ActivePodsWithRanks value that wraps the given slice of pods and computes
ranks where each pod's rank is equal to the number of active related pods
that are colocated on the same node.
* pkg/controller/replicaset/replica_set_test.go (newReplicaSet): Set
OwnerReferences on the ReplicaSet.
(newPod): Set a unique UID on the pod.
(byName): New type to sort pods by name.
(TestGetReplicaSetsWithSameController): New test for
getReplicaSetsWithSameController.
(TestRelatedPodsLookup): New test for getIndirectlyRelatedPods.
(TestGetPodsToDelete): Augment the "various pod phases and conditions, diff
= len(pods)" test case to ensure that scale-down still selects doubled-up
pods if there are not enough other pods to scale down. Add a "various pod
phases and conditions, diff = len(pods), relatedPods empty" test case to
verify that getPodsToDelete works even if related pods could not be
determined. Add a "ready and colocated with another ready pod vs not
colocated, diff < len(pods)" test case to verify that a doubled-up pod gets
preferred for deletion. Augment the "various pod phases and conditions,
diff < len(pods)" test case to ensure that not-ready pods are preferred
over ready but doubled-up pods.
* pkg/controller/replicaset/BUILD: Regenerate.
* test/e2e/apps/deployment.go
(testRollingUpdateDeploymentWithLocalTrafficLoadBalancer): New end-to-end
test. Create a deployment with a rolling update strategy and affinity
rules and a load balancer with "Local" external traffic policy, and verify
that set of nodes with local endponts for the service remains unchanged
during rollouts.
(setAffinity): New helper, used by
testRollingUpdateDeploymentWithLocalTrafficLoadBalancer.
* test/e2e/framework/service/jig.go (GetEndpointNodes): Factor building the
set of node names out...
(GetEndpointNodeNames): ...into this new method.
In addition create a similar method that doesn't copy objects.
Benchmark for the new no-copy method vs the old one:
```
benchmark old ns/op new ns/op delta
BenchmarkGetControllerOf-12 214 14.8 -93.08%
benchmark old allocs new allocs delta
BenchmarkGetControllerOf-12 1 0 -100.00%
benchmark old bytes new bytes delta
BenchmarkGetControllerOf-12 80 0 -100.00%
```
Benchamrk for the new (copy) method vs the old one:
```
benchmark old ns/op new ns/op delta
BenchmarkGetControllerOf-12 128 114 -10.94%
benchmark old allocs new allocs delta
BenchmarkGetControllerOf-12 1 1 +0.00%
benchmark old bytes new bytes delta
BenchmarkGetControllerOf-12 80 80 +0.00%
```
Overall there is a 10% improvement for the old vs new (copy) method and
huge improvent (x10) for the old vs new (no-copy).
I changed the IsControlledBy and a few other methods to use the new (no-copy) method.
syncService shouldn't return error if the service doesn't exist which
means it's triggered by service deletion, otherwise the service would be
enqueued repeatedly even its cleanup has been executed successfully.
This patch makes syncService return nil if the error is NotFound when
getting the service, like the other controllers do.
This should fix a bug that could break masters when the EndpointSlice
feature gate was enabled. This was all tied to how the apiserver creates
and manages it's own services and endpoints (or in this case endpoint
slices). Consumers of endpoint slices also need to know about the
corresponding service. Previously we were trying to set an owner
reference here for this purpose, but that came with potential downsides
and increased complexity. This commit changes behavior of the apiserver
endpointslice integration to set the service name label instead of owner
references, and simplifies consumer logic to reference that (both are
set by the EndpointSlice controller).
Additionally, this should fix a bug with the EndpointSlice GenerateName
value that had previously been set with a "." as a suffix.
This tests the PDB status update path in DisruptionController and
asserts that conflicting writes (with eviciton handler) are handled
gracefully.
This adds the client-go fake.Clientset into our tests, because that is
the layer required for injecting update failures.
This also adds a TestMain so that DisruptionController logs can be
enabled during test. e.g.,
go test ./pkg/controller/disruption -v -args -v=4
This changes the retry logic in DisruptionController so that it
reconciles update conflicts. In the old behavior, any pdb status update
failure was retried with the same status, regardless of error.
Now there is no retry logic with the status update. The error is passed
up the stack where the PDB can be requeued for processing.
If the PDB status update error is a conflict error, there are some new
special cases:
- failSafe is not triggered, since this is considered a retryable error
- the PDB is requeued immediately (ignoring the rate limiter) because we
assume that conflict can be resolved by getting the latest version
The current mechanism for excluding "master" nodes based on node names
is fragile and should be fixed by using a label exclusion similar to
service load balancers. The legacy code path is preserved behind a
defaulted-on gate and will be removed in the future.
The service load balancer controller should honor the
LegacyNodeRoleBehavior feature gate for checks that use node-roles,
switch to using a non alpha annotation behind the gate, and prepare
to graduate to beta.
Add live list of pods to PVC protection controller, as opposed to doing
only a cache-based list through the Informer. Both lists are performed
while processing a PVC with deletionTimestamp set to check whether Pods
using the PVC exist and remove the finalizer to enable deletion of the
PVC if that's not the case. Prior to this commit only the cache-based
list was done but that's unreliable because a pod using the PVC might
exist but not be in the cache just yet. On the other hand, the live
list is 100% reliable.
Note that it would be enough to do only the live list. Instead, this
commit adds it after the cache-based list and performs it only if the
latter finds no Pod blocking deletion of the PVC being processed. The
rationale is that live lists are expensive and it's desirable to
minimize them. The drawback is that if at the time of the cache-based
list the cache has not been notified yet of the deletion of a Pod using
the PVC the PVC is kept. Correctness is not compromised because the
finalizer will be removed when the Pod deletion notification is
received, but this means PVC deletion is delayed. Reducing live lists
was valued more than deleting PVCs slightly faster.
Also, add a unit test that fails without the change introduced by this
commit and revamp old unit tests. The latter is needed because expected
behavior is described in terms of API calls the controller makes, and
this commit introduces new API calls (the live lists).
before this change, an object would be added to `attemptToDelete` queue only if `gc` detected the transition, simply by check if `deletionTimestamp` was set. After the change, it will check if `foregroundDeletion` finalizer has been set before adding the item to the queue.
During a Deployment update there may be more Pods in the scale target
ref status than in the spec. This test verifies that we do not scale
to the status value. Instead we should stay at the spec value.
Fails before #79035 and passes after.