The test was flaky because it required the job succeeds 3 times with
pseudorandom 50% failure chance within 15 minutes, while there is an
exponential back-off delay (10s, 20s, 40s …) capped at 6 minutes before
recreating failed pods. As 7 consecutive failures (1/128 chance) could
take 20+ minutes, exceeding the timeout, the test failed intermittently
because of "timed out waiting for the condition".
This PR forces the Pods of a Job to be scheduled to a single node and
uses a hostPath volume instead of an emptyDir to persist data across new
Pods.
The redis version has been bumped to version 5.0.5, but the maximum version supported on
Windows is 3.2. This can lead to failing tests, the output and behaviour can be different
(see #80516). In order to prevent such failures, the amount of times the Redis image is
used can be reduced.
This commit uses the previously added agnhost guestbook subcommand as a replacement for the
Guestbook application created by the test "should create and stop a working application".
Adds AgnhostPrivate to test/utils/image/manifest. Some tests are trying to pull
the agnhost image from the private registry, meaning that we would need to
always build and push the agnhost image to both e2e and private registry
whenever we bump its version. Decoupling them would mean that we only need
to push the image to the e2e registry.
This PR moves functions from test/e2e/framework.util.go for making e2e
core framework small and simple:
- RestartKubeProxy: Moved to e2e network package
- CheckConnectivityToHost: Moved to e2e network package
- RemoveAvoidPodsOffNode: Move to e2e scheduling package
- AddOrUpdateAvoidPodOnNode: Move to e2e scheduling package
- UpdateDaemonSetWithRetries: Move to e2e apps package
- CheckForControllerManagerHealthy: Moved to e2e storage package
- ParseKVLines: Removed because of e9345ae5f0
- AddOrUpdateLabelOnNodeAndReturnOldValue: Removed because of ff7b07c43c
Tests should provide more verbose output when their core functionality
fails; in this case we were only logging that the job count did not
match but this is not sufficient for understanding a failure.
Many TestJig methods made the caller pass a serviceName argument, even
though the jig already has a name, and every caller was passing the
same name to each function as they had passed to NewTestJig().
Likewise, many methods made the caller pass a namespace argument, but
only a single test used more than one namespace, and it can easily be
rewritten to use two test jigs as well.
When scaling down a ReplicaSet, delete doubled up replicas first, where a
"doubled up replica" is defined as one that is on the same node as an
active replica belonging to a related ReplicaSet. ReplicaSets are
considered "related" if they have a common controller (typically a
Deployment).
The intention of this change is to make a rolling update of a Deployment
scale down the old ReplicaSet as it scales up the new ReplicaSet by
deleting pods from the old ReplicaSet that are colocated with ready pods of
the new ReplicaSet. This change in the behavior of rolling updates can be
combined with pod affinity rules to preserve the locality of a Deployment's
pods over rollout.
A specific scenario that benefits from this change is when a Deployment's
pods are exposed by a Service that has type "LoadBalancer" and external
traffic policy "Local". In this scenario, the load balancer uses health
checks to determine whether it should forward traffic for the Service to a
particular node. If the node has no local endpoints for the Service, the
health check will fail for that node. Eventually, the load balancer will
stop forwarding traffic to that node. In the meantime, the service proxy
drops traffic for that Service. Thus, in order to reduce risk of dropping
traffic during a rolling update, it is desirable preserve node locality of
endpoints.
* pkg/controller/controller_utils.go (ActivePodsWithRanks): New type to
sort pods using a given ranking.
* pkg/controller/controller_utils_test.go (TestSortingActivePodsWithRanks):
New test for ActivePodsWithRanks.
* pkg/controller/replicaset/replica_set.go
(getReplicaSetsWithSameController): New method. Given a ReplicaSet, return
all ReplicaSets that have the same owner.
(manageReplicas): Call getIndirectlyRelatedPods, and pass its result to
getPodsToDelete.
(getIndirectlyRelatedPods): New method. Given a ReplicaSet, return all
pods that are owned by any ReplicaSet with the same owner.
(getPodsToDelete): Add an argument for related pods. Use related pods and
the new getPodsRankedByRelatedPodsOnSameNode function to take into account
whether a pod is doubled up when sorting pods for deletion.
(getPodsRankedByRelatedPodsOnSameNode): New function. Return an
ActivePodsWithRanks value that wraps the given slice of pods and computes
ranks where each pod's rank is equal to the number of active related pods
that are colocated on the same node.
* pkg/controller/replicaset/replica_set_test.go (newReplicaSet): Set
OwnerReferences on the ReplicaSet.
(newPod): Set a unique UID on the pod.
(byName): New type to sort pods by name.
(TestGetReplicaSetsWithSameController): New test for
getReplicaSetsWithSameController.
(TestRelatedPodsLookup): New test for getIndirectlyRelatedPods.
(TestGetPodsToDelete): Augment the "various pod phases and conditions, diff
= len(pods)" test case to ensure that scale-down still selects doubled-up
pods if there are not enough other pods to scale down. Add a "various pod
phases and conditions, diff = len(pods), relatedPods empty" test case to
verify that getPodsToDelete works even if related pods could not be
determined. Add a "ready and colocated with another ready pod vs not
colocated, diff < len(pods)" test case to verify that a doubled-up pod gets
preferred for deletion. Augment the "various pod phases and conditions,
diff < len(pods)" test case to ensure that not-ready pods are preferred
over ready but doubled-up pods.
* pkg/controller/replicaset/BUILD: Regenerate.
* test/e2e/apps/deployment.go
(testRollingUpdateDeploymentWithLocalTrafficLoadBalancer): New end-to-end
test. Create a deployment with a rolling update strategy and affinity
rules and a load balancer with "Local" external traffic policy, and verify
that set of nodes with local endponts for the service remains unchanged
during rollouts.
(setAffinity): New helper, used by
testRollingUpdateDeploymentWithLocalTrafficLoadBalancer.
* test/e2e/framework/service/jig.go (GetEndpointNodes): Factor building the
set of node names out...
(GetEndpointNodeNames): ...into this new method.