Prefer to delete doubled-up pods of a ReplicaSet

When scaling down a ReplicaSet, delete doubled-up replicas first, where a "doubled-up replica" is defined as one that is on the same node as an active replica belonging to a related ReplicaSet. ReplicaSets are considered "related" if they have a common controller (typically a Deployment).

The intention of this change is to make a rolling update of a Deployment scale down the old ReplicaSet as it scales up the new ReplicaSet by deleting pods from the old ReplicaSet that are colocated with ready pods of the new ReplicaSet. This change in the behavior of rolling updates can be combined with pod affinity rules to preserve the locality of a Deployment's pods across a rollout.

A specific scenario that benefits from this change is when a Deployment's pods are exposed by a Service that has type "LoadBalancer" and external traffic policy "Local". In this scenario, the load balancer uses health checks to determine whether it should forward traffic for the Service to a particular node. If the node has no local endpoints for the Service, the health check will fail for that node. Eventually, the load balancer will stop forwarding traffic to that node. In the meantime, the service proxy drops traffic for that Service. Thus, in order to reduce the risk of dropping traffic during a rolling update, it is desirable to preserve the node locality of endpoints.

* pkg/controller/controller_utils.go (ActivePodsWithRanks): New type to sort pods using a given ranking.

* pkg/controller/controller_utils_test.go (TestSortingActivePodsWithRanks): New test for ActivePodsWithRanks.

* pkg/controller/replicaset/replica_set.go (getReplicaSetsWithSameController): New method. Given a ReplicaSet, return all ReplicaSets that have the same owner.
(manageReplicas): Call getIndirectlyRelatedPods, and pass its result to getPodsToDelete.
(getIndirectlyRelatedPods): New method. Given a ReplicaSet, return all pods that are owned by any ReplicaSet with the same owner.
(getPodsToDelete): Add an argument for related pods. Use related pods and the new getPodsRankedByRelatedPodsOnSameNode function to take into account whether a pod is doubled up when sorting pods for deletion.
(getPodsRankedByRelatedPodsOnSameNode): New function. Return an ActivePodsWithRanks value that wraps the given slice of pods and computes ranks where each pod's rank is equal to the number of active related pods that are colocated on the same node.

* pkg/controller/replicaset/replica_set_test.go (newReplicaSet): Set OwnerReferences on the ReplicaSet.
(newPod): Set a unique UID on the pod.
(byName): New type to sort pods by name.
(TestGetReplicaSetsWithSameController): New test for getReplicaSetsWithSameController.
(TestRelatedPodsLookup): New test for getIndirectlyRelatedPods.
(TestGetPodsToDelete): Augment the "various pod phases and conditions, diff = len(pods)" test case to ensure that scale-down still selects doubled-up pods if there are not enough other pods to scale down. Add a "various pod phases and conditions, diff = len(pods), relatedPods empty" test case to verify that getPodsToDelete works even if related pods could not be determined. Add a "ready and colocated with another ready pod vs not colocated, diff < len(pods)" test case to verify that a doubled-up pod gets preferred for deletion. Augment the "various pod phases and conditions, diff < len(pods)" test case to ensure that not-ready pods are preferred over ready but doubled-up pods.

* pkg/controller/replicaset/BUILD: Regenerate.

* test/e2e/apps/deployment.go (testRollingUpdateDeploymentWithLocalTrafficLoadBalancer): New end-to-end test. Create a deployment with a rolling update strategy and affinity rules and a load balancer with "Local" external traffic policy, and verify that the set of nodes with local endpoints for the service remains unchanged during rollouts.
(setAffinity): New helper, used by testRollingUpdateDeploymentWithLocalTrafficLoadBalancer.

* test/e2e/framework/service/jig.go (GetEndpointNodes): Factor building the set of node names out...
(GetEndpointNodeNames): ...into this new method.
2019-07-10 18:56:19 -04:00
parent 865c3c5670
commit 980b6406b2
7 changed files with 668 additions and 32 deletions
--- a/pkg/controller/controller_utils.go
+++ b/pkg/controller/controller_utils.go
@@ -771,6 +771,94 @@ func (s ActivePods) Less(i, j int) bool {
 	return false
 }

+// ActivePodsWithRanks is a sortable list of pods and a list of corresponding
+// ranks which will be considered during sorting.  The two lists must have equal
+// length.  After sorting, the pods will be ordered as follows, applying each
+// rule in turn until one matches:
+//
+// 1. If only one of the pods is assigned to a node, the pod that is not
+//    assigned comes before the pod that is.
+// 2. If the pods' phases differ, a pending pod comes before a pod whose phase
+//    is unknown, and a pod whose phase is unknown comes before a running pod.
+// 3. If exactly one of the pods is ready, the pod that is not ready comes
+//    before the ready pod.
+// 4. If the pods' ranks differ, the pod with greater rank comes before the pod
+//    with lower rank.
+// 5. If both pods are ready but have not been ready for the same amount of
+//    time, the pod that has been ready for a shorter amount of time comes
+//    before the pod that has been ready for longer.
+// 6. If one pod has a container that has restarted more than any container in
+//    the other pod, the pod with the container with more restarts comes
+//    before the other pod.
+// 7. If the pods' creation times differ, the pod that was created more recently
+//    comes before the older pod.
+//
+// If none of these rules matches, the second pod comes before the first pod.
+//
+// The intention of this ordering is to put pods that should be preferred for
+// deletion first in the list.
+type ActivePodsWithRanks struct {
+	// Pods is a list of pods.
+	Pods []*v1.Pod
+
+	// Rank is a ranking of pods.  This ranking is used during sorting when
+	// comparing two pods that are both scheduled, in the same phase, and
+	// having the same ready status.
+	Rank []int
+}
+
+func (s ActivePodsWithRanks) Len() int {
+	return len(s.Pods)
+}
+
+func (s ActivePodsWithRanks) Swap(i, j int) {
+	s.Pods[i], s.Pods[j] = s.Pods[j], s.Pods[i]
+	s.Rank[i], s.Rank[j] = s.Rank[j], s.Rank[i]
+}
+
+// Less compares two pods with corresponding ranks and returns true if the first
+// one should be preferred for deletion.
+func (s ActivePodsWithRanks) Less(i, j int) bool {
+	// 1. Unassigned < assigned
+	// If only one of the pods is unassigned, the unassigned one is smaller
+	if s.Pods[i].Spec.NodeName != s.Pods[j].Spec.NodeName && (len(s.Pods[i].Spec.NodeName) == 0 || len(s.Pods[j].Spec.NodeName) == 0) {
+		return len(s.Pods[i].Spec.NodeName) == 0
+	}
+	// 2. PodPending < PodUnknown < PodRunning
+	m := map[v1.PodPhase]int{v1.PodPending: 0, v1.PodUnknown: 1, v1.PodRunning: 2}
+	if m[s.Pods[i].Status.Phase] != m[s.Pods[j].Status.Phase] {
+		return m[s.Pods[i].Status.Phase] < m[s.Pods[j].Status.Phase]
+	}
+	// 3. Not ready < ready
+	// If only one of the pods is not ready, the not ready one is smaller
+	if podutil.IsPodReady(s.Pods[i]) != podutil.IsPodReady(s.Pods[j]) {
+		return !podutil.IsPodReady(s.Pods[i])
+	}
+	// 4. Doubled up < not doubled up
+	// If one of the two pods is on the same node as one or more additional
+	// ready pods that belong to the same replicaset, whichever pod has more
+	// colocated ready pods is less
+	if s.Rank[i] != s.Rank[j] {
+		return s.Rank[i] > s.Rank[j]
+	}
+	// TODO: take availability into account when we push minReadySeconds information from deployment into pods,
+	//       see https://github.com/kubernetes/kubernetes/issues/22065
+	// 5. Been ready for empty time < less time < more time
+	// If both pods are ready, the latest ready one is smaller
+	if podutil.IsPodReady(s.Pods[i]) && podutil.IsPodReady(s.Pods[j]) && !podReadyTime(s.Pods[i]).Equal(podReadyTime(s.Pods[j])) {
+		return afterOrZero(podReadyTime(s.Pods[i]), podReadyTime(s.Pods[j]))
+	}
+	// 6. Pods with containers with higher restart counts < lower restart counts
+	if maxContainerRestarts(s.Pods[i]) != maxContainerRestarts(s.Pods[j]) {
+		return maxContainerRestarts(s.Pods[i]) > maxContainerRestarts(s.Pods[j])
+	}
+	// 7. Empty creation time pods < newer pods < older pods
+	if !s.Pods[i].CreationTimestamp.Equal(&s.Pods[j].CreationTimestamp) {
+		return afterOrZero(&s.Pods[i].CreationTimestamp, &s.Pods[j].CreationTimestamp)
+	}
+	return false
+}
+
 // afterOrZero checks if time t1 is after time t2; if one of them
 // is zero, the zero time is seen as after non-zero time.
 func afterOrZero(t1, t2 *metav1.Time) bool {