This updates the EndpointSlice controller to make use of the
EndpointSlice tracker to identify when expected changes are not present
in the cache yet. If this is detected, the controller will wait to sync
until all expected updates have been received. This should help avoid
race conditions that would result in duplicate EndpointSlices or failed
attempts to update stale EndpointSlices. To simplify this logic, this
also moves the EndpointSlice tracker from relying on resource versions
to generations.
A CounterVector with status as label may create unnecessary overhead
and using the success case with the empty label value wasn't
easy. It's better to have two seperate counters, one for total number
of calls and one for failed calls.
As discussed during the production readiness review, a metric for the
PVC create operations is useful. The "ephemeral_volume" workqueue
metrics were already added in the initial implementation.
The new code follows the example set by the endpoints controller.
When the feature is disabled either in the scheduler or the CSIDriver,
the scheduler is expected to schedule pods without considering whether
storage capacity is available.
The nodeShouldRunDaemonPod method does not need to return an error
because there are no scenarios under which it fails. Remove the
error return path for its direct calls as well.
In order to maintain the correct invariants, the existing maxUnavailable
logic calculated the same data several times in different ways. Leverage
the simpler structure from maxSurge and calculate pod availability only
once, as well as perform only a single pass over all the pods in the
daemonset. This changed no behavior of the current controller, and
has a structure that is almost identical to maxSurge.
If MaxSurge is set, the controller will attempt to double up nodes
up to the allowed limit with a new pod, and then when the most recent
(by hash) pod is ready, trigger deletion on the old pod. If the old
pod goes unready before the new pod is ready, the old pod is immediately
deleted. If an old pod goes unready before a new pod is placed on that
node, a new pod is immediately added for that node even past the MaxSurge
limit.
The backoff clock is used consistently throughout the daemonset controller
as an injectable clock for the purposes of testing.
It is too easy to omit checking the return value for the
syncAndValidateDaemonSet test in large suites. Switch the method
type to be a test helper and fatal/error directly. Also rename
a method that referenced the old name 'Rollback' instead of
'RollingUpdate'.
This is part of the goal for scheduling to remove dependencies on internal
packages for the scheduling framework. It also provides these functions in an
external location for other components and projects to import.
The goal of this move is related to issue 89930, to break the dependence
of scheduling plugins on internal helpers. This function can easily move to
component-helpers where it will be used by other components as well.
The HPA controller keeps a flat history of recommendations for
stabilization. However when both up and down scale stabilization are
configured, the interpretation of the history changes depending on the
direction of movement. What we want is to keep the stabilized
recommendation within the envelope of the minimum and maximum over
configured stabilization windows. We should only move when the
envelope forces a move.
The range allocator in pkg/controller/nodeipam/ipam/range_allocator.go
may call Occupy() on the same range twice:
1. Just before subscribing to the NodeInformer
2. From a callback given to the NodeInformer soon after registration
Adds unit tests covering the problematic scenarios identified
around conflicting data in child owner references
Before After
package level 51% 68%
garbagecollector.go 60% 75%
graph_builder.go 50% 81%
graph.go 50% 68%
Added/improved coverage of key functions that had lacking unit test coverage:
* attemptToDeleteWorker
* attemptToDeleteItem
* processGraphChanges (added coverage of all added code)
If a cluster-scoped dependent references a namespace-scoped owner,
this is an invalid relationship, and the lookup will never succeed in attemptToDelete.
Short-circuit requeueing in attemptToDelete and log.
When we observe valid coordinates for a previously virtual node,
if there are dependents that do not agree with those coordinates,
add them to the attemptToDelete queue.
This queue will check the dependent's ownerReferences using the coordinates specified by the dependent.
If all of the owners can be verified absent, the dependent will be deleted.
If some are still present, or if there are errors looking them up, the dependent will not be deleted.
If the verified owner is namespaced, and the dependent is not in the same namespace,
an event will be recorded for user visibility, since cross-namespace ownerReferences are not supported.
If a virtual delete event is received for a node whose dependents disagree on the parent's coordinates:
1. propagate the delete to children that matched the verified absent coordinates
2. if the existing node is virtual, select a new set of coordinates from the remaining dependents
3. do not delete the parent node from the graph if the parent node is non-virtual,
or if there are dependents that do not agree with the virtual delete event coordinates
When adding a dependent to the graph, we ensure there is a node representing each owner reference,
and add the dependent to each parent node.
If the parent node already exists, and the dependent's ownerReference
coordinates disagree with the verified coordinates, add the dependent to the attemptToDelete queue.
This queue will check the dependent's ownerReferences using the coordinates specified by the dependent.
If all of the owners can be verified absent, the dependent will be deleted.
If some are still present, or if there are errors looking them up, the dependent will not be deleted.
If the parent node has been observed via informer event (so we know the coordinates are accurate),
and the verified owner is namespaced, and the dependent is not in the same namespace,
an event will be recorded for user visibility, since cross-namespace ownerReferences are not supported.
Virtual nodes are added to the attemptToDelete queue, and continue getting requeued
until they are successfully verified absent or are observed via informer.
In the meantime, if the real object associated with that UID is observed via informer,
or is observed to be deleted via informer, the graph node for that UID can be removed
or marked as observed. In that case, we should stop retrying to get the virtual node coordinates.
If the graph contains a virtual node (because some child object referenced it in an OwnerRef),
and a real informer event is observed for that uid at different coordinates,
we want to fix the coordinates of the node in the graph to match the actual coordinates.
The safe way to do this is to clone the node, replace the identity in the clone,
then replace the node with the clone.
Modifying the identity directly is not safe because it is accessed lock-free from many code paths.
Replacing the node in the graph from processGraphChanges is safe because it is the only graph writer.
Virtual nodes can be added to the GC graph in order to represent objects
which have not been observed via an informer, but are referenced via ownerReferences.
These virtual nodes are requeued into attemptToDelete until they are observed via an informer,
or successfully verified absent via a live lookup. Previously, both of those code paths
called markObserved() to stop requeuing into attemptToDelete.
Because it is useful to know whether a particular node has been observed via
a real informer event, this commit does the following:
* adds a `virtual bool` attribute to graph events so we know which ones came from a real informer
* limits the markObserved() call to the code path where a real informer event is observed
* uses an alternative mechanism to stop requeueing into attemptToDelete when a virtual node is verified absent via a live lookup
Before deleting an object based on absent owners, GC verifies absence of those owners with a live lookup.
The coordinates used to perform that live lookup are the ones specified in the ownerReference of the child.
In order to performantly delete multiple children from the same parent (e.g. 1000 pods from a replicaset),
a 404 response to a lookup is cached in absentOwnerCache.
Previously, the cache was a simple uid set. However, since children can disagree on the coordinates
that should be used to look up a given uid, the cache should record the exact coordinates verified absent.
This is a [apiVersion, kind, namespace, name, uid] tuple.
- Remove feature gate consideration from EndpointSlice validation
- Deprecate topology field, note that it will be removed in future
release
- Update kube-proxy to check for NodeName if feature gate is enabled
- Add comments indicating the feature gates that can be used to enable
alpha API fields
- Add comments explaining use of deprecated address type in tests
* Rename const for topology.../zone
* Rename const for topology.../region
* Rename const for failure-domain.../zone
* Rename const for failure-domain.../region
* Restore old names for compat