This is part of the effort to remove the scheduling framework's dependencies on internal
packages. It also provides these functions in an external location for other components
and projects to import.
This move is related to issue 89930: breaking the dependence of scheduling plugins on
internal helpers. This function can easily move to component-helpers, where it will be
used by other components as well.
The HPA controller keeps a flat history of recommendations for
stabilization. However, when both scale-up and scale-down stabilization are
configured, the interpretation of the history changes depending on the
direction of movement. What we want is to keep the stabilized
recommendation within the envelope of the minimum and maximum over the
configured stabilization windows, and only move when the
envelope forces a move.
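In sketch form, the envelope amounts to something like this (names are illustrative, not the actual HPA controller code):

```go
// Hedged sketch of the envelope logic described above.
upRecommendation, downRecommendation := desiredReplicas, desiredReplicas
for _, rec := range recommendations {
	if rec.timestamp.After(now.Add(-scaleUpStabilizationWindow)) && rec.replicas < upRecommendation {
		upRecommendation = rec.replicas // scale-up is limited by the smallest recent recommendation
	}
	if rec.timestamp.After(now.Add(-scaleDownStabilizationWindow)) && rec.replicas > downRecommendation {
		downRecommendation = rec.replicas // scale-down is limited by the largest recent recommendation
	}
}
// Keep the current replica count unless the [upRecommendation, downRecommendation]
// envelope forces a move.
stabilized := currentReplicas
if stabilized < upRecommendation {
	stabilized = upRecommendation
}
if stabilized > downRecommendation {
	stabilized = downRecommendation
}
```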
The range allocator in pkg/controller/nodeipam/ipam/range_allocator.go
may call Occupy() on the same range twice:
1. Just before subscribing to the NodeInformer
2. From a callback given to the NodeInformer soon after registration
Adds unit tests covering the problematic scenarios identified
around conflicting data in child owner references
                     Before  After
package level         51%    68%
garbagecollector.go   60%    75%
graph_builder.go      50%    81%
graph.go              50%    68%
Added/improved coverage of key functions that lacked unit test coverage:
* attemptToDeleteWorker
* attemptToDeleteItem
* processGraphChanges (added coverage of all added code)
If a cluster-scoped dependent references a namespace-scoped owner,
this is an invalid relationship, and the lookup will never succeed in attemptToDelete.
Short-circuit requeueing in attemptToDelete and log.
When we observe valid coordinates for a previously virtual node,
if there are dependents that do not agree with those coordinates,
add them to the attemptToDelete queue.
This queue will check the dependent's ownerReferences using the coordinates specified by the dependent.
If all of the owners can be verified absent, the dependent will be deleted.
If some are still present, or if there are errors looking them up, the dependent will not be deleted.
If the verified owner is namespaced, and the dependent is not in the same namespace,
an event will be recorded for user visibility, since cross-namespace ownerReferences are not supported.
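Roughly, the verification step looks like this (gc.lookupOwner and the slice names are illustrative, not the real garbage collector code):

```go
// Verify each owner at the coordinates given in the dependent's ownerReferences.
present, absent := []metav1.OwnerReference{}, []metav1.OwnerReference{}
for _, ref := range dependent.OwnerReferences {
	_, err := gc.lookupOwner(ref) // live lookup at the coordinates the dependent specified
	switch {
	case err == nil:
		present = append(present, ref) // owner still exists: keep the dependent
	case apierrors.IsNotFound(err):
		absent = append(absent, ref)
	default:
		return err // lookup error: do not delete, requeue and retry
	}
}
if len(present) == 0 && len(absent) == len(dependent.OwnerReferences) {
	// all owners verified absent: delete the dependent
}
```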
If a virtual delete event is received for a node whose dependents disagree on the parent's coordinates:
1. propagate the delete to children that matched the verified absent coordinates
2. if the existing node is virtual, select a new set of coordinates from the remaining dependents
3. do not delete the parent node from the graph if the parent node is non-virtual,
or if there are dependents that do not agree with the virtual delete event coordinates
When adding a dependent to the graph, we ensure there is a node representing each owner reference,
and add the dependent to each parent node.
If the parent node already exists, and the dependent's ownerReference
coordinates disagree with the verified coordinates, add the dependent to the attemptToDelete queue.
This queue will check the dependent's ownerReferences using the coordinates specified by the dependent.
If all of the owners can be verified absent, the dependent will be deleted.
If some are still present, or if there are errors looking them up, the dependent will not be deleted.
If the parent node has been observed via informer event (so we know the coordinates are accurate),
and the verified owner is namespaced, and the dependent is not in the same namespace,
an event will be recorded for user visibility, since cross-namespace ownerReferences are not supported.
Virtual nodes are added to the attemptToDelete queue, and continue getting requeued
until they are successfully verified absent or are observed via informer.
In the meantime, if the real object associated with that UID is observed via informer,
or is observed to be deleted via informer, the graph node for that UID can be removed
or marked as observed. In that case, we should stop retrying to get the virtual node coordinates.
If the graph contains a virtual node (because some child object referenced it in an OwnerRef),
and a real informer event is observed for that uid at different coordinates,
we want to fix the coordinates of the node in the graph to match the actual coordinates.
The safe way to do this is to clone the node, replace the identity in the clone,
then replace the node with the clone.
Modifying the identity directly is not safe because it is accessed lock-free from many code paths.
Replacing the node in the graph from processGraphChanges is safe because it is the only graph writer.
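A condensed sketch of that pattern (names approximate the graph builder, but are illustrative):

```go
// processGraphChanges is the only graph writer, so swapping the node is safe;
// the identity on the existing node is never mutated in place.
clone := existingNode.Clone()      // copy the node, including its dependents
clone.identity = observedIdentity  // correct the coordinates on the copy only
gb.uidToNode.Write(clone)          // replace the node stored for this UID
// Readers holding the old pointer still see a consistent identity;
// new readers see the corrected one.
```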
Virtual nodes can be added to the GC graph in order to represent objects
which have not been observed via an informer, but are referenced via ownerReferences.
These virtual nodes are requeued into attemptToDelete until they are observed via an informer,
or successfully verified absent via a live lookup. Previously, both of those code paths
called markObserved() to stop requeuing into attemptToDelete.
Because it is useful to know whether a particular node has been observed via
a real informer event, this commit does the following:
* adds a `virtual bool` attribute to graph events so we know which ones came from a real informer
* limits the markObserved() call to the code path where a real informer event is observed
* uses an alternative mechanism to stop requeueing into attemptToDelete when a virtual node is verified absent via a live lookup
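For illustration, the graph event might carry the flag roughly like this (field names are an assumption based on the description above):

```go
// Hypothetical shape of a graph event with the new flag.
type event struct {
	eventType eventType
	obj       interface{}
	oldObj    interface{}
	gvk       schema.GroupVersionKind
	virtual   bool // true if synthesized, false if it came from a real informer event
}
```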
Before deleting an object based on absent owners, GC verifies absence of those owners with a live lookup.
The coordinates used to perform that live lookup are the ones specified in the ownerReference of the child.
In order to performantly delete multiple children from the same parent (e.g. 1000 pods from a replicaset),
a 404 response to a lookup is cached in absentOwnerCache.
Previously, the cache was a simple uid set. However, since children can disagree on the coordinates
that should be used to look up a given uid, the cache should record the exact coordinates verified absent.
This is a [apiVersion, kind, namespace, name, uid] tuple.
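Sketch of the richer cache entry, as a hypothetical type capturing that tuple:

```go
// The cache records exactly which coordinates were verified absent,
// not just the UID.
type objectReference struct {
	metav1.OwnerReference        // apiVersion, kind, name, uid
	Namespace             string // namespace the lookup was performed in
}
// absentOwnerCache then stores objectReference values instead of bare UIDs.
```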
- Remove feature gate consideration from EndpointSlice validation
- Deprecate topology field, note that it will be removed in future
release
- Update kube-proxy to check for NodeName if feature gate is enabled
- Add comments indicating the feature gates that can be used to enable
alpha API fields
- Add comments explaining use of deprecated address type in tests
* Rename const for topology.../zone
* Rename const for topology.../region
* Rename const for failure-domain.../zone
* Rename const for failure-domain.../region
* Restore old names for compat
The main goal was to cover retrieval of a PVC from the apiserver when
it isn't known yet. This is achieved by adding PVCs and (for the sake
of completeness) PVs to the reactor, but not the controller, when a
special annotation is set. The approach with a special annotation was
chosen because it doesn't affect other tests.
The other test cases were added while checking the existing tests
because (at least at first glance) the situations seemed not to be
covered.
Normally, the PV controller knows about the PVC that triggers the
creation of a PV before it sees the PV, because the PV controller must
set the volume.beta.kubernetes.io/storage-provisioner annotation that
tells an external provisioner to create the PV.
When restarting, the PV controller first syncs its caches, so that
case is also covered.
However, the creator of a PVC might decide to set that annotation
itself to speed up volume creation. While unusual, it's not forbidden
and thus part of the external Kubernetes API. Whether it makes sense
depends on the intentions of the user.
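For example, a PVC creator might pre-set the annotation along these lines (the provisioner name here is just an example):

```go
// Illustrative PVC whose creator pre-sets the provisioner annotation.
pvc := &v1.PersistentVolumeClaim{
	ObjectMeta: metav1.ObjectMeta{
		Name:      "data",
		Namespace: "default",
		Annotations: map[string]string{
			"volume.beta.kubernetes.io/storage-provisioner": "hostpath.csi.k8s.io",
		},
	},
}
```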
When that is done and there is heavy load, an external provisioner
might see the PVC and create a PV before the PV controller sees the
PVC. If the PV controller then encounters the PV before the PVC, it
incorrectly concludes that the PV needs to be deleted instead of being
bound.
The same issue occurred earlier for external binding. The existing
code for looking up a PVC in the cache or in the apiserver also solves the
issue for volume provisioning; it just needs to be enabled as well
for PVs without the pv.kubernetes.io/bound-by-controller annotation.
* api: structure change
* api: defaulting, conversion, and validation
* [FIX] validation: auto remove second ip/family when service changes to SingleStack
* [FIX] api: defaulting, conversion, and validation
* api-server: clusterIPs alloc, printers, storage and strategy
* [FIX] clusterIPs default on read
* alloc: auto remove second ip/family when service changes to SingleStack
* api-server: repair loop handling for clusterIPs
* api-server: force kubernetes default service into single stack
* api-server: tie dualstack feature flag with endpoint feature flag
* controller-manager: feature flag, endpoint, and endpointSlice controllers handling multi family service
* [FIX] controller-manager: feature flag, endpoint, and endpointSlice controllers handling multi family service
* kube-proxy: feature-flag, utils, proxier, and meta proxier
* [FIX] kubeproxy: call both proxiers at the same time
* kubenet: remove forced pod IP sorting
* kubectl: modify describe to include ClusterIPs, IPFamilies, and IPFamilyPolicy
* e2e: fix tests that depend on IPFamily field AND add dual stack tests
* e2e: fix expected error message for ClusterIP immutability
* add integration tests for dualstack
The third phase of dual stack is a very complex change to the API;
basically, it introduces dual-stack Services. Main changes are:
- It pluralizes the Service IPFamily field to IPFamilies,
and removes the singular field.
- It introduces a new field IPFamilyPolicyType that can take
3 values to express the "dual-stack(mad)ness" of the cluster:
SingleStack, PreferDualStack and RequireDualStack
- It pluralizes ClusterIP to ClusterIPs.
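Roughly, the affected ServiceSpec fields end up looking like this (trimmed and approximate; see the real API types for the authoritative definitions):

```go
// Approximate view of the dual-stack Service fields.
type ServiceSpec struct {
	ClusterIP      string              // first entry of ClusterIPs, kept for compatibility
	ClusterIPs     []string            // pluralized ClusterIP
	IPFamilies     []IPFamily          // pluralized IPFamily
	IPFamilyPolicy *IPFamilyPolicyType // SingleStack, PreferDualStack, or RequireDualStack
	// ... other fields unchanged
}
```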
The goal is to add coverage to the services API operations,
taking into account the 6 different modes a cluster can have:
- single stack: IPv4 or IPv6 (as of today)
- dual stack: IPv4 only, IPv6 only, IPv4 - IPv6, IPv6 - IPv4
* [FIX] add integration tests for dualstack
* generated data
* generated files
Co-authored-by: Antonio Ojea <aojea@redhat.com>
When a pod is deleted, it is given a deletion timestamp. However the
pod might still run for some time during graceful shutdown. During
this time it might still produce CPU utilization metrics and be in a
Running phase.
Currently the HPA replica calculator attempts to ignore deleted pods
by skipping over them. However, because they are not added to the
ignoredPods set, their metrics are not removed from the average
utilization calculation. This allows pods in the process of shutting
down to drag down the recommended number of replicas by producing
near-0% utilization metrics.
In fact, the ignoredPods set is a misnomer. Those pods are not fully
ignored. When the replica calculator recommends scaling up, 0%
utilization metrics are filled in for those pods to limit the scale
up. This prevents overscaling when pods take some time to start up. In
fact, there should be 4 sets considered (readyPods, unreadyPods,
missingPods, ignoredPods), not just 3.
This change renames ignoredPods to unreadyPods and keeps the scale-up
limiting semantics. Another set, ignoredPods (which actually are
ignored), is added; deleted pods are added to it instead of being
skipped during grouping. Both ignoredPods and unreadyPods have their
metrics removed from consideration, but only unreadyPods have 0%
utilization metrics filled in upon scale-up.
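A simplified sketch of the resulting grouping, assuming sets.String sets and a hypothetical podIsReady helper:

```go
// Hedged sketch of the four-way grouping described above.
for _, pod := range pods {
	if pod.DeletionTimestamp != nil || pod.Status.Phase == v1.PodFailed {
		ignoredPods.Insert(pod.Name) // dropped from the utilization average entirely
		continue
	}
	if _, found := metrics[pod.Name]; !found {
		missingPods.Insert(pod.Name) // no metrics reported for this pod
		continue
	}
	if !podIsReady(pod) { // hypothetical readiness/startup check
		unreadyPods.Insert(pod.Name) // metrics dropped; 0% filled in on scale-up
		continue
	}
	readyPods.Insert(pod.Name)
}
```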
Also mark reason for lint errors in:
pkg/controller/endpoint/config/v1alpha1,
pkg/controller/endpointslice/config/v1alpha1
pkg/controller/endpointslicemirroring/config/v1alpha1
fixed syntax, wrote a test
fixed a test
Update staging/src/k8s.io/apimachinery/pkg/util/intstr/intstr_test.go
Co-Authored-By: Joel Speed <Joel.speed@hotmail.co.uk>
added test, fixed tests, gofmt and lint fixes, renamed a function, fixed validation, added godocs
Implement, in the endpoint slice controller, the same logic
used for labels in the legacy endpoints controller.
The labels in the endpoint and in the parent must be equivalent.
Headless services add the well-known IsHeadlessService label.
Slices must have two well known labels: LabelServiceName and
LabelManagedBy.
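A rough sketch of that label handling (isHeadless and controllerName are illustrative):

```go
// Copy the Service's labels, then set the well-known slice labels.
labels := map[string]string{}
for k, v := range service.Labels {
	labels[k] = v
}
if isHeadless(service) {
	labels[v1.IsHeadlessService] = "" // well-known headless marker
}
labels[discovery.LabelServiceName] = service.Name
labels[discovery.LabelManagedBy] = controllerName
slice.Labels = labels
```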
The provided DialContext wraps existing clients' DialContext in an attempt to
preserve any existing timeout configuration. In some cases, we may replace
infinite timeouts with golang defaults.
- scaleio: tcp connect/keepalive values changed from 0/15 to 30/30
- storageos: no change
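The wrapping is roughly this (illustrative, not the exact client code):

```go
// Keep the client's own DialContext if it has one; otherwise fall back
// to a dialer with finite defaults.
if transport.DialContext == nil {
	dialer := &net.Dialer{
		Timeout:   30 * time.Second, // scaleio: connect timeout was 0 (infinite)
		KeepAlive: 30 * time.Second, // scaleio: keepalive was 15s
	}
	transport.DialContext = dialer.DialContext
}
```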
This fixes a bug that occurred when a Service was rapidly recreated.
This relied on an unfortunate series of events:
1. When the Service is deleted, the EndpointSlice controller removes it
from the EndpointSliceTracker along with any associated EndpointSlices.
2. When the Service is recreated, the EndpointSlice controller sees that
there are still appropriate EndpointSlices for the Service and does
nothing. (They have not yet been garbage collected).
3. When the EndpointSlice is deleted, the EndpointSlice controller
checks with the EndpointSliceTracker to see if it thinks we should have
this EndpointSlice. This check was intended to ensure we wouldn't
requeue a Service every time we delete an EndpointSlice for it.
This adds a check in reconciler to ensure that EndpointSlices it is
working with are owned by a Service with a matching UID. If not, it will
mark those EndpointSlices for deletion (assuming they're about to be
garbage collected anyway) and create new EndpointSlices.
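Roughly, the ownership check looks like this (variable names are illustrative):

```go
// Keep only slices whose controller owner matches the current Service UID;
// stale slices from a previous Service incarnation are marked for deletion.
for _, slice := range existingSlices {
	if owner := metav1.GetControllerOf(slice); owner == nil || owner.UID != service.UID {
		slicesToDelete = append(slicesToDelete, slice)
		continue
	}
	ownedSlices = append(ownedSlices, slice)
}
```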
A pod with a PVC will not be scheduled if the PVC is being deleted.
This can happen when the PVC has finalizers of storage plugins.
Such a pod becomes pending. Unfortunately, after the finalizer
finishes and the PVC is deleted, the pod remains pending forever.
The StatefulSet controller does nothing for this pending pod.
This commit prevents the StatefulSet controller from creating
such pods when the PVC is about to be deleted.
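A hedged sketch of the guard (helper and lister names are illustrative):

```go
// Before creating a missing pod, check whether any of its PVCs is being deleted.
for _, claim := range getPersistentVolumeClaims(set, pod) { // hypothetical helper
	pvc, err := pvcLister.PersistentVolumeClaims(claim.Namespace).Get(claim.Name)
	if apierrors.IsNotFound(err) {
		continue // the PVC does not exist yet and will be created normally
	}
	if err != nil {
		return err
	}
	if pvc.DeletionTimestamp != nil {
		return fmt.Errorf("pvc %s is being deleted; not creating pod %s", claim.Name, pod.Name)
	}
}
```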
Previously the controllers would proceed with additional creates,
updates, or deletes if one failed. That could potentially result in
scenarios where an EndpointSlice create or update failed while a delete
succeeded. This updates the logic so that removals will not happen if
additions fail.
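In sketch form (function names are illustrative):

```go
// Apply creates and updates first; if any fail, skip deletions so we never
// remove working EndpointSlices while failing to add their replacements.
if err := createOrUpdateSlices(slicesToCreate, slicesToUpdate); err != nil {
	return err
}
return deleteSlices(slicesToDelete)
```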
The KEP specifies that the controller will "mirror all labels from the
Endpoints resource and all endpoints and ports from the corresponding subset".
I'd missed that in my initial implementation; this should fix it.
This mirrors an earlier fix to the EndpointSlice controller. I'll make a
follow up PR to move this component to a shared package, but that seems
beyond the scope of a bug fix PR.
EndpointController was accidentally requiring all headless services to
be IPv4-only in clusters with IPv6DualStack enabled.
This still leaves "legacy" (ie, IPFamily-less) headless services as
always IPv4-only because the controller doesn't currently have easy
access to the information that would allow it to fix that.
(EndpointSliceController had the same problem already, and still
does.) This can be fixed, if needed, by manually setting IPFamily,
and the proposed API for 1.20 will handle this situation better.
Rewrite some of the test helpers to better support single-stack IPv4
vs single-stack IPv6 vs dual-stack IPv4 primary vs dual-stack IPv6
primary, and update TestPodToEndpointAddressForService to test some
more cases.
The endpoint controllers responded to Pod changes by trying to figure
out if the generated endpoint resource would change, rather than just
checking if the Pod had changed, but since the set of Pod fields that
need to be checked depend on the Service and Node as well, the code
ended up only checking for a subset of the changes it should have.
In particular, EndpointSliceController ended up only looking at IPv4
Pod IPs when processing Pod update events, so when a Pod went from
having no IP to having only an IPv6 IP, EndpointSliceController would
think it hadn't changed.
This was introduced by commit: f04ce3cfba
Since this func is simple and clear enough, it no longer needs a comment.
Signed-off-by: Zhou Peng <p@ctriple.cn>
The implementation consists of
- identifying all places where VolumeSource.PersistentVolumeClaim has
a special meaning and then ensuring that the same code path is taken
for an ephemeral volume, with the ownership check
- adding a controller that produces the PVCs for each embedded
VolumeSource.EphemeralVolume
- relaxing the PVC protection controller so that it removes
the finalizer even before the pod is deleted (only
if the GenericEphemeralVolume feature is enabled): this is
needed to break a cycle where foreground deletion of the pod
blocks on removing the PVC, which waits for deletion of the pod
The controller was derived from the endpointslices controller.
endpointSliceTracker creates a set of resource versions for each
service. The resource versions in the set can be deleted when
endpointslices are deleted, but the set and its key in the map are never
deleted, leading to a memory leak.
This patch deletes the set if the service is deleted, and stops
initializing an empty set when "read-only" methods "Has" and "Stale" are
called.
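The cleanup amounts to something like this (illustrative, not the exact tracker code):

```go
// Remove the whole resource-version set when its Service goes away.
func (t *endpointSliceTracker) DeleteService(namespace, name string) {
	t.lock.Lock()
	defer t.lock.Unlock()
	delete(t.resourceVersionsByService, types.NamespacedName{Namespace: namespace, Name: name})
}
```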
And give ownership to pkg/scheduler/framework/plugins/volumebinding
Signed-off-by: Aldo Culquicondor <acondor@google.com>
Change-Id: I4bd89b1745a2be0e458601056ab905bdd6692195
This uses the information provided by a CSI driver deployment for
checking whether a node has access to enough storage to create the
currently unbound volumes, if the CSI driver opts into that checking
with CSIDriver.Spec.VolumeCapacity != false.
This resolves a TODO from commit 95b530366a.
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
The disruption controller is resyncing all ssets every 30 seconds; this is not necessary, makes the disruption workqueue deeper, and can delay processing of actual updates when large numbers of disruptions exist.
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Disruption controllers no longer force a resync every 30 seconds when nothing has changed.
refactor and add the following metrics to the cidr_sets used by the range
allocator, under the node_ipam_controller subsystem:
cidrset_cidrs_allocations_total
cidrset_cidrs_releases_total
cidrset_usage_cidrs
cidrset_allocation_tries_per_request
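For illustration, one of these counters might be declared like so using component-base metrics (the label set is an assumption):

```go
// Hedged sketch of one counter from the list above.
cidrsAllocationsTotal := metrics.NewCounterVec(
	&metrics.CounterOpts{
		Subsystem: "node_ipam_controller",
		Name:      "cidrset_cidrs_allocations_total",
		Help:      "Counter measuring total number of CIDR allocations.",
	},
	[]string{"clusterCIDR"},
)
legacyregistry.MustRegister(cidrsAllocationsTotal)
```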
Since the parameter 'podMap' has been removed by commit 831a2d1129
("deployment: remove unused parameter 'podMap'"), the related annotation
also should be removed.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
specifically:
- cmd/kubeadm/.import-restrictions
- we don't need to explicitly allow k8s.io repos (external or published)
- rm pkg/controller/.import-restrictions
- pkg/client/unversioned was removed in 59042
- pkg/kubectl/.import-restrictions
- pkg/printers is no longer used
- pkg/api was masking all of the pkg/apis prefixes
- rm staging/src/k8s.io/code-generator/cmd/lister-gen/.import-restrictions
- noop / empty file
- test/e2e/framework/.import-restrictions
- we don't need to explicitly allow k8s.io repos (external or published)
YAML has comments, so we can explain why we have certain rules or
certain prefixes.
For those files that weren't already commented YAML, I converted them to
YAML and took a best guess at comments based on the PRs that introduced
or updated them.
This is to avoid unnecessary GCE API calls made by the getInstanceByName
helper, which iterates over all zones to find the one in which the
VM exists.
ProviderID already contains all the information - it's in the form:
gce://<VM URL> (the VM URL contains the project, zone, and VM name).
ProviderID is propagated by Kubelet on node registration and, in case
of bugs, backfilled by the node controller.
- PV has a dangling reference to a PVC
- PVC is trying to bind to a PV that already references a different PVC
Change-Id: Ic509d39808763149b02b4dd52347edb74a8803fd
* move well-known kubelet cloud provider annotations to k8s.io/cloud-provider
Signed-off-by: andrewsykim <kim.andrewsy@gmail.com>
* cloud provider: rename AnnotationProvidedIPAddr to AnnotationAlphaProvidedIPAddr to indicate alpha status
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
This makes debugging individual test cases much easier. For example:
go test -p 1 ./pkg/controller/cronjob -run TestSyncOne_Status/prev_ran_but_done,_is_time,_past_deadline -v
The EndpointSlice controller has the potential to manage a large number of resources that are updated frequently. Without proper backoffs in place, it could unnecessarily overload the API Server with requests. This makes two significant changes: increasing the base backoff from 5ms to 1s, and delaying all syncs triggered by EndpointSlice changes by at least 1 second to enable batching.
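In sketch form, the queue's base backoff becomes 1 second (the max delay shown is an assumption):

```go
// Rate-limited work queue with a 1s base backoff instead of the 5ms default.
queue := workqueue.NewNamedRateLimitingQueue(
	workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, 100*time.Second),
	"endpoint_slice",
)
```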
Most of these could have been refactored automatically, but it would
have been uglier: the unsophisticated tooling left lots of unnecessary
struct -> pointer -> struct transitions.
This is gross, but because NewDeleteOptions is used by various parts of
storage that still pass around pointers, the return type can't be
changed without significant refactoring within the apiserver. I think
this would be good to clean up, but I want to minimize apiserver-side
changes as much as possible in the client signature refactor.
The scheduler doesn't really need to know in detail which reasons
rendered a node unusable for a pod. All it needs from the volume
binder is a list of reasons that it can then present to the user.
This seems a bit cleaner. But the main reason for the change is that
it simplifies the checking of CSI inline volumes and perhaps later
capacity checking. Both will lead to new failure reasons, which then
can be added without changing the interface.
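For illustration, the binder can expose a plain list of reason strings along these lines (type and constant names are an assumption):

```go
// Hedged sketch of a reasons-only result type.
type ConflictReason string

type ConflictReasons []ConflictReason

const (
	ErrReasonBindConflict ConflictReason = "node(s) didn't find available persistent volumes to bind"
	ErrReasonNodeConflict ConflictReason = "node(s) had volume node affinity conflict"
)
```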
The StatefulSet controller cleans up ControllerRevisions at the end of
the reconcile loop. If something goes wrong during reconcile, it bails
out without actually performing this step. This commit moves the cleanup
to a deferred function call to guarantee it will be executed.
Fixes issue: https://github.com/kubernetes/kubernetes/issues/85690
During EndpointSlice reconciliation, EndpointSliceTracker is supposed to
track expected EndpointSlice resource versions so that external changes
to them can be detected. But it actually tracked the stale resource
version, which resulted in every Service being handled twice: each Service
always received an EndpointSlice update with a different resource version,
even though that slice had actually been created/updated by the controller
itself during the first processing.
Using t.Run() has some advantages:
- individual sub-tests can be selected with -run
- test failures are automatically associated with
the scenario, therefore the test name no longer
needs to be passed around
A "run" function is used primarily to avoid large indention changes,
but also because this can be used to extend the test with additional
parameters later on.
Add logic in service_controller to skip create/update
if a finalizer from a different controller is found.
The newly added finalizer will be checked by other controllers
implementing ILB services to determine if a given service is
already being managed by service_controller.
Moved the finalizer check into cloudprovider code.
Added a unit test to verify the new finalizer.
Modified an existing unit test to create a fake service so that the
attach/remove finalizer step can be tested.
A transient issue might occur that causes an error to be returned by
InstanceID(). When this is ignored, the external cloud provider taint
will be removed and neither AddCloudNode() nor UpdateCloudNode() will
try to set a provider ID in the future.
By returning the error we can ensure that the external cloud provider
taint is not removed prematurely, allowing the operation to be retried
(until the provider ID can be set).
Preserve support for external cloud providers that do not use IDs by
continuing if a NotImplemented error is returned, making a distinction
between lack of support for provider IDs and an actual error.
Introduce pair of unit tests that show a provider ID will eventually
be set if an error is returned, unless that error is a NotImplemented,
in which case the external cloud provider taint will be removed.
The `recorder.PastEventf` method wasn't actually working as advertised.
It was supposed to accept a timestamp, which would be used when
generating the event. However, as the
[source code](https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/record/event.go#L316)
shows, this `timestamp` was never actually used.
In other words, `PastEventf` is identical to `Eventf`.
We have two options: one would be to fix `PastEventf` so that it works
as advertised. The other would be to delete `PastEventf` and only
support `Eventf`.
Ultimately, I could only find one use of `PastEventf` in the code base,
so I propose we just delete `PastEventf` and convert all uses to
`Eventf`.
This adds a new EndpointSlice tracker to keep track of the expected resource versions of EndpointSlices associated with each Service managed by the EndpointSlice controller. This should prevent a potential race where a syncService call could happen with an incomplete view of EndpointSlices if additions or deletions hadn't fully propagated to the cache yet. Additionally, this ensures that external changes to EndpointSlices will be handled by the EndpointSlice controller.
This ended up causing far more problems than it was worth, especially
given that it just attempted to provide backwards compatibility with
the alpha release.