The KEP specifies that the controller will "mirror all labels from the
Endpoints resource and all endpoints and ports from the corresponding subset".
I'd missed that in my initial implementation; this commit fixes that.
This mirrors an earlier fix to the EndpointSlice controller. I'll make a
follow-up PR to move this logic into a shared package, but that seems
beyond the scope of a bug-fix PR.
EndpointController was accidentally requiring all headless services to
be IPv4-only in clusters with IPv6DualStack enabled.
This still leaves "legacy" (i.e., IPFamily-less) headless services as
always IPv4-only, because the controller doesn't currently have easy
access to the information that would allow it to fix that.
(EndpointSliceController had the same problem already, and still
does.) This can be fixed, if needed, by manually setting IPFamily,
and the proposed API for 1.20 will handle this situation better.
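A minimal sketch of what the fixed behavior looks like, assuming the 1.19-era singular Service.Spec.IPFamily field and the k8s.io/utils/net helpers (the helper name below is hypothetical):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	utilnet "k8s.io/utils/net"
)

// podIPForService picks the pod IP matching the Service's IP family
// instead of always taking the primary (IPv4) address.
func podIPForService(pod *v1.Pod, svc *v1.Service) (string, error) {
	wantIPv6 := svc.Spec.IPFamily != nil && *svc.Spec.IPFamily == v1.IPv6Protocol
	for _, podIP := range pod.Status.PodIPs {
		if utilnet.IsIPv6String(podIP.IP) == wantIPv6 {
			return podIP.IP, nil
		}
	}
	return "", fmt.Errorf("no pod IP matching the service IP family for %s/%s", pod.Namespace, pod.Name)
}
```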
Rewrite some of the test helpers to better support single-stack IPv4
vs single-stack IPv6 vs dual-stack IPv4 primary vs dual-stack IPv6
primary, and update TestPodToEndpointAddressForService to test some
more cases.
The endpoint controllers responded to Pod changes by trying to figure
out whether the generated endpoint resource would change, rather than
just checking whether the Pod had changed. But since the set of Pod
fields that needs to be checked depends on the Service and Node as well,
the code ended up checking for only a subset of the changes it should have.
In particular, EndpointSliceController ended up only looking at IPv4
Pod IPs when processing Pod update events, so when a Pod went from
having no IP to having only an IPv6 IP, EndpointSliceController would
think it hadn't changed.
This was introduced by commit f04ce3cfba.
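A minimal sketch of the revised approach, assuming apimachinery's semantic-equality helper; the function and the exact set of fields compared are illustrative, not the in-tree code:

```go
package main

import (
	v1 "k8s.io/api/core/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
)

// podEndpointsChanged asks "did the Pod change?" directly, comparing
// every pod IP rather than recomputing the generated endpoint resource,
// so an IPv6-only address showing up counts as a change.
func podEndpointsChanged(oldPod, newPod *v1.Pod) bool {
	if !apiequality.Semantic.DeepEqual(oldPod.Status.PodIPs, newPod.Status.PodIPs) {
		return true
	}
	// Readiness and deletion affect endpoint conditions as well.
	return podReady(oldPod) != podReady(newPod) ||
		(oldPod.DeletionTimestamp == nil) != (newPod.DeletionTimestamp == nil)
}

func podReady(pod *v1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == v1.PodReady {
			return c.Status == v1.ConditionTrue
		}
	}
	return false
}
```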
Since this function is simple and clear enough, just drop its comment.
Signed-off-by: Zhou Peng <p@ctriple.cn>
The implementation consists of
- identifying all places where VolumeSource.PersistentVolumeClaim has
a special meaning and then ensuring that the same code path is taken
for an ephemeral volume, with the ownership check
- adding a controller that produces the PVCs for each embedded
VolumeSource.EphemeralVolume
- relaxing the PVC protection controller such that it removes
  the finalizer before the pod is deleted (only if the
  GenericEphemeralVolume feature is enabled): this is needed
  to break a cycle where foreground deletion of the pod blocks
  on removing the PVC, which in turn waits for deletion of the pod
The controller was derived from the endpointslices controller.
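A minimal sketch of what the new controller produces for each embedded ephemeral volume source; the helper name is hypothetical and the field names follow the current API (the text's "VolumeSource.EphemeralVolume" appears in the API as `Ephemeral`):

```go
package main

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// pvcForEphemeralVolume builds the PVC for one ephemeral volume of a pod.
// Assumes vol.Ephemeral is non-nil.
func pvcForEphemeralVolume(pod *v1.Pod, vol *v1.Volume) *v1.PersistentVolumeClaim {
	template := vol.Ephemeral.VolumeClaimTemplate
	return &v1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{
			// The PVC name is derived from the pod and volume names.
			Name:        pod.Name + "-" + vol.Name,
			Namespace:   pod.Namespace,
			Labels:      template.Labels,
			Annotations: template.Annotations,
			// The controller reference is what the ownership check
			// verifies before the PVC is used for this pod, and it lets
			// garbage collection remove the PVC with the pod.
			OwnerReferences: []metav1.OwnerReference{
				*metav1.NewControllerRef(pod, v1.SchemeGroupVersion.WithKind("Pod")),
			},
		},
		Spec: template.Spec,
	}
}
```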
endpointSliceTracker keeps a set of resource versions for each
service. Resource versions in the set can be removed when their
endpointslices are deleted, but the set itself and its key in the map
never are, leading to a memory leak.
This patch deletes the set if the service is deleted, and stops
initializing an empty set when "read-only" methods "Has" and "Stale" are
called.
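A minimal sketch of the fixed tracker, with illustrative names rather than the exact controller code: Has no longer allocates a set for an untracked service, and deleting the Service drops the whole entry:

```go
package main

import (
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

type endpointSliceTracker struct {
	lock sync.Mutex
	// service -> (EndpointSlice name -> expected resource version)
	resourceVersionsByService map[types.NamespacedName]map[string]string
}

// Has reports whether the slice is tracked, without initializing an
// empty set for a service that was never tracked (the "read-only"
// behavior described above).
func (t *endpointSliceTracker) Has(service types.NamespacedName, sliceName string) bool {
	t.lock.Lock()
	defer t.lock.Unlock()
	rvs, ok := t.resourceVersionsByService[service]
	if !ok {
		return false
	}
	_, ok = rvs[sliceName]
	return ok
}

// DeleteService removes the set and its map key, closing the leak.
func (t *endpointSliceTracker) DeleteService(service types.NamespacedName) {
	t.lock.Lock()
	defer t.lock.Unlock()
	delete(t.resourceVersionsByService, service)
}
```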
And give ownership to pkg/scheduler/framework/plugins/volumebinding
Signed-off-by: Aldo Culquicondor <acondor@google.com>
Change-Id: I4bd89b1745a2be0e458601056ab905bdd6692195
This uses the information provided by a CSI driver deployment for
checking whether a node has access to enough storage to create the
currently unbound volumes, if the CSI driver opts into that checking
with CSIDriver.Spec.VolumeCapacity != false.
This resolves a TODO from commit 95b530366a.
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
The disruption controller resyncs all of its objects every 30 seconds. This is not necessary; it makes the disruption workqueue deeper and can delay processing of actual updates when large numbers of disruptions exist.
Does this PR introduce a user-facing change?
Disruption controllers no longer force a resync every 30 seconds when nothing has changed.
Refactor and add the following metrics to the cidr_sets used by the range
allocator, under the node_ipam_controller subsystem:
- cidrset_cidrs_allocations_total
- cidrset_cidrs_releases_total
- cidrset_usage_cidrs
- cidrset_allocation_tries_per_request
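A minimal sketch of registering such metrics, using the plain Prometheus client for brevity (the real code goes through Kubernetes' metrics wrappers); help strings and bucket choices are illustrative:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

var (
	cidrsAllocations = prometheus.NewCounter(prometheus.CounterOpts{
		Subsystem: "node_ipam_controller",
		Name:      "cidrset_cidrs_allocations_total",
		Help:      "Count of CIDRs allocated from the cidr_set.",
	})
	cidrsReleases = prometheus.NewCounter(prometheus.CounterOpts{
		Subsystem: "node_ipam_controller",
		Name:      "cidrset_cidrs_releases_total",
		Help:      "Count of CIDRs released back to the cidr_set.",
	})
	usageCidrs = prometheus.NewGauge(prometheus.GaugeOpts{
		Subsystem: "node_ipam_controller",
		Name:      "cidrset_usage_cidrs",
		Help:      "Fraction of the cidr_set currently allocated.",
	})
	allocationTries = prometheus.NewHistogram(prometheus.HistogramOpts{
		Subsystem: "node_ipam_controller",
		Name:      "cidrset_allocation_tries_per_request",
		Help:      "Number of tries needed per CIDR allocation request.",
		Buckets:   prometheus.ExponentialBuckets(1, 5, 5),
	})
)

func init() {
	prometheus.MustRegister(cidrsAllocations, cidrsReleases, usageCidrs, allocationTries)
}
```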
Since the parameter 'podMap' has been removed by commit 831a2d1129
("deployment: remove unused parameter 'podMap'"), the related annotation
should also be removed.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
specifically:
- cmd/kubeadm/.import-restrictions
- we don't need to explicitly allow k8s.io repos (external or published)
- rm pkg/controller/.import-restrictions
- pkg/client/unversioned was removed in 59042
- pkg/kubectl/.import-restrictions
- pkg/printers is no longer used
- pkg/api was masking all of the pkg/apis prefixes
- rm staging/src/k8s.io/code-generator/cmd/lister-gen/.import-restrictions
- noop / empty file
- test/e2e/framework/.import-restrictions
- we don't need to explicitly allow k8s.io repos (external or published)
YAML has comments, so we can explain why we have certain rules or
certain prefixes.
For those files that weren't already commented YAML, I converted them to
YAML and took a best guess at comments based on the PRs that introduced
or updated them.
This is to avoid unnecessary GCE API calls made by the getInstanceByName
helper, which iterates over all zones to find the one in which the
VM exists.
ProviderID already contains all the information - it's in the form:
gce://<VM URL> (VM URL contains project, zone, VM name).
ProviderID is propagated by Kubelet on node registration and, in case
of bugs, backfilled by the node controller.
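A minimal sketch of the optimization, assuming the gce://<project>/<zone>/<name> layout described above; the helper name is hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// parseGCEProviderID extracts project, zone, and instance name from a
// ProviderID of the form gce://<project>/<zone>/<name>.
func parseGCEProviderID(providerID string) (project, zone, name string, err error) {
	const prefix = "gce://"
	if !strings.HasPrefix(providerID, prefix) {
		return "", "", "", fmt.Errorf("unexpected providerID format: %q", providerID)
	}
	parts := strings.Split(strings.TrimPrefix(providerID, prefix), "/")
	if len(parts) != 3 {
		return "", "", "", fmt.Errorf("unexpected providerID format: %q", providerID)
	}
	// With project and zone known, a single zonal GET replaces the
	// per-zone scan done by getInstanceByName.
	return parts[0], parts[1], parts[2], nil
}
```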
- PV has a dangling reference to a PVC
- PVC is trying to bind to a PV that already references a different PVC
Change-Id: Ic509d39808763149b02b4dd52347edb74a8803fd
* move well-known kubelet cloud provider annotations to k8s.io/cloud-provider
Signed-off-by: andrewsykim <kim.andrewsy@gmail.com>
* cloud provider: rename AnnotationProvidedIPAddr to AnnotationAlphaProvidedIPAddr to indicate alpha status
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
This makes debugging individual test cases much easier. For example:
go test -p 1 ./pkg/controller/cronjob -run TestSyncOne_Status/prev_ran_but_done,_is_time,_past_deadline -v
The EndpointSlice controller has the potential to manage a large number of resources that are updated frequently. Without proper backoffs in place, there is potential for it to unnecessarily overload the API Server with requests. This makes two significant changes: increasing the base backoff from 5ms to 1s, and delaying all syncs triggered by EndpointSlice changes by at least 1 second to enable batching.
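A minimal sketch of both changes using client-go's workqueue package; the function names are illustrative:

```go
package main

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// newServiceQueue raises the per-item exponential backoff base from the
// 5ms default to 1s.
func newServiceQueue() workqueue.RateLimitingInterface {
	return workqueue.NewRateLimitingQueue(
		workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, 100*time.Second),
	)
}

// onEndpointSliceChange delays the service sync by at least 1s so that
// multiple slice updates for the same Service collapse into one sync.
func onEndpointSliceChange(queue workqueue.RateLimitingInterface, serviceKey string) {
	queue.AddAfter(serviceKey, 1*time.Second)
}
```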
Most of these could have been refactored automatically, but it would
have been uglier: the unsophisticated tooling left lots of unnecessary
struct -> pointer -> struct transitions.
This is gross but because NewDeleteOptions is used by various parts of
storage that still pass around pointers, the return type can't be
changed without significant refactoring within the apiserver. I think
this would be good to clean up, but I want to minimize apiserver-side
changes as much as possible in the client signature refactor.
The scheduler doesn't really need to know in detail which reasons
rendered a node unusable for a pod. All it needs from the volume
binder is a list of reasons that it can then present to the user.
This seems a bit cleaner. But the main reason for the change is that
it simplifies the checking of CSI inline volumes and perhaps later
capacity checking. Both will lead to new failure reasons, which then
can be added without changing the interface.
The StatefulSet controller cleans up ControllerRevisions at the end of
the reconcile loop. If something goes wrong during reconcile, it bails
out without actually performing this step. This commit moves the cleanup
to a deferred function call to guarantee it will be executed.
Fixes issue: https://github.com/kubernetes/kubernetes/issues/85690
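A minimal sketch of the pattern, with hypothetical names rather than the actual StatefulSet controller code: the cleanup moves into a defer so it runs even when reconciliation bails out early with an error.

```go
package main

import apps "k8s.io/api/apps/v1"

type control struct{}

func (c *control) listRevisions(set *apps.StatefulSet) ([]string, error)            { return nil, nil }
func (c *control) truncateHistory(set *apps.StatefulSet, revisions []string) error  { return nil }
func (c *control) reconcile(set *apps.StatefulSet, revisions []string) error        { return nil }

func (c *control) updateStatefulSet(set *apps.StatefulSet) (err error) {
	revisions, err := c.listRevisions(set)
	if err != nil {
		return err
	}
	// Deferred: old ControllerRevisions are cleaned up even if
	// reconcile returns early with an error.
	defer func() {
		if cleanupErr := c.truncateHistory(set, revisions); cleanupErr != nil && err == nil {
			err = cleanupErr
		}
	}()
	return c.reconcile(set, revisions)
}
```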
During EndpointSlice reconciliation, EndpointSliceTracker is supposed to
track expected EndpointSlice resource versions so that external changes
to them can be detected. But it actually tracked the stale resource
version, which resulted in every Service being handled twice: the
controller always received an EndpointSlice update with a different
resource version, even though that EndpointSlice had actually been
created/updated by the controller itself during the first pass.
Using t.Run() has some advantages:
- individual sub-tests can be selected with -run
- test failures are automatically associated with
the scenario, therefore the test name no longer
needs to be passed around
A "run" function is used primarily to avoid large indention changes,
but also because this can be used to extend the test with additional
parameters later on.
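A minimal sketch of the structure, with hypothetical test names and parameters: the "run" function keeps the original body's indentation, and t.Run gives each scenario a selectable name.

```go
package cronjob

import "testing"

func TestSyncOne(t *testing.T) {
	run := func(t *testing.T, ranPreviously, isTime bool) {
		// ... original test body; failures use t directly, so the
		// scenario name no longer needs to be passed around ...
	}
	testCases := map[string]struct {
		ranPreviously bool
		isTime        bool
	}{
		"never ran, not time":        {false, false},
		"prev ran but done, is time": {true, true},
	}
	for name, tc := range testCases {
		tc := tc
		t.Run(name, func(t *testing.T) {
			run(t, tc.ranPreviously, tc.isTime)
		})
	}
}
```

Individual scenarios can then be selected with, e.g., `-run TestSyncOne/never_ran,_not_time`.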
Add logic in service_controller to skip create/update
if a finalizer from a different controller is found.
The newly added finalizer will be checked by other controllers
implementing ILB services to determine whether a given service is
already being managed by service_controller.
Moved the finalizer check into cloud provider code.
Added a unit test to verify the new finalizer.
Modified an existing unit test to create a fake service so that
the attach/remove finalizer step can be tested.
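A minimal sketch of the skip logic; the helper and the idea of passing in the foreign finalizer keys are illustrative, not the in-tree code:

```go
package main

import v1 "k8s.io/api/core/v1"

// managedByOtherController reports whether some other load-balancer
// implementation has already claimed the Service via its own finalizer,
// in which case the service controller should skip create/update.
func managedByOtherController(svc *v1.Service, foreignFinalizers []string) bool {
	for _, f := range svc.Finalizers {
		for _, foreign := range foreignFinalizers {
			if f == foreign {
				return true
			}
		}
	}
	return false
}
```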
A transient issue might occur that causes an error to be returned by
InstanceID(). When this is ignored, the external cloud provider taint
will be removed and neither AddCloudNode() nor UpdateCloudNode() will
try to set a provider ID in the future.
By returning the error we can ensure that the external cloud provider
taint is not removed prematurely, allowing the operation to be retried
(until the provider ID can be set).
Preserve support for external cloud providers that do not use IDs by
continuing if a NotImplemented error is returned, making a distinction
between lack of support for provider IDs and an actual error.
Introduce a pair of unit tests that show a provider ID will eventually
be set if an error is returned, unless that error is NotImplemented,
in which case the external cloud provider taint will be removed.
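A minimal sketch of the revised flow, assuming the k8s.io/cloud-provider Instances interface and its NotImplemented sentinel error; the wrapper function is hypothetical:

```go
package main

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	cloudprovider "k8s.io/cloud-provider"
)

// providerIDForNode distinguishes "IDs unsupported" from real failures:
// only NotImplemented lets initialization continue without a provider
// ID; any other error is returned so the cloud taint stays in place and
// the operation is retried.
func providerIDForNode(ctx context.Context, instances cloudprovider.Instances, nodeName types.NodeName) (string, error) {
	instanceID, err := instances.InstanceID(ctx, nodeName)
	if err == cloudprovider.NotImplemented {
		// The provider doesn't support IDs at all; proceed without one.
		return "", nil
	}
	if err != nil {
		// Possibly transient; keep the taint and retry later.
		return "", err
	}
	return instanceID, nil
}
```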
The `recorder.PastEventf` method wasn't actually working as advertised.
It was supposed to accept a timestamp, which would be used when
generating the event. However, as the
[source code](https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/record/event.go#L316)
shows, this `timestamp` was never actually used.
In other words, `PastEventf` is identical to `Eventf`.
We have two options: one would be to fix `PastEventf` so that it works
as advertised. The other would be to delete `PastEventf` and only
support `Eventf`.
Ultimately, I could only find one use of `PastEventf` in the code base,
so I propose we just delete `PastEventf` and convert all uses to
`Eventf`.
This adds a new EndpointSlice tracker to keep track of the expected resource versions of EndpointSlices associated with each Service managed by the EndpointSlice controller. This should prevent a potential race where a syncService call could happen with an incomplete view of EndpointSlices if additions or deletions hadn't fully propagated to the cache yet. Additionally, this ensures that external changes to EndpointSlices will be handled by the EndpointSlice controller.
This ended up causing far more problems than it was worth, especially
given that it just attempted to provide backwards compatibility with
the alpha release.
This patch removes pkg/util/mount completely, and replaces it with the
mount package now located at k8s.io/utils/mount. The code found at
k8s.io/utils/mount was moved there from pkg/util/mount, so the code is
identical, just no longer in-tree to k/k.
- update tests
- add comment
- amend var name
- update comment
- add check for empty slice
- fix tests
- fix mask size in test
- review feedback
- add ipv4 and ipv6 flag for mask sizes
- add to violation exception list
- remove import alias
- run update-openapi-spec
- review feedback
- run update-bazel
- review feedback
- review feedback
This patch removes mount.Exec entirely and instead uses the common
utility from k8s.io/utils/exec.
The fake exec implementation found in k8s.io/utils/exec differs a bit
from mount.Exec, with the ability to pre-script expected calls to
Command.CombinedOutput(), so tests that previously relied on a callback
mechanism to produce specific output have been updated to use that
mechanism.
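A minimal sketch of the pre-scripted style, assuming the k8s.io/utils/exec/testing package; the canned output is illustrative:

```go
package main

import (
	"k8s.io/utils/exec"
	fakeexec "k8s.io/utils/exec/testing"
)

// newFakeExec scripts each expected CombinedOutput call up front instead
// of producing output via a callback.
func newFakeExec() *fakeexec.FakeExec {
	fcmd := fakeexec.FakeCmd{
		CombinedOutputScript: []fakeexec.FakeAction{
			// The first call to CombinedOutput returns this output.
			func() ([]byte, []byte, error) { return []byte("fsck output"), nil, nil },
		},
	}
	return &fakeexec.FakeExec{
		CommandScript: []fakeexec.FakeCommandAction{
			func(cmd string, args ...string) exec.Cmd {
				return fakeexec.InitFakeCmd(&fcmd, cmd, args...)
			},
		},
	}
}
```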
This adds a new Label to EndpointSlices that will ensure that multiple
controllers or entities can manage subsets of EndpointSlices. This
label provides a way to indicate the controller or entity responsible
for managing an EndpointSlice.
To provide a seamless upgrade from the alpha release of EndpointSlices
that did not support this label, a temporary annotation has been added
on Services to indicate that this label has been initially set on
EndpointSlices. That annotation will be set automatically by the
EndpointSlice controller with this commit once appropriate Labels have
been added on the corresponding EndpointSlices.
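A minimal sketch of setting the label, assuming the discovery/v1beta1 API's LabelManagedBy key; the controller-name value shown is illustrative:

```go
package main

import discovery "k8s.io/api/discovery/v1beta1"

// labelSlice marks an EndpointSlice as managed by the given controller
// so other entities can skip (or select) it by this label.
func labelSlice(slice *discovery.EndpointSlice, controllerName string) {
	if slice.Labels == nil {
		slice.Labels = map[string]string{}
	}
	slice.Labels[discovery.LabelManagedBy] = controllerName
}
```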
This was an oversight in the initial EndpointSlice release. This update
will ensure that Endpoints and EndpointSlices use the same logic to set
the Hostname attribute.
The Service spec includes a PublishNotReadyAddresses field which has
been used by Endpoints to report all matching addresses as ready. This
may or may not have been the initial purpose of the field, but given the
desire to provide backwards compatibility with the Endpoints API here,
it seems to make sense to continue to provide the same functionality.
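A minimal sketch of how the field carries over, with a hypothetical helper name: a pod is published as ready when it actually is, or when the Service asks for not-ready addresses to be published.

```go
package main

import v1 "k8s.io/api/core/v1"

func endpointReady(pod *v1.Pod, svc *v1.Service) bool {
	if svc.Spec.PublishNotReadyAddresses {
		return true
	}
	for _, c := range pod.Status.Conditions {
		if c.Type == v1.PodReady {
			return c.Status == v1.ConditionTrue
		}
	}
	return false
}
```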
Instead of reporting an event or displaying an error, simply exit
when the namespace is being terminated. This reduces the amount of
controller churn on namespace shutdown. Unlike other controllers, we
drop the replica set create error very late (in the queue handleErr)
in order to avoid changing the structure of the controller
substantially.
Instead of reporting an event or displaying an error, simply exit
when the namespace is being terminated. This reduces the amount of
controller churn on namespace shutdown. While we could technically
exit the entire processing loop early for very large jobs,
we should wait for more evidence that is an issue before changing
that logic substantially.
Instead of reporting an event or displaying an error, simply exit
when the namespace is being terminated. This reduces the amount of
controller churn on namespace shutdown. While we could technically
exit the entire processing loop early for very large daemon sets,
we should wait for more evidence that is an issue before changing
that logic substantially.
Instead of reporting an event or displaying an error, simply exit
when the namespace is being terminated. This reduces the amount of
controller churn on namespace shutdown. While we could technically
exit the entire processing loop early for very large replica sets,
we should wait for more evidence that is an issue before changing
that logic substantially.
In some scenarios the service account and token controllers can
race with namespace deletion, causing a burst of errors as they
attempt to recreate secrets being deleted.
Instead, detect these errors and do not retry.
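A minimal sketch of the detection used across these controllers, assuming apimachinery's HasStatusCause helper and the NamespaceTerminating cause the API server attaches to creates rejected in a terminating namespace:

```go
package main

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
)

// isTerminating reports whether a create failed only because the
// namespace is being deleted; such errors should not be retried or
// reported as events.
func isTerminating(err error) bool {
	return errors.HasStatusCause(err, v1.NamespaceTerminatingCause)
}
```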
When scaling down a ReplicaSet, delete doubled up replicas first, where a
"doubled up replica" is defined as one that is on the same node as an
active replica belonging to a related ReplicaSet. ReplicaSets are
considered "related" if they have a common controller (typically a
Deployment).
The intention of this change is to make a rolling update of a Deployment
scale down the old ReplicaSet as it scales up the new ReplicaSet by
deleting pods from the old ReplicaSet that are colocated with ready pods of
the new ReplicaSet. This change in the behavior of rolling updates can be
combined with pod affinity rules to preserve the locality of a Deployment's
pods over rollout.
A specific scenario that benefits from this change is when a Deployment's
pods are exposed by a Service that has type "LoadBalancer" and external
traffic policy "Local". In this scenario, the load balancer uses health
checks to determine whether it should forward traffic for the Service to a
particular node. If the node has no local endpoints for the Service, the
health check will fail for that node. Eventually, the load balancer will
stop forwarding traffic to that node. In the meantime, the service proxy
drops traffic for that Service. Thus, in order to reduce the risk of
dropping traffic during a rolling update, it is desirable to preserve
node locality of endpoints.
* pkg/controller/controller_utils.go (ActivePodsWithRanks): New type to
sort pods using a given ranking (see the sketch after this list).
* pkg/controller/controller_utils_test.go (TestSortingActivePodsWithRanks):
New test for ActivePodsWithRanks.
* pkg/controller/replicaset/replica_set.go
(getReplicaSetsWithSameController): New method. Given a ReplicaSet, return
all ReplicaSets that have the same owner.
(manageReplicas): Call getIndirectlyRelatedPods, and pass its result to
getPodsToDelete.
(getIndirectlyRelatedPods): New method. Given a ReplicaSet, return all
pods that are owned by any ReplicaSet with the same owner.
(getPodsToDelete): Add an argument for related pods. Use related pods and
the new getPodsRankedByRelatedPodsOnSameNode function to take into account
whether a pod is doubled up when sorting pods for deletion.
(getPodsRankedByRelatedPodsOnSameNode): New function. Return an
ActivePodsWithRanks value that wraps the given slice of pods and computes
ranks where each pod's rank is equal to the number of active related pods
that are colocated on the same node.
* pkg/controller/replicaset/replica_set_test.go (newReplicaSet): Set
OwnerReferences on the ReplicaSet.
(newPod): Set a unique UID on the pod.
(byName): New type to sort pods by name.
(TestGetReplicaSetsWithSameController): New test for
getReplicaSetsWithSameController.
(TestRelatedPodsLookup): New test for getIndirectlyRelatedPods.
(TestGetPodsToDelete): Augment the "various pod phases and conditions, diff
= len(pods)" test case to ensure that scale-down still selects doubled-up
pods if there are not enough other pods to scale down. Add a "various pod
phases and conditions, diff = len(pods), relatedPods empty" test case to
verify that getPodsToDelete works even if related pods could not be
determined. Add a "ready and colocated with another ready pod vs not
colocated, diff < len(pods)" test case to verify that a doubled-up pod gets
preferred for deletion. Augment the "various pod phases and conditions,
diff < len(pods)" test case to ensure that not-ready pods are preferred
over ready but doubled-up pods.
* pkg/controller/replicaset/BUILD: Regenerate.
* test/e2e/apps/deployment.go
(testRollingUpdateDeploymentWithLocalTrafficLoadBalancer): New end-to-end
test. Create a deployment with a rolling update strategy and affinity
rules and a load balancer with "Local" external traffic policy, and verify
that the set of nodes with local endpoints for the service remains unchanged
during rollouts.
(setAffinity): New helper, used by
testRollingUpdateDeploymentWithLocalTrafficLoadBalancer.
* test/e2e/framework/service/jig.go (GetEndpointNodes): Factor building the
set of node names out...
(GetEndpointNodeNames): ...into this new method.
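A minimal sketch of the ActivePodsWithRanks idea referenced in the list above, not the exact in-tree implementation: doubled-up pods (higher rank) sort first so they are preferred for deletion, with readiness still taking precedence.

```go
package main

import v1 "k8s.io/api/core/v1"

type ActivePodsWithRanks struct {
	Pods []*v1.Pod
	// Rank[i] is the number of active related pods colocated on the
	// same node as Pods[i].
	Rank []int
}

func (s ActivePodsWithRanks) Len() int { return len(s.Pods) }

func (s ActivePodsWithRanks) Swap(i, j int) {
	s.Pods[i], s.Pods[j] = s.Pods[j], s.Pods[i]
	s.Rank[i], s.Rank[j] = s.Rank[j], s.Rank[i]
}

// Less orders pods so that the ones to delete first sort earlier.
func (s ActivePodsWithRanks) Less(i, j int) bool {
	// Not-ready pods are deleted before ready ones.
	if ready := isReady(s.Pods[i]); ready != isReady(s.Pods[j]) {
		return !ready
	}
	// Among equally-ready pods, prefer deleting the doubled-up one.
	return s.Rank[i] > s.Rank[j]
}

func isReady(pod *v1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == v1.PodReady {
			return c.Status == v1.ConditionTrue
		}
	}
	return false
}
```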