- PreemptionByKubeScheduler (Pod preempted by kube-scheduler)
- DeletionByTaintManager (Pod deleted by taint manager due to NoExecute taint)
- EvictionByEvictionAPI (Pod evicted by Eviction API)
- DeletionByPodGC (an orphaned Pod deleted by PodGC)
- PreemptedByScheduler (Pod preempted by kube-scheduler)
DaemonSetsController adds a "nodeName" index to PodIndexer, which is
redundant with the "spec.nodeName" index of NodeLifecycleController.
However, DaemonSetsController hasn't been using this index since #86730.
This patch removes the redundant and unused index to reduce memory and
CPU spent on it.
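For context, a minimal sketch of how such a "nodeName" index is typically
registered on a shared Pod informer with client-go; this is illustrative,
not the removed DaemonSetsController code:

    package sketch

    import (
        v1 "k8s.io/api/core/v1"
        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/cache"
    )

    // addNodeNameIndex registers an index mapping spec.nodeName to the Pods
    // scheduled on that node. Maintaining such an index costs memory and CPU
    // on every Pod add/update/delete, which is why an unused one is worth
    // removing.
    func addNodeNameIndex(client kubernetes.Interface) error {
        factory := informers.NewSharedInformerFactory(client, 0)
        podInformer := factory.Core().V1().Pods().Informer()
        return podInformer.AddIndexers(cache.Indexers{
            "nodeName": func(obj interface{}) ([]string, error) {
                pod, ok := obj.(*v1.Pod)
                if !ok {
                    return nil, nil
                }
                return []string{pod.Spec.NodeName}, nil
            },
        })
    }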
Signed-off-by: Quan Tian <qtian@vmware.com>
The field replicaChange in timestampedScaleEvent was wrongly described
as either positive or negative depending on the scale direction. In
fact the change is stored as an unsigned value: positive or 0, even for
downscales.
- Run hack/update-codegen.sh
- Run hack/update-generated-device-plugin.sh
- Run hack/update-generated-protobuf.sh
- Run hack/update-generated-runtime.sh
- Run hack/update-generated-swagger-docs.sh
- Run hack/update-openapi-spec.sh
- Run hack/update-gofmt.sh
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
The evictorLock only protects zonePodEvictor and zoneNoExecuteTainter.
processTaintBaseEviction showed indications of increased lock contention
among goroutines (see issue 110341 for more details).
The refactor done is to ensure that all codepaths in that function that
hold the evictorLock AND make API calls under the lock, are now making
API calls outside the lock and the lock is held only for accessing either
zonePodEvictor or zoneNoExecuteTainter or both.
Two other places where the refactor was done is the doEvictionPass and
doNoExecuteTaintingPass functions which make multiple API calls under
the evictorLock.
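A minimal sketch of the pattern, assuming a simplified controller type;
the real zonePodEvictor/zoneNoExecuteTainter structures differ, but the
idea is the same: snapshot the work under the lock, then make the API
calls after releasing it:

    package sketch

    import (
        "context"
        "sync"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/kubernetes"
    )

    // evictor is an illustrative stand-in for the controller state guarded by
    // evictorLock; zonePodEvictor is modeled as a simple per-zone queue.
    type evictor struct {
        client         kubernetes.Interface
        evictorLock    sync.Mutex
        zonePodEvictor map[string][]types.NamespacedName
    }

    // doEvictionPass drains the queues while holding the lock and releases it
    // before making any API calls, so slow requests do not block goroutines
    // that only need zonePodEvictor or zoneNoExecuteTainter.
    func (e *evictor) doEvictionPass(ctx context.Context) {
        e.evictorLock.Lock()
        var toEvict []types.NamespacedName
        for zone, pods := range e.zonePodEvictor {
            toEvict = append(toEvict, pods...)
            e.zonePodEvictor[zone] = nil
        }
        e.evictorLock.Unlock()

        // API calls happen outside the critical section.
        for _, p := range toEvict {
            _ = e.client.CoreV1().Pods(p.Namespace).Delete(ctx, p.Name, metav1.DeleteOptions{})
        }
    }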
Signed-off-by: Madhav Jivrajani <madhav.jiv@gmail.com>
Fix a TODO by plumbing an update filter from above into the resource
quota monitor code that handles update events for quota-able objects,
instead of hard-coding the logic in the resource quota monitor.
Signed-off-by: Andy Goldstein <andy.goldstein@redhat.com>
The 6 minute force-detach timeout should be used only for nodes that are
not healthy.
In case a CSI driver is being upgraded or is simply slow, NodeUnstage
can take more than 6 minutes. In that case, the Pod is already deleted from
the API server and thus the A/D controller will force-detach a mounted
volume, possibly corrupting the volume and breaking CSI - a CSI driver
expects NodeUnstage to succeed before Kubernetes can call ControllerUnpublish.
When a Pod references a Node that doesn't exist in the local
informer cache, the current behavior is to return an error so the
Service is retried later, and to stop processing.
However, a single missing node can then leave an EndpointSlice
stuck: it can no longer reflect other changes, or even be created.
This also doesn't respect the publishNotReadyAddresses option
on Services, which considers it acceptable to publish Pod addresses
that are known to not be ready.
The new behavior keeps retrying the problematic Service, but it
keeps processing the updates, reflecting the current state in the
EndpointSlice. If publishNotReadyAddresses is set, a missing
node on a Pod is not treated as an error.
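A minimal sketch of the resulting decision, using simplified names that
are not the actual reconciler code:

    package sketch

    import (
        "fmt"

        v1 "k8s.io/api/core/v1"
        discovery "k8s.io/api/discovery/v1"
    )

    // endpointForPod illustrates the behavior described above: a missing Node
    // is only an error when the Service does not publish not-ready addresses,
    // and even then the caller keeps reconciling the remaining Pods and simply
    // retries the Service later.
    func endpointForPod(pod *v1.Pod, svc *v1.Service, nodeFound bool) (*discovery.Endpoint, error) {
        if !nodeFound && !svc.Spec.PublishNotReadyAddresses {
            return nil, fmt.Errorf("node %q not found for pod %s/%s",
                pod.Spec.NodeName, pod.Namespace, pod.Name)
        }
        ready := nodeFound // simplified readiness for the sketch
        return &discovery.Endpoint{
            Addresses:  []string{pod.Status.PodIP},
            Conditions: discovery.EndpointConditions{Ready: &ready},
            NodeName:   &pod.Spec.NodeName,
        }, nil
    }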
There is always a placeholder slice.
The ServicePortCache logic always assumed one EndpointSlice
per Endpoint, but if there are multiple empty Endpoints, we just
use one placeholder slice, not multiple placeholder slices.
Fixes Issue 108231 by checking `slicesToDelete` in the EndpointSlice
reconciler for a pre-existing placeholder slice.
Also adds a helper function for comparing the slices.
Replace a pre-existing failed condition with the new condition
in case its status is False or Unknown.
In case the status of the pre-existing condition is true we ignore the new
condition. If there is no pre-existing failed condition, then append
the new failed condition as before.
Also, make the condition comparisons less hacky by ignoring timestamp fields
in tests.
Terminal Pods, whose phase is Failed or Succeeded, are guaranteed
to never regress and to be stopped, so their IPs should never
be published in the Endpoints.
* feat: Provide previous replica count for deployment/replica set scale up/down event
Signed-off-by: GitHub <noreply@github.com>
* change format of event
Co-authored-by: Maciej Szulik <soltysh@gmail.com>
In some rare race conditions, the job controller might create new pods after the job is declared finished.
Change-Id: I8a00429c8845463259cd7f82bb3c241d0011583c
When calculating the scale-up/scale-down limit, the number of replicas
at the start of the scaling policy period is now calculated correctly by
taking into account the number of scaled-up and scaled-down replicas.
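For clarity, the corrected arithmetic looks roughly like this (names are
illustrative, not the exact HPA helper):

    package sketch

    // periodStartReplicas reconstructs how many replicas the workload had at
    // the start of the scaling policy period by undoing the scale events that
    // happened within the period.
    func periodStartReplicas(currentReplicas, scaledUpInPeriod, scaledDownInPeriod int32) int32 {
        return currentReplicas - scaledUpInPeriod + scaledDownInPeriod
    }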
Signed-off-by: Olivier Michaelis <38879457+oliviermichaelis@users.noreply.github.com>
Remove the comment "As of v1.22, this field is beta and is controlled
via the CSRDuration feature gate" from the expirationSeconds field's
godoc.
Mark the "CSRDuration" feature gate as GA in 1.24, lock its value to
"true", and remove the various logic which handled when the gate was
"false".
Update conformance test to check that the CertificateSigningRequest's
Spec.ExpirationSeconds field is stored, but do not check if the field
is honored since this functionality is optional.
This patch aims to simplify decoupling "pkg/scheduler/framework/plugins"
from internal "k8s.io/kubernetes" packages. More details are in
issue #89930 and PR #102953.
Some helpers from "k8s.io/kubernetes/pkg/controller/volume/persistentvolume"
package moved to "k8s.io/component-helpers/storage/volume" package:
- IsDelayBindingMode
- GetBindVolumeToClaim
- IsVolumeBoundToClaim
- FindMatchingVolume
- CheckVolumeModeMismatches
- CheckAccessModes
- GetVolumeNodeAffinity
Also "CheckNodeAffinity" from "k8s.io/kubernetes/pkg/volume/util"
package moved to "k8s.io/component-helpers/storage/volume" package
to prevent diamond dependency conflict.
Signed-off-by: Konstantin Misyutin <konstantin.misyutin@huawei.com>
- Lock feature gate to true and schedule for deletion in 1.26
- Remove checks on feature gate
- Graduate E2E test to Conformance
Change-Id: I6814819d318edaed5c86dae4055f4b050a4d39fd
All the controllers should use a context for signalling termination of communication with the API server. Once kube-controller-manager (kcm) cancels the context, all the cert controllers started via kcm should cancel their in-flight API server requests instead of hanging around.
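A minimal sketch of the intended wiring, assuming a generic controller
loop; the CSR list call is just an example of an API request that should
observe the same context:

    package sketch

    import (
        "context"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
    )

    // run passes the same ctx that stops the loop into every client-go call,
    // so cancelling it (as kcm does on shutdown) also cancels requests that
    // are still in flight instead of leaving them hanging.
    func run(ctx context.Context, client kubernetes.Interface) {
        wait.UntilWithContext(ctx, func(ctx context.Context) {
            _, err := client.CertificatesV1().CertificateSigningRequests().List(ctx, metav1.ListOptions{})
            if err != nil {
                // On shutdown err wraps context.Canceled; UntilWithContext
                // then stops because ctx is done.
                return
            }
        }, time.Minute)
    }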
The field is not used anywhere and its value may be stale, as Endpoints
and EndpointSlice won't be updated if there is only a Pod ResourceVersion
change.
- actual_state_of_world_test.go: test the new method GetVolumesToReportAttachedForNode
for an existing node and a non-existing node
- node_status_updater_test.go: test UpdateNodeStatuses and UpdateNodeStatusForNode in the nominal
case with 2 nodes getting one volume each. Test UpdateNodeStatuses with the first call
to node.patch failing but the following one succeeding
- add comment in node_status_updater.go
- fix log line in reconciler.go
- rename variable in actual_state_of_world.go
The UpdateNodeStatuses code stops too early in case there is
an error when calling updateNodeStatus. It returns immediately,
which means any remaining node won't have its update status put back
to true.
Looking at the call sites for UpdateNodeStatuses, it appears this is
not the only issue. If the lister call fails with anything but a Not Found
error, it's silently ignored which is wrong in the detach path.
Also, the reconciler detach path calls UpdateNodeStatuses, but the real intent
is to only update the node currently processed in the loop and to not proceed
with the detach call if there is an error updating that specific node's
volumesAttached property. With the current implementation, it will not proceed
if there is an error updating another node (which is not completely bad, but
not ideal) and, worse, it will proceed if there is a lister error on that node,
which means the node's volumesAttached property won't have been updated.
To fix those issues, introduce the following changes:
- [node_status_updater] introduce UpdateNodeStatusForNode which does what
UpdateNodeStatuses does but only for the provided node
- [node_status_updater] if the node lister call fails for anything but a Not
Found error, we will return an error, not ignore it
- [node_status_updater] if the update of a node's volumesAttached property fails,
we continue processing the other nodes
- [actual_state_of_world] introduce GetVolumesToReportAttachedForNode which
does what GetVolumesToReportAttached does, but for the node whose name is
provided. It returns a bool which indicates whether the node in question needs
an update, as well as the volumesAttached list. It is used by
UpdateNodeStatusForNode
- [actual_state_of_world] use write lock in updateNodeStatusUpdateNeeded, we're
modifying the map content
- [reconciler] use UpdateNodeStatusForNode in the detach loop
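A minimal sketch of the resulting control flow, with stand-in types; only
the method names come from the change above, the bodies are illustrative:

    package sketch

    import (
        "k8s.io/apimachinery/pkg/types"
        utilerrors "k8s.io/apimachinery/pkg/util/errors"
    )

    // nodeStatusUpdater stands in for the real updater: updateOne plays the
    // role of UpdateNodeStatusForNode. The point of the sketch is the control
    // flow: one failing node no longer aborts the whole pass.
    type nodeStatusUpdater struct {
        nodesToUpdate []types.NodeName
        updateOne     func(types.NodeName) error // e.g. UpdateNodeStatusForNode
    }

    func (nsu *nodeStatusUpdater) UpdateNodeStatuses() error {
        var errs []error
        for _, nodeName := range nsu.nodesToUpdate {
            if err := nsu.updateOne(nodeName); err != nil {
                // Record the error but keep processing the remaining nodes so
                // their volumesAttached status still gets patched.
                errs = append(errs, err)
            }
        }
        return utilerrors.NewAggregate(errs)
    }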
When comparing EndpointSubsets and Endpoints, we ignore the difference
in the ResourceVersion of the Pod to avoid unnecessary updates caused by
Pod updates that we don't care about, e.g. annotation updates.
Otherwise, the periodic Service resync would intensively update Endpoints
or EndpointSlices whose Pods have irrelevant changes between two resyncs,
leading to delays in processing newly created Services. In a large-scale
cluster with thousands of such Endpoints, we observed 2 minutes of
delay when the resync happens.
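A minimal sketch of the kind of comparison this implies; the helper names
are illustrative, not the controller's actual code:

    package sketch

    import (
        v1 "k8s.io/api/core/v1"
        apiequality "k8s.io/apimachinery/pkg/api/equality"
    )

    // endpointAddressesEqual compares two endpoint addresses while ignoring
    // the ResourceVersion of their Pod TargetRef, so a Pod update that only
    // touches e.g. annotations does not look like an Endpoints change.
    func endpointAddressesEqual(a, b v1.EndpointAddress) bool {
        return apiequality.Semantic.DeepEqual(stripTargetRefRV(a), stripTargetRefRV(b))
    }

    func stripTargetRefRV(addr v1.EndpointAddress) v1.EndpointAddress {
        if addr.TargetRef != nil {
            ref := *addr.TargetRef
            ref.ResourceVersion = ""
            addr.TargetRef = &ref
        }
        return addr
    }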
Remove the `tolerate-unready-endpoints` annotation on Services, deprecated
since 1.11; use `Service.spec.publishNotReadyAddresses` instead.
Signed-off-by: He Xiaoxi <tossmilestone@gmail.com>
In the following code pattern, the log message will get logged with v=0 in JSON
output although conceptually it has a higher verbosity:
    if klog.V(5).Enabled() {
        klog.Info("hello world")
    }
Having the actual verbosity in the JSON output is relevant, for example for
filtering out only the important info messages. The solution is to use
klog.V(5).Info or something similar.
Whether the outer if is necessary at all depends on how complex the parameters
are. The return value of klog.V can be captured in a variable and be used
multiple times to avoid the overhead for that function call and to avoid
repeating the verbosity level.
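For example, the following keeps the Enabled() guard for expensive
argument preparation while still logging at the right verbosity:

    // The message is logged with v=5 in structured/JSON output, and klog.V(5)
    // is evaluated only once.
    if loggerV := klog.V(5); loggerV.Enabled() {
        loggerV.Info("hello world")
    }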
The ServerResources function was deprecated and the ServerGroupsAndResources
function is suggested instead.
This PR removes ServerResources and moves every call site to
ServerGroupsAndResources.
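A minimal sketch of the replacement call; callers that only needed the
resource lists can ignore the first return value:

    package sketch

    import (
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/discovery"
    )

    // listResources uses ServerGroupsAndResources, which returns the API
    // groups and the resource lists together.
    func listResources(dc discovery.DiscoveryInterface) ([]*metav1.APIResourceList, error) {
        _, resourceLists, err := dc.ServerGroupsAndResources()
        return resourceLists, err
    }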
Test 5-7 tries to delete a PVC at the very same time when it detects that
the PV controller started processing the PVC. The controller then sometimes
can't update the PVC and generate an event for it that the test expects.
From PV controller logs (not shown in CI):
> I1221 14:36:34.548160 104481 pv_controller.go:815] updating PersistentVolumeClaim[default/claim5-7] status: set phase Lost failed: cannot update claim claim5-7: claim not found
Typical error in CI:
> FAIL: TestControllerSync (83.22s)
> framework_test.go:202: Event "Warning ClaimLost" not emitted
Therefore wait for the PVC to be fully processed before deleting the PVC to
avoid races.
Add fake Pod and Node watchers to the tests. It only reduces test noise:
Failed to watch *v1.Pod: unhandled watch: testing.WatchActionImpl{ActionImpl:testing.ActionImpl{Namespace:"", Verb:"watch", Resource:schema.GroupVersionResource{Group:"", Version:"v1", Resource:"pods"}, Subresource:""}, WatchRestrictions:testing.WatchRestrictions{Labels:labels.internalSelector(nil), Fields:fields.andTerm{}, ResourceVersion:""}}
The CRON_TZ variable slipped in while upgrading the github.com/robfig/cron
library. It allows setting a time zone, which is a long-requested
feature but one that is not officially supported. This adds a warning
event, since users should not rely on unsupported features.
Signed-off-by: wangyysde <net_use@bzhy.com>
Generate swagger.json.
Use v2 path for hpa_cpu_field.
run update-codegen.sh
Signed-off-by: wangyysde <net_use@bzhy.com>
* fix: 81134: fix unsafe json for ReleaseControllerRevision
1. Ensures that ReleaseControllerRevision returns proper JSON by
marshalling an object into bytes; otherwise, it returns an error.
2. Also refactors the code to commonize the merge patch into
GenerateDeleteOwnerRefStrategicMergeBytes, which returns a byte slice and
is used across ReleasePod, ReleaseControllerRevision and
ReleaseReplicaSet.
* Move GeneratePatchBytesForDelete to controller_ref_manager
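A minimal sketch of the idea, assuming a patch layout like the one the
controllers use; the function name and exact struct shape here are
illustrative, not the real helper:

    package sketch

    import (
        "encoding/json"

        "k8s.io/apimachinery/pkg/types"
    )

    // generateDeleteOwnerRefPatch builds the strategic-merge patch that drops
    // the given owner references by marshalling a value, instead of formatting
    // JSON strings by hand (which cannot fail loudly on bad input).
    func generateDeleteOwnerRefPatch(dependentUID types.UID, ownerUIDs ...types.UID) ([]byte, error) {
        refs := make([]map[string]string, 0, len(ownerUIDs))
        for _, uid := range ownerUIDs {
            refs = append(refs, map[string]string{"$patch": "delete", "uid": string(uid)})
        }
        patch := map[string]interface{}{
            "metadata": map[string]interface{}{
                "uid":             dependentUID,
                "ownerReferences": refs,
            },
        }
        return json.Marshal(patch)
    }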
This feature has graduated to GA in v1.11 and will always be
enabled, so there is no longer a need to check whether it is enabled.
Signed-off-by: Konstantin Misyutin <konstantin.misyutin@huawei.com>
* fix_dsc_rbac_pod_update
* add test for DaemonSet Controller updates label of the pod after "DedupCurHistories"
* rebase
* update parameter of dsc.Run
Since the feature gate is locked, the tests that run with this feature
disabled would crash. So we should remove them together with locking
the feature.
Signed-off-by: Konstantin Misyutin <konstantin.misyutin@huawei.com>
The feature gate gets locked to "true", with the goal to remove it in two
releases.
All code now can assume that the feature is enabled. Tests for "feature
disabled" are no longer needed and get removed.
Some code wasn't using the new helper functions yet. That gets changed while
touching those lines.
The name concatenation and ownership check were originally considered small
enough to not warrant dedicated functions, but the intent of the code is more
readable with them.
There also was a missing owner check in the attach controller.
During volume detach, the following might happen in the reconciler:
1. Pod is deleting
2. Volume is removed from reportedAsAttached, so the node status updater
will update the volumesAttached list
3. Detach fails due to some issue
4. Volume is added back to reportedAsAttached
5. Reconciler loops over the volume again and removes it from
reportedAsAttached
6. Detach will not be triggered because of exponential backoff; the detach
call fails with an exponential backoff error
7. Another Pod is added which uses the same volume on the same node
8. Reconciler loops and will NOT try to trigger detach anymore
At this point, the volume is still attached and in the actual state of the
world, but the volumesAttached list in the node status no longer has this
volume, which will block volume mounts from kubelet.
The fix in the first round was to add the volume back into the list of
volumes that need to be reported as attached at step 6, when the detach
call fails with an exponential backoff error. However, this might cause
performance issues if detach keeps failing for a while: during this time,
the volume would keep being removed from and added back to the node
status, causing a surge of API calls.
So we changed the logic to first check whether the operation is safe to
retry, meaning there is no pending operation and it is not within the
exponential backoff period, before calling detach. This way we avoid
repeatedly removing and re-adding the volume in the node status.
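A minimal sketch of the reordered control flow, with all names
illustrative rather than the real reconciler API:

    package sketch

    // detachReconciler models only the pieces needed for the sketch.
    type detachReconciler struct {
        // safeToRetry reports that there is no pending operation for the
        // volume/node pair and that it is not in its exponential backoff
        // period.
        safeToRetry                  func(volume, node string) bool
        removeFromReportedAsAttached func(volume, node string)
        detach                       func(volume, node string) error
    }

    // maybeDetach only removes the volume from reportedAsAttached once the
    // detach will actually be attempted, so a volume stuck in backoff is not
    // repeatedly dropped from and re-added to the node status.
    func (r *detachReconciler) maybeDetach(volume, node string) error {
        if !r.safeToRetry(volume, node) {
            // Leave reportedAsAttached untouched; try again on a later pass.
            return nil
        }
        r.removeFromReportedAsAttached(volume, node)
        return r.detach(volume, node)
    }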
Change-Id: I5d4e760c880d72937d34b9d3e904ecad125f802e
Add the UIDs of Pods for which we are removing finalizers to an in-memory cache.
The controller removes UIDs from the cache as Pod updates or deletes come in.
This avoids double counting finished Pods when Pod updates arrive after Job status updates.
https://github.com/kubernetes/kubernetes/issues/105200
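A minimal sketch of such a cache, with illustrative names:

    package sketch

    import (
        "sync"

        "k8s.io/apimachinery/pkg/types"
    )

    // uidTracker records the Pods whose finalizers we just removed; UIDs are
    // dropped again once the corresponding Pod update or delete event is
    // observed, so the finished Pod is not counted twice.
    type uidTracker struct {
        mu   sync.Mutex
        uids map[types.UID]struct{}
    }

    func newUIDTracker() *uidTracker {
        return &uidTracker{uids: make(map[types.UID]struct{})}
    }

    func (t *uidTracker) add(uid types.UID) {
        t.mu.Lock()
        defer t.mu.Unlock()
        t.uids[uid] = struct{}{}
    }

    // observed is called from the Pod update/delete handlers; it reports
    // whether the UID was pending, i.e. the event reflects our own finalizer
    // removal.
    func (t *uidTracker) observed(uid types.UID) bool {
        t.mu.Lock()
        defer t.mu.Unlock()
        _, ok := t.uids[uid]
        delete(t.uids, uid)
        return ok
    }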
The bug could result in the EndpointSlice controller unnecessarily updating
EndpointSlices associated with a Service that had Topology Aware Hints enabled.
Doing a GET right before retrying has 2 problems:
- It can mask conflicts
- It adds an additional delay
As for retries, we are better off going through the sync backoff.
In the case of a conflict, we know that there was a Job update that would trigger another sync, so there is no need to do a rate-limited requeue.
The `root_ca_cert_publisher_sync_duration_seconds` metric tracks the sync
duration in the root CA cert publisher per code and namespace. In
clusters with a high namespace turnover (like CI clusters), this may
cause the kube-controller-manager to expose over 100k series to
Prometheus, which may cause degradation of that service.
Drop the `namespace` label to reduce the metric's cardinality; tracking
this metric by namespace does not justify the impact of keeping it.
When doing partial updates for uncountedTerminatedPods, the controller might have removed UIDs for Pods which still had finalizers.
Also make more space by removing UIDs that don't have finalizers at the beginning of the sync.
This PR adds the GA AnnStorageProvisioner annotation to
a PVC if the PVC requires dynamic provisioning. It
also deprecates the beta AnnStorageProvisioner annotation,
which will be removed in a later release.
All dependencies of VolumeBinding plugin from
"k8s.io/kubernetes/pkg/controller/volume/scheduling" package moved to
"k8s.io/kubernetes/pkg/scheduler/framework/plugins/volumebinding" package:
- whole file pkg/controller/volume/scheduling/scheduler_assume_cache.go
- whole file pkg/controller/volume/scheduling/scheduler_assume_cache_test.go
- whole file pkg/controller/volume/scheduling/scheduler_binder.go
- whole file pkg/controller/volume/scheduling/scheduler_binder_fake.go
- whole file pkg/controller/volume/scheduling/scheduler_binder_test.go
Package "k8s.io/kubernetes/pkg/controller/volume/scheduling/metrics" moved
to "k8s.io/kubernetes/pkg/scheduler/framework/plugins/volumebinding/metrics"
because it only used in VolumeBinding plugin and (e2e) tests.
More described in issue #89930 and PR #102953.
Signed-off-by: Konstantin Misyutin <konstantin.misyutin@huawei.com>