kubernetes

Author	SHA1	Message	Date
Kubernetes Prow Robot	cbfebf02e8	Merge pull request #121720 from aojea/first_pod_network_startup kubelet: add internal metric for the first pod with network latency	2024-02-22 07:13:25 -08:00
Kubernetes Prow Robot	0f7cc6fcaa	Merge pull request #121778 from Tal-or/mm_metrics kubelet: memorymanager: metrics: add metrics about static allocation	2024-02-20 09:41:50 -08:00
Kubernetes Prow Robot	5d776f935c	Merge pull request #123345 from haircommander/image-gc-metric-reason KEP-4210: kubelet: add reason field to image gc metric	2024-02-19 18:56:59 -08:00
AxeZhan	c74ec3df09	graduate PodLifecycleSleepAction to beta	2024-02-19 19:40:52 +08:00
Peter Hunt	c8b4d8ebed	kubelet: add reason field to image gc metric Signed-off-by: Peter Hunt <pehunt@redhat.com>	2024-02-16 16:02:41 -05:00
Kubernetes Prow Robot	14f8f5519d	Merge pull request #121719 from ruiwen-zhao/metric-size Add image pull duration metric with bucketed image size	2024-02-13 16:23:50 -08:00
ruiwen-zhao	0f5cf6c1cd	Add image pull duration metric with bucketed image size Signed-off-by: ruiwen-zhao <ruiwen@google.com>	2024-02-08 00:30:31 +00:00
carlory	55c5db172e	lock GA feature-gate ConsistentHTTPGetHandlers to default	2024-01-04 15:12:08 +08:00
Antonio Ojea	b8533f7976	kubelet: add metric for the first pod with network latency The first pod with network latency impacts user workloads, however, it is difficult to understand where is the problem of this latency, since it depends on the CNI plugin to be ready at the moment of the pod creation. Add a new internal metric in the kubelet that allows developers and cluster administrators to understand the source of the latency problems on node startups. kubelet_first_network_pod_start_sli_duration_seconds Change-Id: I4cdb55b0df72c96a3a65b78ce2aae404c5195006	2023-11-15 06:09:49 +00:00
Talor Itzhak	ddd60de3f3	memorymanager:metrics: add metrics As part of the memory manager GA graduation effort, we should add metrics in order to improve observability. The metrics are also mentioned in the PR https://github.com/kubernetes/enhancements/pull/4251 (which was not merged yet) Signed-off-by: Talor Itzhak <titzhak@redhat.com>	2023-11-12 09:34:55 +02:00
ruiwen-zhao	1165609036	Add metric for e2e pod startup latency including image pull Signed-off-by: ruiwen-zhao <ruiwen@google.com>	2023-10-25 20:34:17 +00:00
Kubernetes Prow Robot	12b01aff1b	Merge pull request #121275 from haircommander/image-max-gc KEP-4210: add support for ImageMaximumGCAge field	2023-10-25 21:29:10 +02:00
Kubernetes Prow Robot	f82670d8ec	Merge pull request #120680 from ruiwen-zhao/pod-start-bucket Use a wider-range of metric buckets for PodStartDuration	2023-10-25 20:16:34 +02:00
Peter Hunt	49c947ba15	metrics: add and use ImageGarbageCollectedTotal to help find MaxAge thresholds and detect image addition/removal thrashing Signed-off-by: Peter Hunt <pehunt@redhat.com>	2023-10-20 12:23:31 -04:00
ruiwen-zhao	9b50af1f4f	Use a wider-range of metric buckets for PodStartDuration Signed-off-by: ruiwen-zhao <ruiwen@google.com>	2023-09-14 21:32:14 +00:00
Qiutong Song	d3eb082568	Create a node startup latency tracker Signed-off-by: Qiutong Song <songqt01@gmail.com>	2023-09-11 05:54:25 +00:00
Francesco Romani	01c3a51a78	node: podresources: getallocatable: move to GA lock the feature gate to GA, and remove the now-redundant code. Signed-off-by: Francesco Romani <fromani@redhat.com>	2023-07-12 14:11:22 +02:00
Kubernetes Prow Robot	cfeb83d56b	Merge pull request #116525 from ffromani/kubelet-podresources-endpoint-ga node: podresources: graduate to GA	2023-05-25 16:38:50 -07:00
Mark Rossetti	ab9c8eb1e8	Removing WindowsHostProcessContainers feature-gate Signed-off-by: Mark Rossetti <marosset@microsoft.com>	2023-05-01 13:30:38 -07:00
Francesco Romani	69bc685556	node: podresources: graduate to GA Lock the feature gate to ON and simplify the code accordingly. Signed-off-by: Francesco Romani <fromani@redhat.com>	2023-05-01 16:23:28 +02:00
Moshe Levi	71d6e4d53c	kubelet metrics: add pod resources get metrics Signed-off-by: Moshe Levi <moshele@nvidia.com>	2023-03-14 19:33:03 +02:00
Kubernetes Prow Robot	c6f3007071	Merge pull request #115967 from harche/evented_pleg_metrics Graduate Evented PLEG to Beta	2023-03-10 17:34:40 -08:00
Kubernetes Prow Robot	a408be817f	Merge pull request #115972 from jsafrane/add-orphan-pod-metrics Add metric for failed orphan pod cleanup	2023-03-09 22:43:26 -08:00
Clayton Coleman	6b9a381185	kubelet: Force deleted pods can fail to move out of terminating If a CRI error occurs during the terminating phase after a pod is force deleted (API or static) then the housekeeping loop will not deliver updates to the pod worker which prevents the pod's state machine from progressing. The pod will remain in the terminating phase but no further attempts to terminate or cleanup will occur until the kubelet is restarted. The pod worker now maintains a store of the pods state that it is attempting to reconcile and uses that to resync unknown pods when SyncKnownPods() is invoked, so that failures in sync methods for unknown pods no longer hang forever. The pod worker's store tracks desired updates and the last update applied on podSyncStatuses. Each goroutine now synchronizes to acquire the next work item, context, and whether the pod can start. This synchronization moves the pending update to the stored last update, which will ensure third parties accessing pod worker state don't see updates before the pod worker begins synchronizing them. As a consequence, the update channel becomes a simple notifier (struct{}) so that SyncKnownPods can coordinate with the pod worker to create a synthetic pending update for unknown pods (i.e. no one besides the pod worker has data about those pods). Otherwise the pending update info would be hidden inside the channel. In order to properly track pending updates, we have to be very careful not to mix RunningPods (which are calculated from the container runtime and are missing all spec info) and config-sourced pods. Update the pod worker to avoid using ToAPIPod() and instead require the pod worker to directly use update.Options.Pod or update.Options.RunningPod for the correct methods. Add a new SyncTerminatingRuntimePod to prevent accidental invocations of runtime only pod data. 
Finally, fix SyncKnownPods to replay the last valid update for undesired pods which drives the pod state machine towards termination, and alter HandlePodCleanups to: - terminate runtime pods that aren't known to the pod worker - launch admitted pods that aren't known to the pod worker Any started pods receive a replay until they reach the finished state, and then are removed from the pod worker. When a desired pod is detected as not being in the worker, the usual cause is that the pod was deleted and recreated with the same UID (almost always a static pod since API UID reuse is statistically unlikely). This simplifies the previous restartable pod support. We are careful to filter for active pods (those not already terminal or those which have been previously rejected by admission). We also force a refresh of the runtime cache to ensure we don't see an older version of the state. Future changes will allow other components that need to view the pod worker's actual state (not the desired state the podManager represents) to retrieve that info from the pod worker. Several bugs in pod lifecycle have been undetectable at runtime because the kubelet does not clearly describe the number of pods in use. To better report, add the following metrics: kubelet_desired_pods: Pods the pod manager sees kubelet_active_pods: "Admitted" pods that gate new pods kubelet_mirror_pods: Mirror pods the kubelet is tracking kubelet_working_pods: Breakdown of pods from the last sync in each phase, orphaned state, and static or not kubelet_restarted_pods_total: A counter for pods that saw a CREATE before the previous pod with the same UID was finished kubelet_orphaned_runtime_pods_total: A counter for pods detected at runtime that were not known to the kubelet. Will be populated at Kubelet startup and should never be incremented after. Add a metric check to our e2e tests that verifies the values are captured correctly during a serial test, and then verify them in detail in unit tests. 
Adds 23 series to the kubelet /metrics endpoint.	2023-03-08 22:03:51 -06:00
Harshal Patil	412b4b3329	Add connection related metrics to EventedPLEG Signed-off-by: Harshal Patil <harpatil@redhat.com>	2023-03-01 11:35:27 -05:00
Jan Safranek	7bf9991389	Add metric for failed orphan pod cleanup	2023-02-22 18:43:38 +01:00
Swati Sehgal	bc941633c1	node: topology-mgr: add metric to measure topology mgr admission latency Signed-off-by: Swati Sehgal <swsehgal@redhat.com>	2023-02-15 13:59:47 +00:00
Kubernetes Prow Robot	4df945853e	Merge pull request #115137 from swatisehgal/topologymgr-metrics node: topologymgr: add metrics about admission requests and errors	2023-01-30 18:43:00 -08:00
Swati Sehgal	172c55d310	node: topologymgr: add metrics about admission requests and errors Signed-off-by: Swati Sehgal <swsehgal@redhat.com>	2023-01-17 17:50:29 +00:00
Paco Xu	70e56fa71a	cleanup: EphemeralContainers feature gate related codes	2023-01-15 21:15:01 +08:00
Kubernetes Prow Robot	1bf4af4584	Merge pull request #111930 from azylinski/new-histogram-pod_start_sli_duration_seconds New histogram: Pod start SLI duration	2022-11-04 07:28:14 -07:00
Francesco Romani	ff44dc1932	cpumanager: the FG is locked to default (ON) hence we can remove the if() guards, the feature is always available. Signed-off-by: Francesco Romani <fromani@redhat.com>	2022-11-02 18:41:41 +01:00
Mark Rossetti	498d065cc5	Promoting WindowsHostProcessContainers to stable Signed-off-by: Mark Rossetti <marosset@microsoft.com>	2022-11-01 14:06:25 -07:00
Francesco Romani	47d3299781	node: metrics: cpumanager: add pinning metrics In order to improve the observability of the cpumanager, add and populate metrics to track if the combination of the kubelet configuration and podspec would trigger exclusive core allocation and pinning. We should avoid leaking any node/machine specific information (e.g. core ids, even though this is admittedly an extreme example); tracking these metrics seems to be a good first step, because it allows us to get feedback without exposing details. Signed-off-by: Francesco Romani <fromani@redhat.com>	2022-10-27 14:40:40 +02:00
Artur Żyliński	9f31669a53	New histogram: Pod start SLI duration	2022-10-26 11:28:17 +02:00
Kubernetes Prow Robot	9bcb81e13f	Merge pull request #113175 from liggitt/pr_normalize_probes_lifecycle_handlers Record event and metric for lifecycle fallback to http	2022-10-20 02:31:08 -07:00
Jordan Liggitt	a5d785fae8	Record metric for lifecycle fallback to http	2022-10-19 14:45:25 -04:00
Francesco Romani	ba6b468982	node: metrics: register podresources metrics Because of a bug in the commit `1e7bb20c52`, podresources metrics were added, they are updated in the right places, but they are never exported, so they cannot be consumed. Fix trivially registering the metrics. Signed-off-by: Francesco Romani <fromani@redhat.com>	2022-10-06 15:14:56 +02:00
Clayton Coleman	e9a5fb7372	kubelet: Record a metric for latency of pod status update Track how long it takes for pod updates to propagate from detection to successful change on API server. Will guide future improvements in pod start and shutdown latency. Metric is `kubelet_pod_status_sync_duration_seconds` and is ALPHA stability. Histogram buckets are chosen based on distribution of observed status delays in practice.	2022-09-08 12:17:44 -04:00
Kubernetes Prow Robot	b0254c8a0b	Merge pull request #108758 from fengzixu/improvement-volume-health re-push "add volume kubelet_volume_stats_health_abnormal to kubelet #105585"	2022-03-29 17:35:34 -07:00
Kubernetes Prow Robot	5cb6fab8f6	Merge pull request #105585 from fengzixu/improvement-volume-health add volume kubelet_volume_stats_health_abnormal to kubelet	2022-03-17 01:32:38 +00:00
Maciej Borsz	aa95513982	Revert "add volume kubelet_volume_stats_health_abnormal to kubelet"	2022-03-16 13:44:09 +01:00
Kubernetes Prow Robot	1a5abe5d1f	Merge pull request #105585 from fengzixu/improvement-volume-health add volume kubelet_volume_stats_health_abnormal to kubelet	2022-03-15 05:58:11 -07:00
Shiming Zhang	5eb3e88f6b	Support metrics for node shutdown	2022-03-11 17:31:10 +08:00
fengzixu	9808ae48a0	change the volume health status metrics name	2022-01-23 02:44:10 +00:00
Sergey Kanzhelev	7e7bc6d53b	remove DynamicKubeletConfig logic from kubelet	2022-01-19 22:38:04 +00:00
fengzixu	bab1755274	fix: correct metrics expression	2022-01-11 13:50:17 +00:00
fengzixu	d71e21e01e	add volume kubelet_volume_stats_health_abnormal to kubelet	2022-01-11 13:50:17 +00:00
Kubernetes Prow Robot	19591a1324	Merge pull request #105829 from yuanchen8911/master Fix and improve comments on kubelet metrics	2022-01-04 23:02:32 -08:00
Elana Hashman	b35c500541	Revert "Bump DynamicKubeConfig metric deprecation to 1.23"	2021-11-17 11:48:49 -08:00

142 Commits