kubernetes

Author	SHA1	Message	Date
Moshe Levi	71d6e4d53c	kubelet metrics: add pod resources get metrics Signed-off-by: Moshe Levi <moshele@nvidia.com>	2023-03-14 19:33:03 +02:00
Kubernetes Prow Robot	c6f3007071	Merge pull request #115967 from harche/evented_pleg_metrics Graduate Evented PLEG to Beta	2023-03-10 17:34:40 -08:00
Kubernetes Prow Robot	a408be817f	Merge pull request #115972 from jsafrane/add-orphan-pod-metrics Add metric for failed orphan pod cleanup	2023-03-09 22:43:26 -08:00
Clayton Coleman	6b9a381185	kubelet: Force deleted pods can fail to move out of terminating If a CRI error occurs during the terminating phase after a pod is force deleted (API or static) then the housekeeping loop will not deliver updates to the pod worker which prevents the pod's state machine from progressing. The pod will remain in the terminating phase but no further attempts to terminate or cleanup will occur until the kubelet is restarted. The pod worker now maintains a store of the pods state that it is attempting to reconcile and uses that to resync unknown pods when SyncKnownPods() is invoked, so that failures in sync methods for unknown pods no longer hang forever. The pod worker's store tracks desired updates and the last update applied on podSyncStatuses. Each goroutine now synchronizes to acquire the next work item, context, and whether the pod can start. This synchronization moves the pending update to the stored last update, which will ensure third parties accessing pod worker state don't see updates before the pod worker begins synchronizing them. As a consequence, the update channel becomes a simple notifier (struct{}) so that SyncKnownPods can coordinate with the pod worker to create a synthetic pending update for unknown pods (i.e. no one besides the pod worker has data about those pods). Otherwise the pending update info would be hidden inside the channel. In order to properly track pending updates, we have to be very careful not to mix RunningPods (which are calculated from the container runtime and are missing all spec info) and config-sourced pods. Update the pod worker to avoid using ToAPIPod() and instead require the pod worker to directly use update.Options.Pod or update.Options.RunningPod for the correct methods. Add a new SyncTerminatingRuntimePod to prevent accidental invocations of runtime only pod data. 
Finally, fix SyncKnownPods to replay the last valid update for undesired pods which drives the pod state machine towards termination, and alter HandlePodCleanups to: terminate runtime pods that aren't known to the pod worker; and launch admitted pods that aren't known to the pod worker. Any started pods receive a replay until they reach the finished state, and then are removed from the pod worker. When a desired pod is detected as not being in the worker, the usual cause is that the pod was deleted and recreated with the same UID (almost always a static pod since API UID reuse is statistically unlikely). This simplifies the previous restartable pod support. We are careful to filter for active pods (those not already terminal or those which have been previously rejected by admission). We also force a refresh of the runtime cache to ensure we don't see an older version of the state. Future changes will allow other components that need to view the pod worker's actual state (not the desired state the podManager represents) to retrieve that info from the pod worker. Several bugs in pod lifecycle have been undetectable at runtime because the kubelet does not clearly describe the number of pods in use. To better report, add the following metrics: kubelet_desired_pods: Pods the pod manager sees; kubelet_active_pods: "Admitted" pods that gate new pods; kubelet_mirror_pods: Mirror pods the kubelet is tracking; kubelet_working_pods: Breakdown of pods from the last sync in each phase, orphaned state, and static or not; kubelet_restarted_pods_total: A counter for pods that saw a CREATE before the previous pod with the same UID was finished; kubelet_orphaned_runtime_pods_total: A counter for pods detected at runtime that were not known to the kubelet. Will be populated at Kubelet startup and should never be incremented after. Add a metric check to our e2e tests that verifies the values are captured correctly during a serial test, and then verify them in detail in unit tests. 
Adds 23 series to the kubelet /metrics endpoint.	2023-03-08 22:03:51 -06:00
Harshal Patil	412b4b3329	Add connection related metrics to EventedPLEG Signed-off-by: Harshal Patil <harpatil@redhat.com>	2023-03-01 11:35:27 -05:00
Jan Safranek	7bf9991389	Add metric for failed orphan pod cleanup	2023-02-22 18:43:38 +01:00
Swati Sehgal	bc941633c1	node: topology-mgr: add metric to measure topology mgr admission latency Signed-off-by: Swati Sehgal <swsehgal@redhat.com>	2023-02-15 13:59:47 +00:00
Kubernetes Prow Robot	4df945853e	Merge pull request #115137 from swatisehgal/topologymgr-metrics node: topologymgr: add metrics about admission requests and errors	2023-01-30 18:43:00 -08:00
Patrick Ohly	bc6c7fa912	logging: fix names of keys The stricter checking with the upcoming logcheck v0.4.1 pointed out these names which don't comply with our recommendations in https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/migration-to-structured-logging.md#name-arguments.	2023-01-23 14:24:29 +01:00
Swati Sehgal	172c55d310	node: topologymgr: add metrics about admission requests and errors Signed-off-by: Swati Sehgal <swsehgal@redhat.com>	2023-01-17 17:50:29 +00:00
Paco Xu	70e56fa71a	cleanup: EphemeralContainers feature gate related codes	2023-01-15 21:15:01 +08:00
Peter Hunt	1a7388c2ef	kubelet/metrics: add cri_metrics that pulls metrics from the CRI Signed-off-by: Peter Hunt <pehunt@redhat.com>	2022-11-08 14:47:08 -05:00
David Ashpole	64af1adace	Second attempt: Plumb context to Kubelet CRI calls (#113591) * plumb context from CRI calls through kubelet * clean up extra timeouts * try fixing incorrectly cancelled context	2022-11-05 06:02:13 -07:00
Kubernetes Prow Robot	1bf4af4584	Merge pull request #111930 from azylinski/new-histogram-pod_start_sli_duration_seconds New histogram: Pod start SLI duration	2022-11-04 07:28:14 -07:00
Francesco Romani	ff44dc1932	cpumanager: the FG is locked to default (ON) hence we can remove the if() guards, the feature is always available. Signed-off-by: Francesco Romani <fromani@redhat.com>	2022-11-02 18:41:41 +01:00
Antonio Ojea	9c2b333925	Revert "plumb context from CRI calls through kubelet" This reverts commit `f43b4f1b95`.	2022-11-02 13:37:23 +00:00
Kubernetes Prow Robot	9bbd0fbdb2	Merge pull request #113476 from marosset/hpc-to-stable Promoting WindowsHostProcessContainers to stable	2022-11-01 19:59:43 -07:00
Kubernetes Prow Robot	7b84436168	Merge pull request #113408 from dashpole/kubelet_context Plumb context to Kubelet CRI calls	2022-11-01 19:59:08 -07:00
Mark Rossetti	498d065cc5	Promoting WindowsHostProcessContainers to stable Signed-off-by: Mark Rossetti <marosset@microsoft.com>	2022-11-01 14:06:25 -07:00
David Ashpole	f43b4f1b95	plumb context from CRI calls through kubelet	2022-10-28 02:55:28 +00:00
Francesco Romani	47d3299781	node: metrics: cpumanager: add pinning metrics In order to improve the observability of the cpumanager, add and populate metrics to track if the combination of the kubelet configuration and podspec would trigger exclusive core allocation and pinning. We should avoid leaking any node/machine specific information (e.g. core ids, even though this is admittedly an extreme example); tracking these metrics seems to be a good first step, because it allows us to get feedback without exposing details. Signed-off-by: Francesco Romani <fromani@redhat.com>	2022-10-27 14:40:40 +02:00
Artur Żyliński	9f31669a53	New histogram: Pod start SLI duration	2022-10-26 11:28:17 +02:00
Kubernetes Prow Robot	9bcb81e13f	Merge pull request #113175 from liggitt/pr_normalize_probes_lifecycle_handlers Record event and metric for lifecycle fallback to http	2022-10-20 02:31:08 -07:00
Jordan Liggitt	a5d785fae8	Record metric for lifecycle fallback to http	2022-10-19 14:45:25 -04:00
Francesco Romani	ba6b468982	node: metrics: register podresources metrics Because of a bug in the commit `1e7bb20c52`, podresources metrics were added, they are updated in the right places, but they are never exported, so they cannot be consumed. Fix trivially registering the metrics. Signed-off-by: Francesco Romani <fromani@redhat.com>	2022-10-06 15:14:56 +02:00
Clayton Coleman	e9a5fb7372	kubelet: Record a metric for latency of pod status update Track how long it takes for pod updates to propagate from detection to successful change on API server. Will guide future improvements in pod start and shutdown latency. Metric is `kubelet_pod_status_sync_duration_seconds` and is ALPHA stability. Histogram buckets are chosen based on distribution of observed status delays in practice.	2022-09-08 12:17:44 -04:00
JunYang	c71e3a7802	When metrics are counted, discard the wrong container startup time metrics	2022-07-15 08:56:12 +08:00
JunYang	f33652ce61	Fix kubelet panic when accessing metrics/resource endpoint	2022-07-14 16:38:48 +08:00
Kubernetes Prow Robot	b0254c8a0b	Merge pull request #108758 from fengzixu/improvement-volume-health re-push "add volume kubelet_volume_stats_health_abnormal to kubelet #105585"	2022-03-29 17:35:34 -07:00
Kubernetes Prow Robot	5cb6fab8f6	Merge pull request #105585 from fengzixu/improvement-volume-health add volume kubelet_volume_stats_health_abnormal to kubelet	2022-03-17 01:32:38 +00:00
fengzixu	7d675381f8	fix: fix panic bug when volumeHealthStatus is nil	2022-03-17 01:32:24 +00:00
Maciej Borsz	aa95513982	Revert "add volume kubelet_volume_stats_health_abnormal to kubelet"	2022-03-16 13:44:09 +01:00
Kubernetes Prow Robot	1a5abe5d1f	Merge pull request #105585 from fengzixu/improvement-volume-health add volume kubelet_volume_stats_health_abnormal to kubelet	2022-03-15 05:58:11 -07:00
Shiming Zhang	5eb3e88f6b	Support metrics for node shutdown	2022-03-11 17:31:10 +08:00
fengzixu	9808ae48a0	change the volume health status metrics name	2022-01-23 02:44:10 +00:00
Sergey Kanzhelev	7e7bc6d53b	remove DynamicKubeletConfig logic from kubelet	2022-01-19 22:38:04 +00:00
fengzixu	5d544d3f01	fix comment	2022-01-11 14:28:31 +00:00
fengzixu	f96449f2e2	fix unit test	2022-01-11 13:50:18 +00:00
fengzixu	e2b5b5465a	improve metrics comment	2022-01-11 13:50:18 +00:00
fengzixu	c1a58d715c	fix unit test	2022-01-11 13:50:18 +00:00
fengzixu	5593e27429	improve metrics comment	2022-01-11 13:50:18 +00:00
fengzixu	1cdc694ac2	fix unit test	2022-01-11 13:50:18 +00:00
fengzixu	4a72f08a28	add useful comment for volume stats metrics	2022-01-11 13:50:18 +00:00
fengzixu	ed7fd0ced5	add volumeHealth label to metrics	2022-01-11 13:50:17 +00:00
fengzixu	bab1755274	fix: correct metrics expression	2022-01-11 13:50:17 +00:00
fengzixu	d71e21e01e	add volume kubelet_volume_stats_health_abnormal to kubelet	2022-01-11 13:50:17 +00:00
Kubernetes Prow Robot	19591a1324	Merge pull request #105829 from yuanchen8911/master Fix and improve comments on kubelet metrics	2022-01-04 23:02:32 -08:00
Davanum Srinivas	9405e9b55e	Check in OWNERS modified by update-yamlfmt.sh Signed-off-by: Davanum Srinivas <davanum@gmail.com>	2021-12-09 21:31:26 -05:00
Elana Hashman	b35c500541	Revert "Bump DynamicKubeConfig metric deprecation to 1.23"	2021-11-17 11:48:49 -08:00
Mark Rossetti	ef324d6bbd	Adding kubelet metrics for started and failed to start HostProcess containers Signed-off-by: Mark Rossetti <marosset@microsoft.com>	2021-11-04 14:39:57 -07:00

190 Commits