Commit Graph

142 Commits

Author SHA1 Message Date
Kubernetes Prow Robot
cbfebf02e8
Merge pull request #121720 from aojea/first_pod_network_startup
kubelet: add internal metric for the first pod with network latency
2024-02-22 07:13:25 -08:00
Kubernetes Prow Robot
0f7cc6fcaa
Merge pull request #121778 from Tal-or/mm_metrics
kubelet: memorymanager: metrics: add metrics about static allocation
2024-02-20 09:41:50 -08:00
Kubernetes Prow Robot
5d776f935c
Merge pull request #123345 from haircommander/image-gc-metric-reason
KEP-4210: kubelet: add reason field to image gc metric
2024-02-19 18:56:59 -08:00
AxeZhan
c74ec3df09 graduate PodLifecycleSleepAction to beta 2024-02-19 19:40:52 +08:00
Peter Hunt
c8b4d8ebed kubelet: add reason field to image gc metric
Signed-off-by: Peter Hunt <pehunt@redhat.com>
2024-02-16 16:02:41 -05:00
Kubernetes Prow Robot
14f8f5519d
Merge pull request #121719 from ruiwen-zhao/metric-size
Add image pull duration metric with bucketed image size
2024-02-13 16:23:50 -08:00
ruiwen-zhao
0f5cf6c1cd Add image pull duration metric with bucketed image size
Signed-off-by: ruiwen-zhao <ruiwen@google.com>
2024-02-08 00:30:31 +00:00
carlory
55c5db172e lock GA feature-gate ConsistentHTTPGetHandlers to default 2024-01-04 15:12:08 +08:00
Antonio Ojea
b8533f7976 kubelet: add metric for the first pod with network latency
The latency of the first pod with network impacts user workloads;
however, it is difficult to understand where this latency comes from,
since it depends on the CNI plugin being ready at the moment of
pod creation.

Add a new internal metric in the kubelet that allows developers and cluster
administrators to understand the source of latency problems on
node startup.

kubelet_first_network_pod_start_sli_duration_seconds

Change-Id: I4cdb55b0df72c96a3a65b78ce2aae404c5195006
2023-11-15 06:09:49 +00:00
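The commit above describes a metric that captures how long a node takes to start its first pod that needs the pod network. A minimal sketch of the record-once idea, in plain Go with no metrics library, might look like the following. The type and function names are hypothetical and are not taken from the kubelet source; skipping host-network pods is an assumption made here because such pods do not wait for the CNI plugin.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// firstNetworkPodTracker keeps one observation: the time from node start
// until the first pod that uses the pod network has started.
// Hypothetical type, for illustration only.
type firstNetworkPodTracker struct {
	mu       sync.Mutex
	start    time.Time
	observed time.Duration
	recorded bool
}

func newFirstNetworkPodTracker(start time.Time) *firstNetworkPodTracker {
	return &firstNetworkPodTracker{start: start}
}

// PodStarted is called each time a pod starts. Host-network pods are
// skipped (assumption), and only the first qualifying call is recorded.
func (t *firstNetworkPodTracker) PodStarted(hostNetwork bool, now time.Time) {
	if hostNetwork {
		return
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.recorded {
		return
	}
	t.observed = now.Sub(t.start)
	t.recorded = true
}

// Observed returns the recorded latency and whether one was recorded.
func (t *firstNetworkPodTracker) Observed() (time.Duration, bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.observed, t.recorded
}

// demo feeds three pod starts and returns the recorded latency in seconds.
func demo() (float64, bool) {
	start := time.Unix(0, 0)
	tr := newFirstNetworkPodTracker(start)
	tr.PodStarted(true, start.Add(2*time.Second))  // host network: ignored
	tr.PodStarted(false, start.Add(5*time.Second)) // first network pod
	tr.PodStarted(false, start.Add(9*time.Second)) // later pods: ignored
	d, ok := tr.Observed()
	return d.Seconds(), ok
}

func main() {
	secs, ok := demo()
	fmt.Println(secs, ok)
}
```

In a real exporter the single recorded value would be fed to a histogram or gauge; the sketch only shows the "first qualifying pod wins" logic.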
Talor Itzhak
ddd60de3f3 memorymanager:metrics: add metrics
As part of the memory manager GA graduation effort, we should add
metrics in order to improve observability.

The metrics are also mentioned in the PR https://github.com/kubernetes/enhancements/pull/4251 (which has not been merged yet)

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-11-12 09:34:55 +02:00
ruiwen-zhao
1165609036 Add metric for e2e pod startup latency including image pull
Signed-off-by: ruiwen-zhao <ruiwen@google.com>
2023-10-25 20:34:17 +00:00
Kubernetes Prow Robot
12b01aff1b
Merge pull request #121275 from haircommander/image-max-gc
KEP-4210: add support for ImageMaximumGCAge field
2023-10-25 21:29:10 +02:00
Kubernetes Prow Robot
f82670d8ec
Merge pull request #120680 from ruiwen-zhao/pod-start-bucket
Use a wider-range of metric buckets for PodStartDuration
2023-10-25 20:16:34 +02:00
Peter Hunt
49c947ba15 metrics: add and use ImageGarbageCollectedTotal
to help find MaxAge thresholds and detect image addition/removal thrashing

Signed-off-by: Peter Hunt <pehunt@redhat.com>
2023-10-20 12:23:31 -04:00
ruiwen-zhao
9b50af1f4f Use a wider-range of metric buckets for PodStartDuration
Signed-off-by: ruiwen-zhao <ruiwen@google.com>
2023-09-14 21:32:14 +00:00
Qiutong Song
d3eb082568 Create a node startup latency tracker
Signed-off-by: Qiutong Song <songqt01@gmail.com>
2023-09-11 05:54:25 +00:00
Francesco Romani
01c3a51a78 node: podresources: getallocatable: move to GA
lock the feature gate to GA, and remove the now-redundant code.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2023-07-12 14:11:22 +02:00
Kubernetes Prow Robot
cfeb83d56b
Merge pull request #116525 from ffromani/kubelet-podresources-endpoint-ga
node: podresources: graduate to GA
2023-05-25 16:38:50 -07:00
Mark Rossetti
ab9c8eb1e8
Removing WindowsHostProcessContainers feature-gate
Signed-off-by: Mark Rossetti <marosset@microsoft.com>
2023-05-01 13:30:38 -07:00
Francesco Romani
69bc685556 node: podresources: graduate to GA
Lock the feature gate to ON and simplify the code
accordingly.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2023-05-01 16:23:28 +02:00
Moshe Levi
71d6e4d53c kubelet metrics: add pod resources get metrics
Signed-off-by: Moshe Levi <moshele@nvidia.com>
2023-03-14 19:33:03 +02:00
Kubernetes Prow Robot
c6f3007071
Merge pull request #115967 from harche/evented_pleg_metrics
Graduate Evented PLEG to Beta
2023-03-10 17:34:40 -08:00
Kubernetes Prow Robot
a408be817f
Merge pull request #115972 from jsafrane/add-orphan-pod-metrics
Add metric for failed orphan pod cleanup
2023-03-09 22:43:26 -08:00
Clayton Coleman
6b9a381185
kubelet: Force deleted pods can fail to move out of terminating
If a CRI error occurs during the terminating phase after a pod is
force deleted (API or static) then the housekeeping loop will not
deliver updates to the pod worker which prevents the pod's state
machine from progressing. The pod will remain in the terminating
phase but no further attempts to terminate or clean up will occur
until the kubelet is restarted.

The pod worker now maintains a store of the pod state that it is
attempting to reconcile and uses that to resync unknown pods when
SyncKnownPods() is invoked, so that failures in sync methods for
unknown pods no longer hang forever.

The pod worker's store tracks desired updates and the last update
applied on podSyncStatuses. Each goroutine now synchronizes to
acquire the next work item, context, and whether the pod can start.
This synchronization moves the pending update to the stored last
update, which will ensure third parties accessing pod worker state
don't see updates before the pod worker begins synchronizing them.

As a consequence, the update channel becomes a simple notifier
(struct{}) so that SyncKnownPods can coordinate with the pod worker
to create a synthetic pending update for unknown pods (i.e. no one
besides the pod worker has data about those pods). Otherwise the
pending update info would be hidden inside the channel.

In order to properly track pending updates, we have to be very
careful not to mix RunningPods (which are calculated from the
container runtime and are missing all spec info) and config-
sourced pods. Update the pod worker to avoid using ToAPIPod()
and instead require the pod worker to directly use
update.Options.Pod or update.Options.RunningPod for the
correct methods. Add a new SyncTerminatingRuntimePod to prevent
accidental invocations of runtime only pod data.

Finally, fix SyncKnownPods to replay the last valid update for
undesired pods which drives the pod state machine towards
termination, and alter HandlePodCleanups to:

- terminate runtime pods that aren't known to the pod worker
- launch admitted pods that aren't known to the pod worker

Any started pods receive a replay until they reach the finished
state, and then are removed from the pod worker. When a desired
pod is detected as not being in the worker, the usual cause is
that the pod was deleted and recreated with the same UID (almost
always a static pod since API UID reuse is statistically
unlikely). This simplifies the previous restartable pod support.
We are careful to filter for active pods (those not already
terminal or those which have been previously rejected by
admission). We also force a refresh of the runtime cache to
ensure we don't see an older version of the state.

Future changes will allow other components that need to view the
pod worker's actual state (not the desired state the podManager
represents) to retrieve that info from the pod worker.

Several bugs in pod lifecycle have been undetectable at runtime
because the kubelet does not clearly describe the number of pods
in use. To better report, add the following metrics:

  kubelet_desired_pods: Pods the pod manager sees
  kubelet_active_pods: "Admitted" pods that gate new pods
  kubelet_mirror_pods: Mirror pods the kubelet is tracking
  kubelet_working_pods: Breakdown of pods from the last sync in
    each phase, orphaned state, and static or not
  kubelet_restarted_pods_total: A counter for pods that saw a
    CREATE before the previous pod with the same UID was finished
  kubelet_orphaned_runtime_pods_total: A counter for pods detected
    at runtime that were not known to the kubelet. Will be
    populated at Kubelet startup and should never be incremented
    after.

Add a metric check to our e2e tests that verifies the values are
captured correctly during a serial test, and then verify them in
detail in unit tests.

Adds 23 series to the kubelet /metrics endpoint.
2023-03-08 22:03:51 -06:00
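The commit above explains that the update channel becomes a plain notifier (struct{}) while the pending update lives in a store guarded by a lock, and that starting work moves the pending update to the last applied update. A rough sketch of that pattern, under the assumption of a single pod and a single worker, could look like this. All type and method names are hypothetical and do not mirror the kubelet's pod worker API.

```go
package main

import (
	"fmt"
	"sync"
)

// update is a hypothetical stand-in for the data carried by a pod update.
type update struct{ generation int }

// podState holds the pending update and the update the worker last began
// syncing. The channel carries no data; it only signals "there is work".
type podState struct {
	mu      sync.Mutex
	pending *update
	active  *update
	notify  chan struct{}
}

func newPodState() *podState {
	return &podState{notify: make(chan struct{}, 1)}
}

// Enqueue stores the newest desired update and wakes the worker.
// A newer update overwrites an older one that has not been started.
func (s *podState) Enqueue(u update) {
	s.mu.Lock()
	s.pending = &u
	s.mu.Unlock()
	select {
	case s.notify <- struct{}{}:
	default: // a wake-up is already queued, nothing more to do
	}
}

// startWork moves pending to active under the lock, so other readers see
// an update as active only once the worker has begun syncing it.
func (s *podState) startWork() (update, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.pending == nil {
		return update{}, false
	}
	s.active, s.pending = s.pending, nil
	return *s.active, true
}

// run drains notifications until the channel is closed and returns the
// generations it synced, in order.
func (s *podState) run() []int {
	var synced []int
	for range s.notify {
		if u, ok := s.startWork(); ok {
			synced = append(synced, u.generation)
		}
	}
	return synced
}

// demo enqueues two updates before the worker runs; they coalesce.
func demo() []int {
	s := newPodState()
	s.Enqueue(update{generation: 1})
	s.Enqueue(update{generation: 2})
	close(s.notify)
	return s.run()
}

func main() {
	fmt.Println(demo())
}
```

Because the pending update sits in the store rather than inside the channel, another component could create a synthetic pending update and send a notification, which is the coordination the commit message attributes to SyncKnownPods.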
Harshal Patil
412b4b3329 Add connection related metrics to EventedPLEG
Signed-off-by: Harshal Patil <harpatil@redhat.com>
2023-03-01 11:35:27 -05:00
Jan Safranek
7bf9991389 Add metric for failed orphan pod cleanup 2023-02-22 18:43:38 +01:00
Swati Sehgal
bc941633c1 node: topology-mgr: add metric to measure topology mgr admission latency
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-02-15 13:59:47 +00:00
Kubernetes Prow Robot
4df945853e
Merge pull request #115137 from swatisehgal/topologymgr-metrics
node: topologymgr: add metrics about admission requests and errors
2023-01-30 18:43:00 -08:00
Swati Sehgal
172c55d310 node: topologymgr: add metrics about admission requests and errors
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-01-17 17:50:29 +00:00
Paco Xu
70e56fa71a cleanup: EphemeralContainers feature gate related codes 2023-01-15 21:15:01 +08:00
Kubernetes Prow Robot
1bf4af4584
Merge pull request #111930 from azylinski/new-histogram-pod_start_sli_duration_seconds
New histogram: Pod start SLI duration
2022-11-04 07:28:14 -07:00
Francesco Romani
ff44dc1932 cpumanager: the FG is locked to default (ON)
hence we can remove the if() guards, the feature
is always available.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2022-11-02 18:41:41 +01:00
Mark Rossetti
498d065cc5
Promoting WindowsHostProcessContainers to stable
Signed-off-by: Mark Rossetti <marosset@microsoft.com>
2022-11-01 14:06:25 -07:00
Francesco Romani
47d3299781 node: metrics: cpumanager: add pinning metrics
In order to improve the observability of the cpumanager,
add and populate metrics to track if the combination of
the kubelet configuration and podspec would trigger
exclusive core allocation and pinning.

We should avoid leaking any node/machine specific information
(e.g. core ids, even though this is admittedly an extreme example);
tracking these metrics seems to be a good first step, because
it allows us to get feedback without exposing details.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2022-10-27 14:40:40 +02:00
Artur Żyliński
9f31669a53 New histogram: Pod start SLI duration 2022-10-26 11:28:17 +02:00
Kubernetes Prow Robot
9bcb81e13f
Merge pull request #113175 from liggitt/pr_normalize_probes_lifecycle_handlers
Record event and metric for lifecycle fallback to http
2022-10-20 02:31:08 -07:00
Jordan Liggitt
a5d785fae8
Record metric for lifecycle fallback to http 2022-10-19 14:45:25 -04:00
Francesco Romani
ba6b468982 node: metrics: register podresources metrics
Because of a bug in commit 1e7bb20c52, the podresources metrics
were added and are updated in the right places, but they are never
exported, so they cannot be consumed.
Fix this by registering the metrics.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2022-10-06 15:14:56 +02:00
Clayton Coleman
e9a5fb7372
kubelet: Record a metric for latency of pod status update
Track how long it takes for pod updates to propagate from detection
to successful change on API server. Will guide future improvements
in pod start and shutdown latency.

Metric is `kubelet_pod_status_sync_duration_seconds` and is ALPHA
stability. Histogram buckets are chosen based on distribution of
observed status delays in practice.
2022-09-08 12:17:44 -04:00
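The commit above records a latency histogram whose buckets were chosen from observed delays. As an illustration of how bucketed latency observations work in general, here is a small stdlib-only Go sketch with cumulative "less than or equal" buckets. The bucket bounds and all names are made up for the example; they are not the buckets used by kubelet_pod_status_sync_duration_seconds.

```go
package main

import (
	"fmt"
	"sort"
)

// histogram counts observations per bucket. bounds are inclusive upper
// limits in ascending order; the final count slot is the +Inf bucket.
type histogram struct {
	bounds []float64
	counts []uint64
	sum    float64
}

func newHistogram(bounds []float64) *histogram {
	b := append([]float64(nil), bounds...)
	sort.Float64s(b)
	return &histogram{bounds: b, counts: make([]uint64, len(b)+1)}
}

// Observe places v in the first bucket whose upper bound is >= v,
// or in the +Inf bucket if v exceeds every bound.
func (h *histogram) Observe(v float64) {
	i := sort.SearchFloat64s(h.bounds, v)
	h.counts[i]++
	h.sum += v
}

// Cumulative returns running totals, so each bucket also includes every
// observation that fell into a smaller bucket.
func (h *histogram) Cumulative() []uint64 {
	out := make([]uint64, len(h.counts))
	var total uint64
	for i, c := range h.counts {
		total += c
		out[i] = total
	}
	return out
}

// demo observes a few hypothetical status sync delays, in seconds.
func demo() []uint64 {
	h := newHistogram([]float64{0.01, 0.1, 1, 10}) // illustrative bounds
	for _, v := range []float64{0.005, 0.05, 0.5, 0.5, 30} {
		h.Observe(v)
	}
	return h.Cumulative()
}

func main() {
	fmt.Println(demo())
}
```

Choosing bounds close to where real observations cluster keeps the buckets informative, which is the reason the commit gives for deriving them from delays seen in practice.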
Kubernetes Prow Robot
b0254c8a0b
Merge pull request #108758 from fengzixu/improvement-volume-health
re-push "add volume kubelet_volume_stats_health_abnormal to kubelet #105585"
2022-03-29 17:35:34 -07:00
Kubernetes Prow Robot
5cb6fab8f6 Merge pull request #105585 from fengzixu/improvement-volume-health
add volume kubelet_volume_stats_health_abnormal to kubelet
2022-03-17 01:32:38 +00:00
Maciej Borsz
aa95513982
Revert "add volume kubelet_volume_stats_health_abnormal to kubelet" 2022-03-16 13:44:09 +01:00
Kubernetes Prow Robot
1a5abe5d1f
Merge pull request #105585 from fengzixu/improvement-volume-health
add volume kubelet_volume_stats_health_abnormal to kubelet
2022-03-15 05:58:11 -07:00
Shiming Zhang
5eb3e88f6b Support metrics for node shutdown 2022-03-11 17:31:10 +08:00
fengzixu
9808ae48a0 change the volume health status metrics name 2022-01-23 02:44:10 +00:00
Sergey Kanzhelev
7e7bc6d53b remove DynamicKubeletConfig logic from kubelet 2022-01-19 22:38:04 +00:00
fengzixu
bab1755274 fix: correct metrics expression 2022-01-11 13:50:17 +00:00
fengzixu
d71e21e01e add volume kubelet_volume_stats_health_abnormal to kubelet 2022-01-11 13:50:17 +00:00
Kubernetes Prow Robot
19591a1324
Merge pull request #105829 from yuanchen8911/master
Fix and improve comments on kubelet metrics
2022-01-04 23:02:32 -08:00
Elana Hashman
b35c500541
Revert "Bump DynamicKubeConfig metric deprecation to 1.23" 2021-11-17 11:48:49 -08:00