As already mentioned in this issue https://github.com/kubernetes/kubernetes/issues/79286, some metrics like
"running_pod_count" and "running_container_count" uses non-standard prometheus metrics, this change converts them to be
standard prometheus gauges
Minor refactor in kubelet/pleg/generic.go and added some test for ruuning container and running pod metrics
Fixed issues related to github CI pipeline failure
* Updated bazel for new deps
* Add comment for exported metrics variables,RuuningContainerCount and RunningPodCount
* Specify keys explicitly in Guage metric instantation
Fix go lint errors
Replace "+=1" with "++", as reported by go lint
Set container state as a label for the metrics "running_container_count"
As per the metrics name "running_container_count" it should "ideally" be showing
the number of containers in "running" state , but it was showing all the container count, irrespective of the state it is in.
This commit adds a new label "container_running_state" to the metrics "running_container_count", which doesn't change the base metrics but adds the
option to query the metrics with "container_state" such as "running"/"unknown/...
remove unused methods reported by staticcheck
Remove variables while instantiating gauge(vec) which are default set to nil
Convert kubelet metrics(running_pod_count and running_container_count) to standard gauges and added label to running_container_count metrics.
Currently kubelet metrics(running_pod_count and running_container_count) use non-standard prometheus collectors , this change
converts them to standard prometheus gauges. Also this adds a new label(container_state) to running_container_count which does a breakdown of
containers tracked by kubelet based on the containers' state(running/unknown/created/exited).
Set statbility explicitly for running_pod_count and running_container_count and reformat test
register metrics explicitly in test , so that they don't become no-op
buildPodRef creates a unique key with the {podName, namespace, UID}
tuple. By omitting the UID in the metric, duplicate metrics can be sent
to prometheus causing 500's on the /metrics endpoint.
CRI runtimes do not supply cpu nano core usage as it is not part of CRI
stats. However, there are upstream components that still rely on such
stats to function. The previous fix was faulty because the multiple
callers could compete and update the stats, causing
inconsistent/incoherent metrics. This change, instead, creates a
separate call for updating the usage, and rely on eviction manager,
which runs periodically, to trigger the updates. The caveat is that if
eviction manager is completley turned off, no one would compute the
usage.
1) Add suffix (`seconds` or `total`) to metric name
2) Switch Summary metric to Histogram metric (Summary metrics are not
supported completely by prometheus-to-sd and can't be aggregated.)
- Move from the old github.com/golang/glog to k8s.io/klog
- klog as explicit InitFlags() so we add them as necessary
- we update the other repositories that we vendor that made a similar
change from glog to klog
* github.com/kubernetes/repo-infra
* k8s.io/gengo/
* k8s.io/kube-openapi/
* github.com/google/cadvisor
- Entirely remove all references to glog
- Fix some tests by explicit InitFlags in their init() methods
Change-Id: I92db545ff36fcec83afe98f550c9e630098b3135
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Fix typo in volume_stats.go
**What this PR does / why we need it**:
While reviewing the implementation details I came across a typo in volume_stats.go
sed/volumeStatsCollecotr/volumeStatsCollector/
**Release note**:
```release-note
NONE
```
This PR exports config-releated metrics from the Kubelet.
The Guages for active, assigned, and last-known-good config can be used
to identify config versions and produce aggregate counts across several
nodes. The error-reporting Gauge can be used to determine whether a node
is experiencing a config-related error, and to prodouce an aggregate
count of nodes in an error state.