Automatic merge from submit-queue (batch tested with PRs 58184, 59307, 58172). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Add annotations to the device plugin API
**What this PR does / why we need it**:
**Which issue(s) this PR fixes** : Related to #56649 but does not fix it
This adds the ability for the device plugins to annotate containers.
Product wise, this allows the NVIDIA device plugin to support CRI-O (which allows hooks through container annotations).
**Special notes for your reviewer**:
/area hw-accelerators
/cc @vishh @jiayingz @vikaschoudhary16
I'm wondering if it would make sense to fire a blank call to `newContainerAnnotations` at the start of the deviceplugin to get Annotations that are forbidden.
Current behavior is that any Annotations that conflicts with Kubelet will be overwritten by Kubelet.
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 57973, 57990). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Set pids limit at pod level
**What this PR does / why we need it**:
Add a new Alpha Feature to set a maximum number of pids per Pod.
This is to allow the use case where cluster administrators wish
to limit the pids consumed per pod (example when running a CI system).
By default, we do not set any maximum limit, If an administrator wants
to enable this, they should enable `SupportPodPidsLimit=true` in the
`--feature-gates=` parameter to kubelet and specify the limit using the
`--pod-max-pids` parameter.
The limit set is the total count of all processes running in all
containers in the pod.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes#43783
**Special notes for your reviewer**:
**Release note**:
```release-note
New alpha feature to limit the number of processes running in a pod. Cluster administrators will be able to place limits by using the new kubelet command line parameter --pod-max-pids. Note that since this is a alpha feature they will need to enable the "SupportPodPidsLimit" feature.
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Handle Unhealthy devices
Update node capacity with sum of both healthy and unhealthy devices.
Node allocatable reflect only healthy devices.
**What this PR does / why we need it**:
Currently node capacity only reflects healthy devices. Unhealthy devices are ignored totally while updating node status. This PR handles unhealthy devices while updating node status.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes#57241
**Special notes for your reviewer**:
**Release note**:
<!-- Write your release note:
Handle Unhealthy devices
```release-note
Handle Unhealthy devices
```
/cc @tengqm @ConnorDoyle @jiayingz @vishh @jeremyeder @sjenning @resouer @ScorpioCPH @lichuqiang @RenaudWasTaken @balajismaniam
/sig node
Add a new Alpha Feature to set a maximum number of pids per Pod.
This is to allow the use case where cluster administrators wish
to limit the pids consumed per pod (example when running a CI system).
By default, we do not set any maximum limit, If an administrator wants
to enable this, they should enable `SupportPodPidsLimit=true` in the
`--feature-gates=` parameter to kubelet and specify the limit using the
`--pod-max-pids` parameter.
The limit set is the total count of all processes running in all
containers in the pod.
This moves plugin/pkg/scheduler to pkg/scheduler and
plugin/cmd/kube-scheduler to cmd/kube-scheduler.
Bulk of the work was done with gomvpkg, except for kube-scheduler main
package.
- Instead of using cm.capacity field to communicate device plugin resource
capacity, this PR changes to use an explicit cm.GetDevicePluginResourceCapacity()
function that returns device plugin resource capacity as well as any inactive
device plugin resource. Kubelet syncNodeStatus call this function during its
periodic run to update node status capacity and allocatable. After this call,
device plugin can remove the inactive device plugin resource from its allDevices
field as the update is already pushed to API server.
- Extends device plugin checkpoint data to record registered resources
so that we can finish resource removing even upon kubelet restarts.
- Passes sourcesReady from kubelet to device plugin to avoid removing
inactive pods during grace period of kubelet restart.
Automatic merge from submit-queue (batch tested with PRs 54837, 55970, 55912, 55898, 52977). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Improve kubelet cgroup
**What this PR does / why we need it**:
1.Use arg cgroupRoot,not nodeConfig.CgroupRoot
Using both arg cgroupRoot and nodeConfig.CgroupRoot is confused in function NewQOSContainerManager
2.improve cgroupmanager in qosContainerManager
3. improve arg "cgroupRoot" type in NewQOSContainerManager
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
```
While moving device_plugin_handler_test.go from pkg/kubelet/cm/ to
pkg/kubelet/cm/deviceplugin/, we can no longer uses cm in its tests
because that would cause a cycle dependency. To solve this problem,
I moved the main cm GetResources functionality as well as part of the
current device plugin handler Allocate functionality into a new device
plugin handler function, GetDeviceRunContainerOptions(). This
refactoring is also needed by another PR 51895 that moves device
allocation into admission phase. Now device plugin handler Allocate()
first checks whether there is cached device runtime state and only
issues Allocate grpc call if there is no cached state available.
The new GetDeviceRunContainerOptions() function simply returns device
runtime config from the cached state. To support this change, extended the
podDevices struct and checkpoint data structure with device runtime state.
There is no use case known for passing artifacts per device as it currently exists. The current API is also
complex to use for simple clients. Hence this PR creates a flat namespace where artifacts like environment variables
and mount points apply globally to all devices returned as part of AllocateResponse proto.
Signed-off-by: Vishnu kannan <vishnuk@google.com>
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
fix typos: remove duplicated word in comments
**What this PR does / why we need it**: Remove the duplicated word `the` in comments
**Which issue this PR fixes** : fixes #
**Special notes for your reviewer**:
```release-note
NONE
```
* Deprecate the old experimental-fail-swap-on
* Add a new flag fail-swap-on and set it to true
Before this change, we would not fail when swap is on. With this
change we fail for everyone when swap is on, unless they explicitly
set --fail-swap-on to false.
In Container manager, we set up the capacity by retrieving information
from cadvisor. However unlike machineinfo, filesystem information is
available at a later unknown time. This PR uses a go routine to keep
retriving the information until it is avaialble or timeout.
Centralize Capacity discovery of standard resources in Container manager.
Have storage derive node capacity from container manager.
Move certain cAdvisor interfaces to the cAdvisor package in the process.
This patch fixes a bug in container manager where it was writing to a map without synchronization.
Signed-off-by: Vishnu kannan <vishnuk@google.com>
Automatic merge from submit-queue (batch tested with PRs 43005, 46660, 46385, 46991, 47103)
Consolidate sysctl commands for kubelet
**What this PR does / why we need it**:
These commands are important enough to be in the Kubelet itself.
By default, Ubuntu 14.04 and Debian Jessie have these set to 200 and
20000. Without this setting, nodes are limited in the number of
containers that they can start.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes#26005
**Special notes for your reviewer**:
I had a difficult time writing tests for this. It is trivial to create a fake sysctl for testing, but the Kubelet does not have any tests for the prior settings.
**Release note**:
```release-note
```
These commands are important enough to be in the Kubelet itself.
By default, Ubuntu 14.04 and Debian Jessie have these set to 200 and
20000. Without this setting, nodes are limited in the number of
containers that they can start.
Automatic merge from submit-queue
BestEffort QoS class has min cpu shares
**What this PR does / why we need it**:
BestEffort QoS class is given the minimum amount of CPU shares per the QoS design.
Provides an opt-in flag, --experimental-fail-swap-on (and corresponding
KubeletConfiguration value, ExperimentalFailSwapOn), which is false by default.
This change add a container manager inside the dockershim to move docker daemon
and associated processes to a specified cgroup. The original kubelet container
manager will continue checking the name of the cgroup, so that kubelet know how
to report runtime stats.
Automatic merge from submit-queue
Redundant code process for container_mananger start
1. need not sum the total numEnsureStateFuncs
2. numEnsureStateFuncs should > 0, otherwise, calculate numEnsureStateFuncs would be not neccessary
Automatic merge from submit-queue
Remove kubelet pkill dependency
Issue #26093 identified pkill as one of the dependencies of kublet
which could be worked around. Build on the code introduced for pidof
and regexp for the process(es) we need to send a signal to.
Related to #26093
Issue #26093 identified pkill as one of the dependencies of kublet
which could be worked around. Build on the code introduced for pidof
and regexp for the process(es) we need to send a signal to.
Related to #26093
Issue #26093 identified pidof as one of the dependencies of kublet
which could be worked around. In this PR, we just look at /proc
to construct the list of pids we need for a specified process
instead of running "pidof" executable
Related to #26093
Automatic merge from submit-queue
Fix system container detection in kubelet on systemd
```release-note
Fix system container detection in kubelet on systemd.
This fixed environments where CPU and Memory Accounting were not enabled on the unit
that launched the kubelet or docker from reporting the root cgroup when
monitoring usage stats for those components.
```
Fixes https://github.com/kubernetes/kubernetes/issues/25909
/cc @kubernetes/sig-node @kubernetes/rh-cluster-infra @vishh @dchen1107
Check docker's pid file, then fallback to pidof when trying to determine the pid for docker. The
latest docker RPM for RHEL changes /usr/bin/docker from an executable to a shell script (to support
/usr/bin/docker-current and /usr/bin/docker-latest). The pidof check for docker fails in this case,
so we check /var/run/docker.pid first (the default location), and fallback to pidof if that fails.
`--kubelet-cgroups` and `--system-cgroups` respectively.
Updated `--runtime-container` to `--runtime-cgroups`.
Cleaned up most of the kubelet code that consumes these flags to match
the flag name changes.
Signed-off-by: Vishnu kannan <vishnuk@google.com>