# Kubelet - Eviction Policy

**Author**: Derek Carr (@derekwaynecarr)

**Status**: Proposed

This document presents a specification for how the `kubelet` evicts pods when compute resources are too low.

## Goals

The node needs a mechanism to preserve stability when available compute resources are low. This is especially important when dealing with incompressible compute resources such as memory or disk. If either resource is exhausted, the node would become unstable.

The `kubelet` has some support for influencing system behavior in response to a system OOM by having the system OOM killer see higher OOM score adjustments for containers that have consumed the largest amount of memory relative to their request. System OOM events are very compute intensive, and can stall the node until the OOM killing process has completed. In addition, the system is prone to return to an unstable state, since the containers that are killed due to OOM are either restarted or a new pod is scheduled onto the node.

Instead, we would prefer a system where the `kubelet` can pro-actively monitor for and prevent total starvation of a compute resource, and in cases where starvation appears imminent, pro-actively fail one or more pods so the workload can be moved and scheduled elsewhere when/if its backing controller creates a new pod.

## Scope of proposal

This proposal defines a pod eviction policy for reclaiming compute resources. In the first iteration, it focuses on memory; later iterations are expected to cover other resources like disk. The proposal focuses on a simple default eviction strategy intended to cover the broadest class of user workloads.

## Eviction Signals

The `kubelet` will support the ability to trigger eviction decisions on the following signals.

| Eviction Signal  | Description                                                                      |
|------------------|----------------------------------------------------------------------------------|
| memory.available | memory.available := node.status.capacity[memory] - node.stats.memory.workingSet  |
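As a rough illustration of how this signal could be computed (a minimal sketch with hypothetical types, not the kubelet's actual code), assuming capacity and working set are already known in bytes:

```go
package main

import "fmt"

// nodeStats is a hypothetical, simplified view of data the kubelet already
// has: node capacity from the API object and working set usage from cAdvisor.
type nodeStats struct {
	memoryCapacityBytes   int64 // node.status.capacity[memory]
	memoryWorkingSetBytes int64 // node.stats.memory.workingSet
}

// memoryAvailable computes the memory.available eviction signal defined in
// the table above: capacity minus working set.
func memoryAvailable(s nodeStats) int64 {
	return s.memoryCapacityBytes - s.memoryWorkingSetBytes
}

func main() {
	s := nodeStats{
		memoryCapacityBytes:   8 << 30,    // 8Gi node
		memoryWorkingSetBytes: 7800 << 20, // ~7.6Gi in use
	}
	fmt.Printf("memory.available = %d bytes\n", memoryAvailable(s))
}
```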
## Eviction Thresholds

The `kubelet` will support the ability to specify eviction thresholds.

An eviction threshold is of the following form:

`<eviction-signal><operator><quantity>`

* valid `eviction-signal` tokens are as defined above.
* valid `operator` tokens are `<`.
* valid `quantity` tokens must match the quantity representation used by Kubernetes.

If threshold criteria are met, the `kubelet` will take pro-active action to attempt to reclaim the starved compute resource associated with the eviction signal.

The `kubelet` will support soft and hard eviction thresholds.

### Soft Eviction Thresholds

A soft eviction threshold pairs an eviction threshold with a required administrator-specified grace period. The `kubelet` takes no action to reclaim resources associated with the eviction signal until that grace period has been exceeded. If no grace period is provided, the `kubelet` will error on startup.

In addition, if a soft eviction threshold has been met, an operator can specify a maximum allowed pod termination grace period to use when evicting pods from the node. If specified, the `kubelet` will use the lesser of `pod.Spec.TerminationGracePeriodSeconds` and the maximum allowed grace period. If not specified, the `kubelet` will kill pods immediately with no graceful termination.

To configure soft eviction thresholds, the following flags will be supported:

```
--eviction-soft="": A set of eviction thresholds (e.g. memory.available<1.5Gi) that if met over a corresponding grace period would trigger a pod eviction.
--eviction-soft-grace-period="": A set of eviction grace periods (e.g. memory.available=1m30s) that correspond to how long a soft eviction threshold must hold before triggering a pod eviction.
--eviction-max-pod-grace-period="0": Maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
```

### Hard Eviction Thresholds

A hard eviction threshold has no grace period, and if observed, the `kubelet` will take immediate action to reclaim the associated starved resource. If a hard eviction threshold is met, the `kubelet` will kill the pod immediately with no graceful termination.

To configure hard eviction thresholds, the following flag will be supported:

```
--eviction-hard="": A set of eviction thresholds (e.g. memory.available<1Gi) that if met would trigger a pod eviction.
```

## Eviction Monitoring Interval

The `kubelet` will initially evaluate eviction thresholds at the same housekeeping interval as `cAdvisor` housekeeping. In Kubernetes 1.2, this defaulted to `10s`.

It is a goal to shrink the monitoring interval to a much shorter window. This may require changes to `cAdvisor` to let alternate housekeeping intervals be specified for selected data (https://github.com/google/cadvisor/issues/1247).

For the purposes of this proposal, we expect the monitoring interval to be no more than `10s` to know when a threshold has been triggered, but we will strive to reduce that latency as time permits.

## Node Conditions

The `kubelet` will support a node condition that corresponds to each eviction signal.

If a hard eviction threshold has been met, or a soft eviction threshold has been met independent of its associated grace period, the `kubelet` will report a condition that reflects the node is under pressure.

The following node conditions are defined that correspond to the specified eviction signal.

| Node Condition | Eviction Signal  | Description                                                       |
|----------------|------------------|-------------------------------------------------------------------|
| MemoryPressure | memory.available | Available memory on the node has satisfied an eviction threshold  |

The `kubelet` will continue to report node status updates at the frequency specified by `--node-status-update-frequency`, which defaults to `10s`.

## Eviction scenario

Let's assume the operator started the `kubelet` with the following:

```
--eviction-hard="memory.available<100Mi"
--eviction-soft="memory.available<300Mi"
--eviction-soft-grace-period="memory.available=30s"
```

The `kubelet` will run a sync loop that looks at the available memory on the node as reported from `cAdvisor` by calculating (capacity - workingSet). If available memory is observed to drop below `100Mi`, the `kubelet` will immediately initiate eviction. If available memory is observed to fall below `300Mi`, it will record when that signal was first observed in an internal cache. If, at the next sync, that criterion is no longer satisfied, the cache is cleared for that signal. If the signal is observed as satisfied for longer than the specified grace period, the `kubelet` will initiate eviction to attempt to reclaim the resource that has met its eviction threshold.
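To make the `<eviction-signal><operator><quantity>` form from the Eviction Thresholds section concrete, here is a minimal sketch of parsing a single threshold token; the `parseQuantity` helper is a simplified stand-in for the Kubernetes quantity parser, not the kubelet's actual flag handling:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// threshold is a parsed eviction threshold such as "memory.available<1.5Gi".
type threshold struct {
	signal   string // e.g. "memory.available"
	operator string // only "<" is valid in this proposal
	bytes    int64  // quantity converted to bytes
}

// parseQuantity is a simplified stand-in for the Kubernetes quantity parser;
// it only understands the binary suffixes used in the examples above.
func parseQuantity(s string) (int64, error) {
	suffixes := map[string]int64{"Ki": 1 << 10, "Mi": 1 << 20, "Gi": 1 << 30}
	for suffix, multiplier := range suffixes {
		if strings.HasSuffix(s, suffix) {
			f, err := strconv.ParseFloat(strings.TrimSuffix(s, suffix), 64)
			if err != nil {
				return 0, err
			}
			return int64(f * float64(multiplier)), nil
		}
	}
	return strconv.ParseInt(s, 10, 64)
}

// parseThreshold parses a single <eviction-signal><operator><quantity> token.
func parseThreshold(expr string) (threshold, error) {
	parts := strings.SplitN(expr, "<", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return threshold{}, fmt.Errorf("invalid eviction threshold %q", expr)
	}
	b, err := parseQuantity(parts[1])
	if err != nil {
		return threshold{}, err
	}
	return threshold{signal: parts[0], operator: "<", bytes: b}, nil
}

func main() {
	t, err := parseThreshold("memory.available<1.5Gi")
	if err != nil {
		panic(err)
	}
	fmt.Printf("signal=%s operator=%s bytes=%d\n", t.signal, t.operator, t.bytes)
}
```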
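The eviction scenario above boils down to a small amount of bookkeeping per signal. The following is a minimal sketch under that scenario's flag values; the `evictionManager` type and its fields are hypothetical, not the kubelet's implementation:

```go
package main

import (
	"fmt"
	"time"
)

// evictionManager sketches the bookkeeping described in the eviction
// scenario: it records when a soft threshold was first observed and only
// signals eviction once the grace period has elapsed.
type evictionManager struct {
	hardThresholdBytes int64         // e.g. memory.available<100Mi
	softThresholdBytes int64         // e.g. memory.available<300Mi
	softGracePeriod    time.Duration // e.g. 30s
	softObservedAt     time.Time     // zero value means "not currently observed"
}

// shouldEvict is called once per monitoring interval with the current
// memory.available observation.
func (m *evictionManager) shouldEvict(availableBytes int64, now time.Time) bool {
	if availableBytes < m.hardThresholdBytes {
		return true // hard threshold: act immediately
	}
	if availableBytes >= m.softThresholdBytes {
		m.softObservedAt = time.Time{} // signal cleared, reset the cache
		return false
	}
	if m.softObservedAt.IsZero() {
		m.softObservedAt = now // first observation of the soft threshold
		return false
	}
	return now.Sub(m.softObservedAt) > m.softGracePeriod
}

func main() {
	m := &evictionManager{
		hardThresholdBytes: 100 << 20,
		softThresholdBytes: 300 << 20,
		softGracePeriod:    30 * time.Second,
	}
	start := time.Now()
	fmt.Println(m.shouldEvict(200<<20, start))                     // false: grace period starts
	fmt.Println(m.shouldEvict(200<<20, start.Add(10*time.Second))) // false: still within grace period
	fmt.Println(m.shouldEvict(200<<20, start.Add(31*time.Second))) // true: soft threshold held past grace period
}
```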
## Eviction of Pods

If an eviction threshold has been met, the `kubelet` will initiate the process of evicting pods until it observes that the signal has gone below its defined threshold.

The eviction sequence works as follows:

* for each monitoring interval, if eviction thresholds have been met
  * find a candidate pod
  * fail the pod
  * block until the pod is terminated on the node

If a pod is not terminated because a container does not happen to die (e.g. processes stuck in disk IO), the `kubelet` may select an additional pod to fail instead. The `kubelet` will invoke the `KillPod` operation exposed on the runtime interface. If an error is returned, the `kubelet` will select a subsequent pod.

## Eviction Strategy

The `kubelet` will implement a default eviction strategy oriented around the pod quality of service class.

It will target pods that are the largest consumers of the starved compute resource relative to their scheduling request. It ranks pods within a quality of service tier in the following order.

* `BestEffort` pods that consume the most of the starved resource are failed first.
* `Burstable` pods that consume the greatest amount of the starved resource relative to their request for that resource are killed first. If no pod has exceeded its request, the strategy targets the largest consumer of the starved resource.
* `Guaranteed` pods that consume the greatest amount of the starved resource relative to their request are killed first. If no pod has exceeded its request, the strategy targets the largest consumer of the starved resource.

A guaranteed pod is guaranteed to never be evicted because of another pod's resource consumption. That said, guarantees are only as good as the underlying foundation they are built upon. If a system daemon (i.e. `kubelet`, `docker`, `journald`, etc.) is consuming more resources than were reserved via `system-reserved` or `kube-reserved` allocations, and the node only has guaranteed pod(s) remaining, then the node must choose to evict a guaranteed pod in order to preserve node stability, and to limit the impact of the unexpected consumption to other guaranteed pod(s).
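A minimal sketch of the ranking described in the eviction strategy above, assuming hypothetical per-pod usage summaries; the tie-breaking for pods below their request is simplified here:

```go
package main

import (
	"fmt"
	"sort"
)

// qosClass mirrors the pod quality of service tiers referenced above.
type qosClass int

const (
	bestEffort qosClass = iota // evicted first
	burstable
	guaranteed // evicted last
)

// podUsage is a hypothetical summary of a candidate pod's memory usage.
type podUsage struct {
	name         string
	qos          qosClass
	requestBytes int64 // memory request (0 for BestEffort)
	usageBytes   int64 // current working set
}

// usageOverRequest is how much of the starved resource the pod consumes
// beyond what it requested; for BestEffort pods the request is zero, so
// this is simply total usage.
func usageOverRequest(p podUsage) int64 {
	return p.usageBytes - p.requestBytes
}

// rankForEviction orders candidates as described in the eviction strategy:
// lower QoS tiers first, and within a tier, the largest consumer of the
// starved resource relative to its request first.
func rankForEviction(pods []podUsage) {
	sort.Slice(pods, func(i, j int) bool {
		if pods[i].qos != pods[j].qos {
			return pods[i].qos < pods[j].qos
		}
		return usageOverRequest(pods[i]) > usageOverRequest(pods[j])
	})
}

func main() {
	pods := []podUsage{
		{name: "guaranteed-db", qos: guaranteed, requestBytes: 1 << 30, usageBytes: 1 << 30},
		{name: "burstable-web", qos: burstable, requestBytes: 256 << 20, usageBytes: 600 << 20},
		{name: "besteffort-batch", qos: bestEffort, usageBytes: 200 << 20},
	}
	rankForEviction(pods)
	for _, p := range pods {
		fmt.Println(p.name)
	}
}
```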
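Combining the eviction sequence from the Eviction of Pods section with a ranking like the one just sketched could look roughly as follows; `killPodFunc` is a placeholder for this sketch, not the runtime interface's real signature:

```go
package main

import (
	"errors"
	"fmt"
)

// killPodFunc stands in for the KillPod operation on the runtime interface;
// it is a placeholder for this sketch.
type killPodFunc func(name string) error

// evictUntilReclaimed walks the ranked candidates and fails pods until the
// pressure signal clears, moving to the next candidate when a kill attempt
// returns an error.
func evictUntilReclaimed(candidates []string, kill killPodFunc, underPressure func() bool) {
	for _, name := range candidates {
		if !underPressure() {
			return // signal has gone back below its threshold
		}
		if err := kill(name); err != nil {
			fmt.Printf("failed to evict %s: %v, trying next candidate\n", name, err)
			continue
		}
		fmt.Printf("evicted %s\n", name)
	}
}

func main() {
	pressureChecks := 0
	underPressure := func() bool {
		pressureChecks++
		return pressureChecks <= 2 // pretend pressure clears after two checks
	}
	kill := func(name string) error {
		if name == "stuck-pod" {
			return errors.New("container did not terminate")
		}
		return nil
	}
	evictUntilReclaimed([]string{"stuck-pod", "besteffort-batch", "burstable-web"}, kill, underPressure)
}
```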
## Kubelet Admission Control

### Feasibility checks during kubelet admission

The `kubelet` will reject `BestEffort` pods if any of the node's associated eviction thresholds has been exceeded, independent of the configured grace period (a minimal sketch of this check appears at the end of this document).

Let's assume the operator started the `kubelet` with the following:

```
--eviction-soft="memory.available<256Mi"
--eviction-soft-grace-period="memory.available=30s"
```

If the `kubelet` sees that it has less than `256Mi` of memory available on the node, but the `kubelet` has not yet initiated eviction since the grace period criterion has not yet been met, the `kubelet` will still immediately fail any incoming best effort pods.

The reasoning for this decision is the expectation that the incoming pod is likely to further starve the particular compute resource, and the `kubelet` should return to a steady state before accepting new workloads.

## Scheduler

The node will report a condition when a compute resource is under pressure. The scheduler should view that condition as a signal to dissuade placing additional best effort pods on the node.

In this case, the `MemoryPressure` condition, if true, should dissuade the scheduler from placing new best effort pods on the node, since they will be rejected by the `kubelet` at admission.

## Best Practices

### DaemonSet

It is never desired for a `kubelet` to evict a pod that was derived from a `DaemonSet`, since the pod will immediately be recreated and rescheduled back to the same node.

At the moment, the `kubelet` has no ability to distinguish a pod created by a `DaemonSet` from a pod created by any other object. If/when that information is available, the `kubelet` could pro-actively filter those pods from the candidate set of pods provided to the eviction strategy.

In general, it is strongly recommended that a `DaemonSet` not create `BestEffort` pods to avoid being identified as a candidate pod for eviction.
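Finally, the feasibility check described under "Feasibility checks during kubelet admission" amounts to a simple gate at pod admission time. A minimal sketch with hypothetical types:

```go
package main

import "fmt"

// pod is a hypothetical, minimal stand-in for the information the kubelet
// needs for this check.
type pod struct {
	name       string
	bestEffort bool
}

// admitPod sketches the feasibility check: BestEffort pods are rejected
// whenever a memory eviction threshold has been exceeded, even if the soft
// grace period has not yet elapsed.
func admitPod(p pod, memoryThresholdExceeded bool) error {
	if p.bestEffort && memoryThresholdExceeded {
		return fmt.Errorf("pod %s rejected: node is under memory pressure", p.name)
	}
	return nil
}

func main() {
	// memory.available is below the soft threshold; the grace period has not
	// yet been met, but incoming BestEffort pods are still rejected.
	thresholdExceeded := true
	fmt.Println(admitPod(pod{name: "best-effort-job", bestEffort: true}, thresholdExceeded))
	fmt.Println(admitPod(pod{name: "guaranteed-db"}, thresholdExceeded))
}
```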