In future commits we will need this to set the user/group of supported
volumes of KEP 127 - Phase 1.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
it is used to allocate and keep track of the unique users ranges
assigned to each pod that runs in a user namespace.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Co-authored-by: Rodrigo Campos <rodrigoca@microsoft.com>
The pod worker is the owner of when a container is running or not,
and the start and stop of the probes for a given pod should be
handled during the pod sync loop. This ensures that probes do not
continue running even after eviction.
Because the pod semantics allow lifecycle probes to shorten grace
period, the probe is removed after the containers in a pod are
terminated successfully. As an optimization, if the pod will have
a very short grace period (0 or 1 seconds) we stop the probes
immediately to reduce resource usage during eviction slightly.
After this change, the probe manager is only called by the pod
worker or by the reconcile loop.
* Add FeatureGate PodHostIPs
* Add HostIPs field and update PodIPs field
* Types conversion
* Add dropDisabledStatusFields
* Add HostIPs for kubelet
* Add fuzzer for PodStatus
* Add status.hostIPs in ConvertDownwardAPIFieldLabel
* Add status.hostIPs in validEnvDownwardAPIFieldPathExpressions
* Downward API support for status.hostIPs
* Add DownwardAPI validation for status.hostIPs
* Add e2e to check that hostIPs works
* Add e2e to check that Downward API works
* Regenerate
Other components must know when the Kubelet has released critical
resources for terminal pods. Do not set the phase in the apiserver
to terminal until all containers are stopped and cannot restart.
As a consequence of this change, the Kubelet must explicitly transition
a terminal pod to the terminating state in the pod worker which is
handled by returning a new isTerminal boolean from syncPod.
Finally, if a pod with init containers hasn't been initialized yet,
don't default container statuses or not yet attempted init containers
to the unknown failure state.
The field in fact says that the container runtime should relabel a volume
when running a container with it, it does not say that the volume supports
SELinux. For example, NFS can support SELinux, but we don't want NFS
volumes relabeled, because they can be shared among several Pods.
If CRI returns a container that has been created but is not running,
it is not safe to assume it is terminal, as our connection to CRI
may have failed. Instead, created is treated as waiting, as in
"waiting for this container to start". Either syncPod or
syncTerminatingPod is responsible for handling this state.
host-network pods IPs are obtained from the reported kubelet nodeIPs.
Historically, host-network podIPs are immutable once set, but when
we've added dual-stack support, we didn't consider that the secondary
IP address may not be present at the same time that the primary nodeIP.
If a secondary IP address is added to a node after the host-network pods
IPs are set, we can add the secondary host-network pod IP address
maintaining the current behavior of not updating the current podIPs on
host-network pods.
This patch makes the CRI `v1` API the new project-wide default version.
To allow backwards compatibility, a fallback to `v1alpha2` has been added
as well. This fallback can either used by automatically determined by
the kubelet.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
The feature gate gets locked to "true", with the goal to remove it in two
releases.
All code now can assume that the feature is enabled. Tests for "feature
disabled" are no longer needed and get removed.
Some code wasn't using the new helper functions yet. That gets changed while
touching those lines.
The name concatenation and ownership check were originally considered small
enough to not warrant dedicated functions, but the intent of the code is more
readable with them.
When adding the ephemeral volume feature, the special case for
PersistentVolumeClaim volume sources in kubelet's host path and node
limits checks was overlooked. An ephemeral volume source is another
way of referencing a claim and has to be treated the same way.
Remove the VolumeSubpath feature gate.
Feature gate convention has been updated since this was introduced to
indicate that they "are intended to be deprecated and removed after a
feature becomes GA or is dropped.".
If a pod is killed (no longer wanted) and then a subsequent create/
add/update event is seen in the pod worker, assume that a pod UID
was reused (as it could be in static pods) and have the next
SyncKnownPods after the pod terminates remove the worker history so
that the config loop can restart the static pod, as well as return
to the caller the fact that this termination was not final.
The housekeeping loop then reconciles the desired state of the Kubelet
(pods in pod manager that are not in a terminal state, i.e. admitted
pods) with the pod worker by resubmitting those pods. This adds a
small amount of latency (2s) when a pod UID is reused and the pod
is terminated and restarted.
A pod that has been rejected by admission will have status manager
set the phase to Failed locally, which make take some time to
propagate to the apiserver. The rejected pod will be included in
admission until the apiserver propagates the change back, which
was an unintended regression when checking pod worker state as
authoritative.
A pod that is terminal in the API may still be consuming resources
on the system, so it should still be included in admission.
Fixes two issues with how the pod worker refactor calculated the
pods that admission could see (GetActivePods() and
filterOutTerminatedPods())
First, completed pods must be filtered from the "desired" state
for admission, which arguably should be happening earlier in
config. Exclude the two terminal pods states from GetActivePods()
Second, the previous check introduced with the pod worker lifecycle
ownership changes was subtly wrong for the admission use case.
Admission has to include pods that haven't yet hit the pod worker,
which CouldHaveRunningContainers was filtering out (because the
pod worker hasn't seen them). Introduce a weaker check -
IsPodKnownTerminated() - that returns true only if the pod is in
a known terminated state (no running containers AND known to pod
worker). This weaker check may only be called from components that
need admitted pods, not other kubelet subsystems.
This commit does not fix the long standing bug that force deleted
pods are omitted from admission checks, which must be fixed by
having GetActivePods() also include pods "still terminating".
* Change uses of whitelist to allowlist in kubelet sysctl
* Rename whitelist files to allowlist in Kubelet sysctl
* Further renames of whitelist to allowlist in Kubelet
* Rename podsecuritypolicy uses of whitelist to allowlist
* Update pkg/kubelet/kubelet.go
Co-authored-by: Danielle <dani@builds.terrible.systems>
Co-authored-by: Danielle <dani@builds.terrible.systems>
The Kubelet always clears reason and message in generateAPIPodStatus
even when the phase is unchanged. It is reasonable that we preserve
the previous values when the phase does not change, and clear it
when the phase does change.
When a pod is evicted, this ensurse that the eviction message and
reason are propagated even in the face of subsequent updates. It also
preserves the message and reason if components beyond the Kubelet
choose to set that value.
To preserve the value we need to know the old phase, which requires
a change to convertStatusToAPIStatus so that both methods have
access to it.
Prevent Kubelet from incorrectly interpreting "not yet started" pods as "ready to terminate pods" by unifying responsibility for pod lifecycle into pod worker
runtimes may return an arbitrary number of Pod IPs, however, kubernetes
only takes into consideration the first one of each IP family.
The order of the IPs are the one defined by the Kubelet:
- default prefer IPv4
- if NodeIPs are defined, matching the first nodeIP family
PodIP is always the first IP of PodIPs.
The downward API must expose the same IPs and in the same order than
the pod.Status API object.
A number of race conditions exist when pods are terminated early in
their lifecycle because components in the kubelet need to know "no
running containers" or "containers can't be started from now on" but
were relying on outdated state.
Only the pod worker knows whether containers are being started for
a given pod, which is required to know when a pod is "terminated"
(no running containers, none coming). Move that responsibility and
podKiller function into the pod workers, and have everything that
was killing the pod go into the UpdatePod loop. Split syncPod into
three phases - setup, terminate containers, and cleanup pod - and
have transitions between those methods be visible to other
components. After this change, to kill a pod you tell the pod worker
to UpdatePod({UpdateType: SyncPodKill, Pod: pod}).
Several places in the kubelet were incorrect about whether they
were handling terminating (should stop running, might have
containers) or terminated (no running containers) pods. The pod worker
exposes methods that allow other loops to know when to set up or tear
down resources based on the state of the pod - these methods remove
the possibility of race conditions by ensuring a single component is
responsible for knowing each pod's allowed state and other components
simply delegate to checking whether they are in the window by UID.
Removing containers now no longer blocks final pod deletion in the
API server and are handled as background cleanup. Node shutdown
no longer marks pods as failed as they can be restarted in the
next step.
See https://docs.google.com/document/d/1Pic5TPntdJnYfIpBeZndDelM-AbS4FN9H2GTLFhoJ04/edit# for details