Prevent the kubelet from incorrectly interpreting "not yet started" pods as "ready to terminate" by unifying responsibility for the pod lifecycle into the pod worker
A number of race conditions exist when pods are terminated early in
their lifecycle, because components in the kubelet that need to know
"no containers are running" or "no containers can be started from now
on" were relying on outdated state.
Only the pod worker knows whether containers are being started for
a given pod, which is required to know when a pod is "terminated"
(no running containers, none coming). Move that responsibility and
the podKiller function into the pod workers, and route everything that
was killing the pod through the UpdatePod loop. Split syncPod into
three phases - setup, terminate containers, and clean up pod - and
make transitions between those methods visible to other
components. After this change, to kill a pod you tell the pod worker
to UpdatePod({UpdateType: SyncPodKill, Pod: pod}).
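A minimal, self-contained sketch of this flow, with simplified stand-ins for the kubelet's real UpdatePodOptions, SyncPodKill, and pod worker types (the actual definitions live in pkg/kubelet):

```go
package main

import "fmt"

// Simplified stand-ins for the kubelet's pod worker types.
type SyncPodType int

const (
	SyncPodSync SyncPodType = iota
	SyncPodKill
)

type Pod struct{ Name string }

type UpdatePodOptions struct {
	UpdateType SyncPodType
	Pod        *Pod
}

// PodWorkers is the single component that owns each pod's lifecycle.
type PodWorkers struct{}

// UpdatePod funnels every sync and kill request through one loop, so only
// the pod worker decides whether containers may still be started.
func (w *PodWorkers) UpdatePod(opts UpdatePodOptions) {
	if opts.UpdateType == SyncPodKill {
		fmt.Printf("terminating %s: stop containers, start none\n", opts.Pod.Name)
		return
	}
	fmt.Printf("syncing %s\n", opts.Pod.Name)
}

func main() {
	w := &PodWorkers{}
	// Killing a pod is now just another update delivered to the worker.
	w.UpdatePod(UpdatePodOptions{UpdateType: SyncPodKill, Pod: &Pod{Name: "example"}})
}
```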
Several places in the kubelet were incorrect about whether they
were handling terminating (should stop running, might still have
containers) or terminated (no running containers) pods. The pod worker
now exposes methods that allow other loops to know when to set up or
tear down resources based on the state of the pod. These methods remove
the possibility of race conditions by ensuring a single component is
responsible for knowing each pod's allowed state; other components
simply check, by UID, whether a pod is within the relevant window.
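The shape of those per-UID queries, sketched below with illustrative method names modeled on the pod worker's API:

```go
package podworkers

// UID identifies a pod; a stand-in for k8s.io/apimachinery/pkg/types.UID.
type UID string

// PodStateProvider sketches the per-UID lifecycle queries the pod worker
// exposes; method names here approximate the real kubelet API. Other loops
// call these instead of tracking pod state themselves.
type PodStateProvider interface {
	// CouldHaveRunningContainers is true until the pod worker has observed
	// all containers for the pod stop and knows none will be started.
	CouldHaveRunningContainers(uid UID) bool
	// IsPodTerminationRequested is true from the moment a kill was
	// requested, even while containers may still be running.
	IsPodTerminationRequested(uid UID) bool
	// ShouldPodContentBeRemoved is true only once the pod is fully
	// terminated and its resources may be cleaned up in the background.
	ShouldPodContentBeRemoved(uid UID) bool
}
```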
Removing containers no longer blocks final pod deletion in the
API server; container removal is handled as background cleanup. Node
shutdown no longer marks pods as failed, as they can be restarted in
the next step.
See https://docs.google.com/document/d/1Pic5TPntdJnYfIpBeZndDelM-AbS4FN9H2GTLFhoJ04/edit# for details
When run as a non-root user, TestAttacherMountDevice fails because a
missing nil check induces a panic. Fixed by checking the error returned
by user.Current() before using the returned user value.
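The pattern of the fix, as a minimal standalone example:

```go
package main

import (
	"fmt"
	"os/user"
)

func main() {
	// Check the error before touching the returned *user.User: when
	// user.Current() fails, u is nil and dereferencing it panics.
	u, err := user.Current()
	if err != nil {
		fmt.Println("cannot determine current user:", err)
		return
	}
	fmt.Println("running as UID", u.Uid)
}
```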
When UnmountDevice fails, the kubelet now treats the volume mount as
uncertain, because it does not know at which stage UnmountDevice failed;
the device may already be partially unmounted or destroyed.
As a result, MountDevice will be performed when a new Pod is started on
the node after an UnmountDevice failure.
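A rough sketch of that state transition, with hypothetical names (the real bookkeeping lives in the kubelet's actual-state-of-world cache):

```go
package volumemanager

type deviceState int

const (
	deviceMounted deviceState = iota
	deviceUncertain
	deviceUnmounted
)

// unmountDevice records uncertainty on failure instead of assuming the
// device is still cleanly mounted; setState stands in for the kubelet's
// actual-state-of-world bookkeeping.
func unmountDevice(unmount func() error, setState func(deviceState)) error {
	if err := unmount(); err != nil {
		// The unmount may have partially succeeded; marking the mount
		// uncertain forces MountDevice to run again for the next pod.
		setState(deviceUncertain)
		return err
	}
	setState(deviceUnmounted)
	return nil
}
```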
iSCSI and FC volume plugins do not implement real third-party
attach/detach. If reconstruction fails with an error on an FC or iSCSI
volume, the volume will not be unmounted from the volume global dir, yet
at the same time it will be marked as unused and available to be mounted
on another node. The volume can then be mounted on several nodes,
resulting in volume corruption.
The other block-based volume plugins implement attach/detach that either
leaves the volume stuck (it can't be detached) or force-detaches it from
a node before attaching it somewhere else.
When UnmountDevice() of a FibreChannel volume fails after unmounting the
device but before the device is fully cleaned up, a subsequent
UnmountDevice() retry won't find the device mounted and returns without
retrying the device cleanup.
Therefore, implement a retry inside UnmountDevice() itself to make sure
that the volume devices are either fully cleaned up or the error is
serious enough that even a minute of retrying does not help.
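One way such an inner retry could look; the interval and one-minute deadline are illustrative, and cleanupDevice / clean are hypothetical names:

```go
package fibrechannel

import (
	"fmt"
	"time"
)

// cleanupDevice retries the device cleanup on its own, rather than relying
// on the outer UnmountDevice retry, which bails out once the mount is gone.
func cleanupDevice(clean func() error) error {
	const (
		interval = 2 * time.Second
		timeout  = time.Minute
	)
	deadline := time.Now().Add(timeout)
	for {
		err := clean()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			// Persistent failure: surface the error instead of leaving
			// the device half cleaned up and reported as unmounted.
			return fmt.Errorf("device cleanup failed after %v: %w", timeout, err)
		}
		time.Sleep(interval)
	}
}
```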
Volumes that are provisioned with `VolumeMode: Block` often have a
MetricsProvider interface declared in their type. However, the
MetricsProvider should implement a GetMetrics() function. In the cases
where the storage drivers do not implement GetMetrics(), a panic can
occur.
Usual type assertions are not sufficient in this case. All assertions
assume the interface is present, and there is no straightforward way to
verify that a valid GetMetrics() function is provided.
By adding SupportsMetrics(), storage driver implementations require
careful reviewing for metrics support.
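A sketch of how callers can use such a check, with simplified local types standing in for the kubelet's volume interfaces:

```go
package volume

// Metrics holds capacity/usage numbers for a volume (simplified).
type Metrics struct {
	Capacity int64
}

// MetricsProvider mirrors the existing interface: every implementation
// declares GetMetrics(), whether or not it really supports it.
type MetricsProvider interface {
	GetMetrics() (*Metrics, error)
}

// BlockVolume sketches the addition described above: an explicit
// SupportsMetrics() that callers consult before GetMetrics(), because a
// plain type assertion on MetricsProvider always succeeds even when the
// driver left GetMetrics() unimplemented.
type BlockVolume interface {
	MetricsProvider
	SupportsMetrics() bool
}

func readCapacity(v BlockVolume) (int64, bool) {
	if !v.SupportsMetrics() {
		return 0, false // avoid calling a stub GetMetrics() that may panic
	}
	m, err := v.GetMetrics()
	if err != nil || m == nil {
		return 0, false
	}
	return m.Capacity, true
}
```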
PR #97972 added support for gathering metrics for Block PVCs provided by
CSI drivers. The in-tree drivers can support at least the most basic
metric: capacity.
Similar to how NewMetricsStatFS() works, the new NewMetricsBlock()
provides the GetMetrics() interface for Block volumes.
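A plausible usage sketch, assuming a hypothetical in-tree plugin type (myBlockVolume, devicePath) that wires the helper into its metrics hook:

```go
package fcplugin

import (
	"k8s.io/kubernetes/pkg/volume"
)

// myBlockVolume is a hypothetical in-tree block volume implementation.
type myBlockVolume struct {
	devicePath string
}

func (v *myBlockVolume) SupportsMetrics() bool { return true }

// GetMetrics delegates to NewMetricsBlock, mirroring how filesystem-backed
// volumes delegate to NewMetricsStatFS; capacity is read from the block
// device itself, so no filesystem is assumed.
func (v *myBlockVolume) GetMetrics() (*volume.Metrics, error) {
	return volume.NewMetricsBlock(v.devicePath).GetMetrics()
}
```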
Additional metrics for Block volumes are difficult to gather. There is
no guarantee that there is a filesystem on the volume, which makes most
of the volume metrics useless.
Advanced storage might be able to detect the actual consumption (when
thin-provisioned) vs. the capacity. However, this is out of scope for
a standard helper function and requires intimate knowledge of the
storage system used.