This change adds CDI device IDs to the ContainerAllocateResponse in the
device plugin API. This allows a device plugin to specify CDI devices
by their unique fully-qualified CDI device names using the related field
in the CRI specification.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Combining all prepare/unprepare operations for a pod enables plugins to
optimize the execution. Plugins can continue to use the v1beta2 API for now,
but should switch. The new API is designed so that plugins which want to work
on each claim one-by-one can do so and then report errors for each claim
separately, i.e. partial success is supported.
One of the contributing factors of issues #118559 and #109595 hard to
debug and fix is that the devicemanager has very few logs in important
flow, so it's unnecessarily hard to reconstruct the state from logs.
We add minimal logs to be able to improve troubleshooting.
We add minimal logs to be backport-friendly, deferring a more
comprehensive review of logging to later PRs.
Signed-off-by: Francesco Romani <fromani@redhat.com>
When kubelet initializes, runs admission for pods and possibly
allocated requested resources. We need to distinguish between
node reboot (no containers running) versus kubelet restart (containers
potentially running).
Running pods should always survive kubelet restart.
This means that device allocation on admission should not be attempted,
because if a container requires devices and is still running when kubelet
is restarting, that container already has devices allocated and working.
Thus, we need to properly detect this scenario in the allocation step
and handle it explicitely. We need to inform
the devicemanager about which pods are already running.
Note that if container runtime is down when kubelet restarts, the
approach implemented here won't work. In this scenario, so on kubelet
restart containers will again fail admission, hitting
https://github.com/kubernetes/kubernetes/issues/118559 again.
This scenario should however be pretty rare.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Generating the name avoids all potential name collisions. It's not clear how
much of a problem that was because users can avoid them and the deterministic
names for generic ephemeral volumes have not led to reports from users. But
using generated names is not too hard either.
What makes it relatively easy is that the new pod.status.resourceClaimStatus
map stores the generated name for kubelet and node authorizer, i.e. the
information in the pod is sufficient to determine the name of the
ResourceClaim.
The resource claim controller becomes a bit more complex and now needs
permission to modify the pod status. The new failure scenario of "ResourceClaim
created, updating pod status fails" is handled with the help of a new special
"resource.kubernetes.io/pod-claim-name" annotation that together with the owner
reference identifies exactly for what a ResourceClaim was generated, so
updating the pod status can be retried for existing ResourceClaims.
The transition from deterministic names is handled with a special case for that
recovery code path: a ResourceClaim with no annotation and a name that follows
the Kubernetes <= 1.27 naming pattern is assumed to be generated for that pod
claim and gets added to the pod status.
There's no immediate need for it, but just in case that it may become relevant,
the name of the generated ResourceClaim may also be left unset to record that
no claim was needed. Components processing such a pod can skip whatever they
normally would do for the claim. To ensure that they do and also cover other
cases properly ("no known field is set", "must check ownership"),
resourceclaim.Name gets extended.
This chagne introduces a helper to construct ContainerAllocateResponse instances.
Test cases are updated to use a new constructor accepting functional options
allowing the response contents to be set based on the test requirements.
This can then be extended to also test additional fields in the device plugin API
such as annotations which are not currently covered or new fields.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
node.status.volumesInUse should report only attachable volumes, therefore
it needs to wait for the reconciler to update uncertain attachability of
volumes from the API server.
During CSI volume reconstruction it's not possible to tell, if the volume
is attachable or not - CSIDriver instance may not be available, because
kubelet may not have connection to the API server at that time.
Adding uncertain state during reconstruction + adding a correct state when
the API server is available.
- Implement `computeInitContainerActions` to sets the actions for the
init containers, including those with `RestartPolicyAlways`.
- Allow StartupProbe on the restartable init containers.
- Update PodPhase considering the restartable init containers.
- Update PodInitialized status and status manager considering the
restartable init containers.
Co-authored-by: Matthias Bertschy <matthias.bertschy@gmail.com>
The container config image references either an image ID or a digest,
but not the original image from the container config. We require the
image for signature verification to ensure that we actually verify the
correct image.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
hostIPs order may not be be consistent. If secondary IP is before
primary one, current logic adds primary IP twice into PodIPs, which
leads to error: "may specify no more than one IP for each IP family".
In this case, the second IP shouldn't be added.
Co-authored-by: Antonio Ojea <antonio.ojea.garcia@gmail.com>