This change adds CDI device IDs to the ContainerAllocateResponse in the
device plugin API. This allows a device plugin to specify CDI devices
by their unique fully-qualified CDI device names using the related field
in the CRI specification.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
After this commit, when LimitedSwap is enabled,
containers would get swap acess limited with respect
the container memory request, total physical memory
on the node, and the swap size on the node.
Pods of Best-Effort / Guaranteed QoS classes don't get
to swap. In addition, container with memory requests
that are equal to their memory limits also don't get to
swap.
The swap limitation is calculated in the following way:
1. Calculate the container's memory proportionate to the node's memory:
- Divide the container's memory request by the total node's physical memory.
Let's call this value ContainerMemoryProportion.
2. Multiply the container memory proportion by the available
swap memory for Pods:
Meaning: ContainerMemoryProportion * TotalPodsSwapAvailable.
Fore more information:
https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md
Signed-off-by: Itamar Holder <iholder@redhat.com>
Before this commit, to find out the current node's
cgroup version, a libcontainers function was used
directly. This way, cgroup version is hard to mock
and is dependant on the environment in which the unit
tests are being run.
After this commit, libcontainer's function is wrapped
within a variable function that can be re-assigned by
the tests. This way the test can easily mock the cgroup
version and become environment independant.
After this commit both cgroup versions v1 and v2
are being tested, no matter in which environment
it runs.
Signed-off-by: Itamar Holder <iholder@redhat.com>
Combining all prepare/unprepare operations for a pod enables plugins to
optimize the execution. Plugins can continue to use the v1beta2 API for now,
but should switch. The new API is designed so that plugins which want to work
on each claim one-by-one can do so and then report errors for each claim
separately, i.e. partial success is supported.
One of the contributing factors of issues #118559 and #109595 hard to
debug and fix is that the devicemanager has very few logs in important
flow, so it's unnecessarily hard to reconstruct the state from logs.
We add minimal logs to be able to improve troubleshooting.
We add minimal logs to be backport-friendly, deferring a more
comprehensive review of logging to later PRs.
Signed-off-by: Francesco Romani <fromani@redhat.com>
When kubelet initializes, runs admission for pods and possibly
allocated requested resources. We need to distinguish between
node reboot (no containers running) versus kubelet restart (containers
potentially running).
Running pods should always survive kubelet restart.
This means that device allocation on admission should not be attempted,
because if a container requires devices and is still running when kubelet
is restarting, that container already has devices allocated and working.
Thus, we need to properly detect this scenario in the allocation step
and handle it explicitely. We need to inform
the devicemanager about which pods are already running.
Note that if container runtime is down when kubelet restarts, the
approach implemented here won't work. In this scenario, so on kubelet
restart containers will again fail admission, hitting
https://github.com/kubernetes/kubernetes/issues/118559 again.
This scenario should however be pretty rare.
Signed-off-by: Francesco Romani <fromani@redhat.com>