As part of the memory manager GA graduation effort, we should add
metrics in order to iprove observability.
The metrics also mentioned in the PR https://github.com/kubernetes/enhancements/pull/4251 (which was not merged yet)
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
Today, DRA manager does not call plugin NodePrepareResource
for claims that it previously successfully handled, that is,
if claims are present in cache (checkpoint) even if node
rebooted.
After node reboots, it is required to call DRA plugin
for resource claims so that plugins may prepare them
again in case the resources dont persist reboot.
To achieve that, once kubelet is started, we call DRA
plugins for claims once if a pod sandbox is required
to be created during PodSync.
Signed-off-by: adrianc <adrianc@nvidia.com>
The cpu accumulator logic (that select CPUs for containers)
has some non-obvious code.
This commit adds some comments to make that code easier to
understand for new contributors. Some minor renames to
improve readability are also performed.
Add retry mechanism to handle cases where after kubelet restarts, the device
plugin unix socket(s) were created but not ready to serve yet.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Add kubeletSocket file to fsnotify instead of polling and waiting for deletion
of device plugin unix socket as a way of detecting kubelet restart. We need to
ensure that the device plugin re-registers itself after kubelet restart depending
on the configured registration mode (auto-registration or controller registration).
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
If the user specifies the intent to control registration process, we rely on
registration triggers (deletion of control file) to prompt registration.
This behvaiour is expected to be consistent across kubelet restarts and therefore
across the watch calls where we watch for changes to the unix socket so we make
this part of Stub object instead of a parameter.
Co-authored-by: Francesco Romani <fromani@redhat.com>
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
In case `REGISTER_CONTROL_FILE` is specified, we want to ensure that the
registration is triggered by deletion of the control file. This is
applicable both when the registration happens for the first time and
subsequent ones because of kubelet restarts.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
In issue: 115107 we added an environment variable to control the registration of sample
device plugin to kubelet. The intent of this patch is to ensure that the default
behaviour of the plugin is to register to kubelet (in case no environment
variable is specified).
In addition to that, we want to ensure that the plugin registers itself not just once.
It should re-register itself to kubelet in case of node reboot or kubelet restarts.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Include number of requested and available CPUs in the error message
when the assignment of CPUs fails because there are less available
CPUs than requested.
Add file doc.go with some rudimentary information to package
kubelet/cm. This will make it easier for people approaching the
kubelet codebase for the first time to quickly understand what's
in the package, since its name is abbreviated and hostile to
newcomers.
This removes deprecated sets.String and sets.Int
- replace sets.String with sets.Set[string]
- replace sets.Int with sets.Set[int]
- replace sets.NewString with sets.New[string]
- replace sets.NewInt with sets.New[int]
- replace sets.(OLD).List with sets.List(NEW)
The 10 second timeout was too low. Given that the retry loop for the
kubelet itself is 90s, increasing the timeout to half of this seems
reasonable. Ideally we would pull in the variable that sets the retry
timeout to 90s and then just set our local timeout to half of that.
Unfortunately, this is not exported, so we settle (for now with just
explicitly setting it to 45s.
Signed-off-by: Kevin Klues <kklues@nvidia.com>