With Topology Manager enabled by default, we no longer need
`resourceAllocator`: Topology Manager serves as the main
PodAdmitHandler, fully responsible for the admission check
based on hints received from the hintProviders and for the
subsequent allocation of the corresponding resources to a
pod, as can be seen here:
https://github.com/kubernetes/kubernetes/blob/v1.26.0/pkg/kubelet/cm/topologymanager/scope.go#L150
With regard to DRA, passing `cm.draManager` into
resourceAllocator seems redundant, as no admission checks
(or allocation of resources handled by DRA) take place
in the `Admit` method of resourceAllocator. DRA has a completely
different model from the rest of the resource managers: a
pod is only scheduled on a node once resources have been reserved
for it. Because of this, neither admission checks nor waiting for
resources to be provisioned after the pod has been scheduled
on the node are required.
Before making the above change, it was verified that DRA Manager
is instantiated in `NewContainerManager`:
https://github.com/kubernetes/kubernetes/blob/v1.26.0/pkg/kubelet/cm/container_manager_linux.go#L318
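For context, the admission path described above boils down to the kubelet's
PodAdmitHandler contract: the topology manager scope collects hints from its
hint providers and either admits the pod and allocates the corresponding
resources, or rejects it. The following is a minimal, self-contained sketch of
that shape; the type and method names (`hintProvider`, `topologyScope`,
`admit`) are illustrative stand-ins, not the actual kubelet types. DRA never
needs to appear in such a loop, since its resources are reserved before the
pod reaches the node.

```go
package main

import "fmt"

// hintProvider is an illustrative stand-in for the kubelet's topology
// hint providers (CPU manager, device manager, memory manager, ...).
type hintProvider interface {
	// GetBestHint reports whether the provider can satisfy the pod's
	// request with a topology-aligned allocation.
	GetBestHint(pod string) bool
	// Allocate reserves the provider's resources for the pod.
	Allocate(pod string) error
}

// topologyScope mimics the role described above: it is the single admit
// handler responsible for both the admission check and the subsequent
// allocation of resources.
type topologyScope struct {
	providers []hintProvider
}

// admit runs the hint-based admission check and, on success, asks every
// provider to allocate its resources for the pod.
func (s *topologyScope) admit(pod string) error {
	for _, p := range s.providers {
		if !p.GetBestHint(pod) {
			return fmt.Errorf("pod %q rejected: no aligned allocation possible", pod)
		}
	}
	for _, p := range s.providers {
		if err := p.Allocate(pod); err != nil {
			return fmt.Errorf("pod %q rejected: allocation failed: %w", pod, err)
		}
	}
	return nil
}

func main() {
	s := &topologyScope{}
	fmt.Println(s.admit("mypod")) // trivially admitted: no providers registered
}
```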
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Since Topology Manager is graduating to GA, we remove the
`Experimental` prefix from internal configuration variable names.
There is no expected change in behavior, only trivial
variable renaming.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
In case of a node reboot or kubelet restart, the flow of events involves
obtaining the state from the checkpoint file followed by setting
`healthyDevices`/`unhealthyDevices` to their zero values. This is
done to allow the device plugin to re-register itself so that
capacity can be updated appropriately.
During the allocation phase, we need to check that the resources requested
by the pod have been registered AND that healthy devices are present on
the node before they are allocated.
We also need to move this check above the `needed==0` check, where `needed`
is the requested count minus the devices already allocated to the container
(which is obtained from the checkpoint file), because even in cases where
no additional devices have to be allocated (as they were pre-allocated),
we still need to make sure that the previously allocated devices are healthy.
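A minimal sketch of the intended ordering follows; the struct, field, and
function names here (`manager`, `healthyDevices`, `allocated`,
`devicesToAllocate`) are simplified stand-ins, not the exact kubelet code.

```go
package main

import "fmt"

// manager is a pared-down stand-in for the kubelet device manager; the real
// implementation tracks much more state and restores it from a checkpoint.
type manager struct {
	healthyDevices map[string][]string // resource name -> healthy device IDs
	allocated      map[string][]string // resource name -> devices already given to the container
}

// devicesToAllocate sketches the ordering described above: the registration
// and health check happens before the needed == 0 short-circuit, so
// pre-allocated devices restored from the checkpoint are still validated.
func (m *manager) devicesToAllocate(resource string, required int) ([]string, error) {
	allocated := m.allocated[resource]
	needed := required - len(allocated)

	// Check registration and device health first, even if needed == 0.
	healthy, ok := m.healthyDevices[resource]
	if !ok || len(healthy) == 0 {
		return nil, fmt.Errorf("resource %q is unregistered or has no healthy devices", resource)
	}

	if needed == 0 {
		// Nothing new to allocate; reuse the pre-allocated devices.
		return allocated, nil
	}
	if needed > len(healthy) {
		return nil, fmt.Errorf("not enough healthy %q devices: need %d, have %d", resource, needed, len(healthy))
	}
	// Pick the first `needed` healthy devices (ignoring, for brevity, that a
	// real implementation must exclude devices that are already allocated).
	return append(append([]string{}, allocated...), healthy[:needed]...), nil
}

func main() {
	m := &manager{
		healthyDevices: map[string][]string{},
		allocated:      map[string][]string{"example.com/gpu": {"dev-0"}},
	}
	// needed == 0, but the resource is no longer registered: allocation must fail.
	_, err := m.devicesToAllocate("example.com/gpu", 1)
	fmt.Println(err)
}
```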
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
In order to implement the `full-pcpus-only` cpumanager policy option,
we leverage the implementation of the algorithm which picks CPUs.
By design, CPUs are taken from the biggest chunk available (socket
or NUMA zone) to physical cores, down to single cores.
Leveraging this, if the requested CPU count is a multiple of the SMT
level (commonly 2), we're guaranteed that only full physical cores
will be taken.
The hidden assumption here is that this holds true by construction iff
the user reserved CPUs (if any) as full physical cores.
IOW, if the user intentionally or mistakenly reserved single threads
without their core siblings[1], then the simple check we implemented
is not sufficient.
An easy example can probably outline this better. With this setup:
cores: [(0, 4), (1, 5), (2, 6), (3, 8)] (in parens: thread siblings).
SMT level: 2 (each tuple is 2 elements)
Reserved CPUs: 0,1 (explicit pick using `--reserved-cpus`)
A container then requests 6 cpus. full-pcpus-only check: 6 % 2 == 0. Passed.
The CPU allocator will take the full cores first, (2,6) and (3,8), and will
then pick the remaining single CPUs. The allocation will succeed, but
it's incorrect.
We can fix this case with a stricter precheck.
We need to additionally consider all the core siblings of the reserved
CPUs as unavailable when computing the free CPUs, before starting the
actual allocation. Doing so, we fall back to the intended behavior, and
by construction all possible CPU allocations whose size is a multiple
of the SMT level are correct again.
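A minimal sketch of that stricter precheck under the example topology above;
the helper name `availableCPUs` and the plain-slice representation are
illustrative only, not the actual cpumanager code.

```go
package main

import "fmt"

// siblings maps each CPU to its hyperthread siblings for the example
// topology above: cores (0,4) (1,5) (2,6) (3,8).
var siblings = map[int][]int{
	0: {4}, 4: {0},
	1: {5}, 5: {1},
	2: {6}, 6: {2},
	3: {8}, 8: {3},
}

// availableCPUs returns all CPUs minus the reserved ones and, crucially,
// minus every core sibling of a reserved CPU. Leftover single threads of
// partially-reserved cores are never handed out, so any request that is a
// multiple of the SMT level can only be satisfied with full physical cores.
func availableCPUs(all, reserved []int) []int {
	unavailable := map[int]bool{}
	for _, cpu := range reserved {
		unavailable[cpu] = true
		for _, sib := range siblings[cpu] {
			unavailable[sib] = true
		}
	}
	var free []int
	for _, cpu := range all {
		if !unavailable[cpu] {
			free = append(free, cpu)
		}
	}
	return free
}

func main() {
	all := []int{0, 1, 2, 3, 4, 5, 6, 8}
	reserved := []int{0, 1}
	fmt.Println(availableCPUs(all, reserved)) // [2 3 6 8]
}
```

With only 4 allocatable CPUs left, the 6-CPU request from the example is
rejected up front instead of being satisfied with stray single threads.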
+++
[1] or thread siblings in Linux parlance; in any case:
hyperthread siblings of the same physical core
Signed-off-by: Francesco Romani <fromani@redhat.com>
1. Scheduler bug-fix + scheduler-focused E2E tests
2. Add cgroup v2 support for in-place pod resize
3. Enable full E2E pod resize test for containerd>=1.6.9 and EventedPLEG related changes.
Co-Authored-By: Vinay Kulkarni <vskibum@gmail.com>
1. Core Kubelet changes to implement In-place Pod Vertical Scaling.
2. E2E tests for In-place Pod Vertical Scaling.
3. Refactor kubelet code and add missing tests (Derek's kubelet review)
4. Add a new hash over container fields without the Resources field to allow feature gate toggling without restarting containers not using the feature (see the sketch after this list).
5. Fix corner-case where resize A->B->A gets ignored
6. Add cgroup v2 support to pod resize E2E test.
KEP: /enhancements/keps/sig-node/1287-in-place-update-pod-resources
Co-authored-by: Chen Wang <Chen.Wang1@ibm.com>
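Regarding item 4 above, the general idea is to compute the container hash over
a copy whose Resources field has been cleared, so flipping the feature gate
does not change the hash (and therefore does not restart) containers that never
used resize. The sketch below uses a toy `Container` struct and JSON + FNV
hashing purely for illustration; it is not the kubelet's actual hashing code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// Container is a toy stand-in for the pod spec's container type.
type Container struct {
	Name      string            `json:"name"`
	Image     string            `json:"image"`
	Resources map[string]string `json:"resources,omitempty"`
}

// hashWithoutResources hashes a copy of the container with Resources
// cleared, so the hash stays stable whether or not resize is in use.
func hashWithoutResources(c Container) uint64 {
	c.Resources = nil // c is a copy; the caller's container is untouched
	data, _ := json.Marshal(c)
	h := fnv.New64a()
	h.Write(data)
	return h.Sum64()
}

func main() {
	base := Container{Name: "app", Image: "busybox"}
	resized := base
	resized.Resources = map[string]string{"cpu": "2"}
	// Same hash: the container is not restarted when the gate flips.
	fmt.Println(hashWithoutResources(base) == hashWithoutResources(resized)) // true
}
```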
Feedback from https://github.com/kubernetes/utils/pull/267 and related
reviews.
* Equality when insertion order is different
* UnsortedList contents
* Not-Subset cases
* Clone coverage
Describe use cases (node IDs, HT siblings, etc)
Call out novelty (Linux CPU list parse/dump)
Describe future work (relax immutable, refactor to use 'set')
All usage of the builder pattern is convertible to cpuset.New()
with the same or fewer lines of code.
Migrate Builder.Add to a private method of CPUSet, with a comment
that it is only intended for internal use, to preserve the immutability
of the exported interface.
This also removes the 'require' library dependency, which avoids
non-standard library usage.
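For illustration, a typical call-site change looks roughly like the following;
the old Builder calls shown in the comment reflect the pre-change API as I
understand it, and the exact String() output should be treated as approximate.

```go
package main

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

func main() {
	// Before (builder pattern, roughly):
	//   b := cpuset.NewBuilder()
	//   b.Add(0, 1, 2)
	//   cs := b.Result()
	//
	// After: a single constructor call builds the immutable set.
	cs := cpuset.New(0, 1, 2)
	fmt.Println(cs.String()) // "0-2"
}
```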
FilterNot is only used in this file, and is trivially converted to a
'filter' call site by inverting the predicate.
Filter is only used in this file, so don't export it.
In 'set', conversions to slice are also provided, but with different names:
ToSliceNoSort() -> UnsortedList()
ToSlice() -> List()
Reimplement List() in terms of UnsortedList to save some duplication.
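The rename and the reimplementation of List() in terms of UnsortedList()
amount to roughly the following. This is a sketch of the shape only: the
pared-down CPUSet type and its `elems` field are stand-ins, not a verbatim
copy of the library code.

```go
package main

import (
	"fmt"
	"sort"
)

// CPUSet is a pared-down illustration of the set type: a thin wrapper
// around a map used as a set of CPU IDs.
type CPUSet struct {
	elems map[int]struct{}
}

// UnsortedList returns the elements in arbitrary (map iteration) order;
// this replaces the old ToSliceNoSort().
func (s CPUSet) UnsortedList() []int {
	result := make([]int, 0, len(s.elems))
	for cpu := range s.elems {
		result = append(result, cpu)
	}
	return result
}

// List returns the elements in ascending order (the old ToSlice()),
// reimplemented on top of UnsortedList to avoid duplicating the map walk.
func (s CPUSet) List() []int {
	result := s.UnsortedList()
	sort.Ints(result)
	return result
}

func main() {
	s := CPUSet{elems: map[int]struct{}{3: {}, 1: {}, 2: {}}}
	fmt.Println(s.List()) // [1 2 3]
}
```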
Removes exit/fatal from cpuset library.
Usage in podresources test was not necessary.
Library reference in cpu_manager_test was moved to a local function, and
converted to use e2e test framework error catching.
When the kubelet starts a Pod that requires device resources, if the device
plugin updates the devices at the same time, it may cause the kubelet to crash.
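The general shape of the race is concurrent reads and writes to shared device
state. The sketch below shows the usual pattern of guarding such state with a
mutex and handing out snapshots; it illustrates the pattern only and is not
the actual kubelet fix.

```go
package main

import (
	"fmt"
	"sync"
)

// deviceState is a toy stand-in for the shared device bookkeeping that both
// the pod-startup path and the plugin-update path touch.
type deviceState struct {
	mu      sync.RWMutex
	devices map[string]string // device ID -> health
}

// update is what the plugin-update path does: it rewrites the map.
func (s *deviceState) update(devs map[string]string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.devices = devs
}

// snapshot is what the pod-startup path should read: a copy taken under the
// lock, so later mutations cannot race with the caller.
func (s *deviceState) snapshot() map[string]string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make(map[string]string, len(s.devices))
	for id, health := range s.devices {
		out[id] = health
	}
	return out
}

func main() {
	s := &deviceState{devices: map[string]string{"dev-0": "Healthy"}}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); s.update(map[string]string{"dev-1": "Healthy"}) }()
	go func() { defer wg.Done(); _ = s.snapshot() }()
	wg.Wait()
	fmt.Println(s.snapshot())
}
```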
Signed-off-by: huyinhou <huyinhou@bytedance.com>