Implement DOS prevention wiring a global rate limit for podresources
API. The goal here is not to introduce a general ratelimiting solution
for the kubelet (we need more research and discussion to get there),
but rather to prevent misuse of the API.
Known limitations:
- the rate limits value (QPS, BurstTokens) are hardcoded to
"high enough" values.
Enabling user-configuration would require more discussion
and sweeping changes to the other kubelet endpoints, so it
is postponed for now.
- the rate limiting is global. Malicious clients can starve other
clients consuming the QPS quota.
Add e2e test to exercise the flow, because the wiring itself
is mostly boilerplate and API adaptation.
The pod_logs subsystem was inadvertently made redundant in the following
kube-apiserver metrics:
- kube_apiserver_pod_logs_pods_logs_backend_tls_failure_total
- kube_apiserver_pod_logs_pods_logs_insecure_backend_total
To safely rename them, it is required to deprecate them in 1.27 whilst
introducing the new metrics replacing them.
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
* scheduler(NodeResourcesFit): calculatePodResourceRequest in PreScore phase
* scheduler(NodeResourcesFit and NodeResourcesBalancedAllocation): calculatePodResourceRequest in PreScore phase
* modify the comments and tests.
* revert the tests.
* don't need consider nodes.
* use list instead of map.
* add comment for podRequests.
* avoid using negative wording in variable names.
DesiredStateOfWorld must remember both
- the effective SELinux label to apply as a mount option (non-empty for
RWOP volumes, empty otherwise)
- and the label that _would_ be used if the mount option would be used by
all access modes.
Mismatch warning metrics must be generated from the second label.
The checkpointing mechanism will repopulate DRA Manager in-memory cache on kubelet restart.
This will ensure that the information needed by the PodResources API is available across
a kubelet restart.
The ClaimInfoState struct represent the DRA Manager in-memory cache state in checkpoint.
It is embedd in the ClaimInfo which also include the annotation field. The separation between
the in-memory cache and the cache state in the checkpoint is so we won't be tied to the in-memory
cache struct which may change in the future. In the ClaimInfoState we save the minimal required fields
to restore the in-memory cache.
Signed-off-by: Moshe Levi <moshele@nvidia.com>
To enable rate limiting, needed for GA graduation,
we need to pass more parameters to the already crowded
`ListenAndServePodresources` function.
To tidy up a bit, pack the parameters in a helper struct,
with no intended changes in behavior.
Signed-off-by: Francesco Romani <fromani@redhat.com>
If a CRI error occurs during the terminating phase after a pod is
force deleted (API or static) then the housekeeping loop will not
deliver updates to the pod worker which prevents the pod's state
machine from progressing. The pod will remain in the terminating
phase but no further attempts to terminate or cleanup will occur
until the kubelet is restarted.
The pod worker now maintains a store of the pods state that it is
attempting to reconcile and uses that to resync unknown pods when
SyncKnownPods() is invoked, so that failures in sync methods for
unknown pods no longer hang forever.
The pod worker's store tracks desired updates and the last update
applied on podSyncStatuses. Each goroutine now synchronizes to
acquire the next work item, context, and whether the pod can start.
This synchronization moves the pending update to the stored last
update, which will ensure third parties accessing pod worker state
don't see updates before the pod worker begins synchronizing them.
As a consequence, the update channel becomes a simple notifier
(struct{}) so that SyncKnownPods can coordinate with the pod worker
to create a synthetic pending update for unknown pods (i.e. no one
besides the pod worker has data about those pods). Otherwise the
pending update info would be hidden inside the channel.
In order to properly track pending updates, we have to be very
careful not to mix RunningPods (which are calculated from the
container runtime and are missing all spec info) and config-
sourced pods. Update the pod worker to avoid using ToAPIPod()
and instead require the pod worker to directly use
update.Options.Pod or update.Options.RunningPod for the
correct methods. Add a new SyncTerminatingRuntimePod to prevent
accidental invocations of runtime only pod data.
Finally, fix SyncKnownPods to replay the last valid update for
undesired pods which drives the pod state machine towards
termination, and alter HandlePodCleanups to:
- terminate runtime pods that aren't known to the pod worker
- launch admitted pods that aren't known to the pod worker
Any started pods receive a replay until they reach the finished
state, and then are removed from the pod worker. When a desired
pod is detected as not being in the worker, the usual cause is
that the pod was deleted and recreated with the same UID (almost
always a static pod since API UID reuse is statistically
unlikely). This simplifies the previous restartable pod support.
We are careful to filter for active pods (those not already
terminal or those which have been previously rejected by
admission). We also force a refresh of the runtime cache to
ensure we don't see an older version of the state.
Future changes will allow other components that need to view the
pod worker's actual state (not the desired state the podManager
represents) to retrieve that info from the pod worker.
Several bugs in pod lifecycle have been undetectable at runtime
because the kubelet does not clearly describe the number of pods
in use. To better report, add the following metrics:
kubelet_desired_pods: Pods the pod manager sees
kubelet_active_pods: "Admitted" pods that gate new pods
kubelet_mirror_pods: Mirror pods the kubelet is tracking
kubelet_working_pods: Breakdown of pods from the last sync in
each phase, orphaned state, and static or not
kubelet_restarted_pods_total: A counter for pods that saw a
CREATE before the previous pod with the same UID was finished
kubelet_orphaned_runtime_pods_total: A counter for pods detected
at runtime that were not known to the kubelet. Will be
populated at Kubelet startup and should never be incremented
after.
Add a metric check to our e2e tests that verifies the values are
captured correctly during a serial test, and then verify them in
detail in unit tests.
Adds 23 series to the kubelet /metrics endpoint.
When using JSON as output format, we want objectReference values to be
represented as a struct. For example, "item" is such a value:
{"ts":1678135015708.349,"caller":"garbagecollector/garbagecollector.go:595","msg":"classify object references","v":5,"item":{"name":"dra-test-driver-g4tkd","namespace":"dra-1830","apiVersion":"v1","uid":"c3f88616-7282-488c-887c-3f04291e6f4f"},"solid":null,"dangling":[{"apiVersion":"apps/v1","kind":"ReplicaSet","name":"dra-test-driver","uid":"dbe9a90c-9dfd-4ad0-8395-e5fa228f9851","controller":true,"blockOwnerDeletion":true}],"waitingForDependentsDeletion":null}
* migrating pkg/controller/serviceaccount to contextual logging
Signed-off-by: Naman <namanlakhwani@gmail.com>
* small nit
Signed-off-by: Naman <namanlakhwani@gmail.com>
* capitalising first letter of error
Signed-off-by: Naman <namanlakhwani@gmail.com>
* addressed review comments
Signed-off-by: Naman <namanlakhwani@gmail.com>
* small nit to add key
Signed-off-by: Naman <namanlakhwani@gmail.com>
---------
Signed-off-by: Naman <namanlakhwani@gmail.com>
* migrated controller/replicaset to contextual logging
Signed-off-by: Naman <namanlakhwani@gmail.com>
* small nits
Signed-off-by: Naman <namanlakhwani@gmail.com>
* addressed changes
Signed-off-by: Naman <namanlakhwani@gmail.com>
* small nit
Signed-off-by: Naman <namanlakhwani@gmail.com>
* taking t as input
Signed-off-by: Naman <namanlakhwani@gmail.com>
---------
Signed-off-by: Naman <namanlakhwani@gmail.com>
With Topology Manager enabled by default, we no longer need
`resourceAllocator` as Topology Manager serves as the main
PodAdmitHandler completely responsible for admission check
based on hints received from the hintProviders and the
subsequent allocation of the corresponding resources to a
pod as can be seen here:
https://github.com/kubernetes/kubernetes/blob/v1.26.0/pkg/kubelet/cm/topologymanager/scope.go#L150
With regard to DRA, the passing of `cm.draManager` into
resourceAllocator seems redundant as no admission checks
(and allocation of resources handled by DRA) is taking place
in `Admit` method of resourceAllocator. DRA has a completely
different model to the rest of the resource managers where
pod is only scheduled on a node once resources are reserved
for it. Because of this, admission checks or waiting for
resources to be provisioned after the pod has been scheduled
on the node is not required.
Before making the above change, it was verified that DRA Manager
is instantiated in `NewContainerManager`:
https://github.com/kubernetes/kubernetes/blob/v1.26.0/pkg/kubelet/cm/container_manager_linux.go#L318
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Since Topology manager is graduating to GA, we remove
internal configuration variable names with `Experimental`
prefix.
There is no expected change in behavior, only trival
variable renaming.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
In case of node reboot/kubelet restart, the flow of events involves
obtaining the state from the checkpoint file followed by setting
the `healthDevices`/`unhealthyDevices` to its zero value. This is
done to allow the device plugin to re-register itself so that
capacity can be updated appropriately.
During the allocation phase, we need to check if the resources requested
by the pod have been registered AND healthy devices are present on
the node to be allocated.
Also we need to move this check above `needed==0` where needed is
required - devices allocated to the container (which is obtained from
the checkpoint file) because even in cases where no additional devices
have to be allocated (as they were pre-allocated), we still need to
make the devices that were previously allocated are healthy.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Today, the health check response to the load balancers asking Kube-proxy for
the status of ETP:Local services does not include the healthz state of Kube-
proxy. This means that Kube-proxy might indicate to load balancers that they
should forward traffic to the node in question, simply because the endpoint
is running on the node - this overlooks the fact that Kube-proxy might be
not-healthy and hasn't successfully written the rules enabling traffic to
reach the endpoint.
- set higher severity and log level when unmanaged pods found and improve testing
- do not mention unsupported controller when triggering event for
unmanaged pods (this is covered by CalculateExpectedPodCountFailed
event)
- test unsupported controller
- make testing for events non blocking when event not found
In order to implement the `full-pcpus-only` cpumanager policy option,
we leverage the implementation of the algorithm which picks CPUs.
By design, CPUs are taken from the biggest chunk available (socket
or NUMA zone) to physical cores, down to single cores.
Leveraging this, if the requested CPU count is a multiple of the SMT
level (commonly 2), we're guaranteed that only full physical cores
will be taken.
The hidden assumption here is this holds true by construction iff
the user reserved CPUs (if any) considering full physical CPUs.
IOW, if the user did intentionally or mistakely reserve single threads
which are no core siblings[1], then the simple check we implemented
is not sufficient.
A easy example can probably outline this better. With this setup:
cores: [(0, 4), (1, 5), (2, 6), (3, 8)] (in parens: thread siblings).
SMT level: 2 (each tuple is 2 elements)
Reserved CPUs: 0,1 (explicit pick using `--reserved-cpus`)
A container then requests 6 cpus. full-pcpus-only check: 6 % 2 == 0. Passed.
The CPU allocator will take first full cores, (2,6) and (3,8), and will
then pick the remaining single CPUs. The allocation will succeed, but
it's incorrect.
We can fix this case with a stricter precheck.
We need to additionally consider all the core siblings of the reserved
CPUs as unavailable when computing the free cpus, before to start the
actual allocation. Doing so, we fall back in the intended behavior, and
by construction all possible CPUs allocation whose number is multiple
of the SMT level are now correct again.
+++
[1] or thread siblings in the linux parlance, in any case:
hyperthread siblings of the same physical core
Signed-off-by: Francesco Romani <fromani@redhat.com>
Passing in a context instead of a stop channel has several advantages:
- ensures that client-go calls return as soon as the controller is asked to stop
- contextual logging can be used
By passing that context down to its own functions and checking it while
waiting, the lease controller also doesn't get stuck in backoffEnsureLease
anymore (https://github.com/kubernetes/kubernetes/issues/116196).
Update go-jose from v2.2.2 to v2.6.0.
This is to make the kubernetes code compatible with newer go-jose versions that have a small breaking change (`jwt.NewNumericDate()` returns a pointer).
Signed-off-by: Max Goltzsche <max.goltzsche@gmail.com>
This improves performance of the text formatting and ktesting.
Because ktesting no longer buffers messages by default, one unit
test needs to ask for that explicitly.