Subsequent SELinux work (see http://kep.k8s.io/1710) will need
ActualStateOfWorld populated around the time kubelet starts mounting
volumes.
Therefore reconstruct volumes before starting reconciler, but do not depend
on the desired state of world populated nor node.status - both need a
working API server, which may not be available at that time.
All reconstructed volumes are marked as Uncertain and reconciler will sort
them out - call SetUp to ensure the volume is really mounted when a pod
needs the volume or call TearDown then there is no such pod.
Finish the reconstruction when the API server becomes available:
- Clean up volumes that failed reconstruction and are not needed.
- Update devicePath of reconstructed volumes from node.status. Make sure
not to overwrite devicePath that may have been updated when the volume
was mounted by reconcile().
Hiding all this rework behind SELinuxMountReadWriteOncePod FeatureGate,
just to make sure we have a way back if this commit is buggy.
This patch adds new Kubelet option topologyManagerPolicyOptions.
To introduce new TopologyManager options, first we need to introduce new
flag called `topology-manager-policy-options` to allow users to modify
behaviour of best-effort and restricted policies.
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
CPUManager is going GA, thus it makes little sense
to keep the names of the internal configuration
variables `Experimental*`.
Trivial rename only.
Signed-off-by: Francesco Romani <fromani@redhat.com>
With graduation of device plugins to GA in 1.26, the feature gate is
enabled by default so `devicePluginEnabled` field no longer needs to
be passed at the time of Container Manager creation.
In addition to that, we remove the `ManagerStub` as it is no longer
needed.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
When a volume is already mounted with an unexpected SELinux label,
kubelet must unmount it first and then mount it back with the expected one.
Report an error to user, just in case the unmount takes too long.
In therory, this error should not happen too often, because two Pods with
different SELinux label will not enter Desired State of World, see
dsw.AddPodToVolume. It can happen when DSW and ASW SELinux labels only when
a volume has been deleted from DSW (= Pod was deleted) or a volume was
reconstructed after kubelet restart. In both cases, volume manager should
unmount the volume quickly.
In PodExistsInVolume with volumeObj.seLinuxMountContext != nil we know that
the volume has been previously mounted with a given SELinuxMountContext.
Either it has been mounted by this kubelet and we know it's correct or it
was by a previous instance of kubelet and the context has been
reconstructed from the filesystem. In both cases, the actual context is
correct, regardless if the volume plugin or PV access mode supports SELinux
mounts.
The Desired State of World can require a different SELinux mount context than
is in the Actual State of World and it's perfectly OK. For example when
user changes SELinux context of Pods or when the context is reconstructed
after kubelet restart.
Don't spam log and don't report errors to the user as event - reconciler
will do the right thing and unmount the old volume (with wrong context) and
mount a new one in the next reconciliation. It's not an error, it's
expected workflow.
In order to improve the observability of the cpumanager,
add and populate metrics to track if the combination of
the kubelet configuration and podspec would trigger
exclusive core allocation and pinning.
We should avoid leaking any node/machine specific information
(e.g. core ids, even though this is admittedly an extreme example);
tracking these metrics seems to be a good first step, because
it allows us to get feedback without exposing details.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Currently, there are some unit tests that are failing on Windows due to
various reasons:
- config options not supported on Windows.
- files not closed, which means that they cannot be removed / renamed.
- paths not properly joined (filepath.Join should be used).
- time.Now() is not as precise on Windows, which means that 2
consecutive calls may return the same timestamp.
- different error messages on Windows.
- files have \r\n line endings on Windows.
- /tmp directory being used, which might not exist on Windows. Instead,
the OS-specific Temp directory should be used.
- the default value for Kubelet's EvictionHard field was containing
OS-specific fields. This is now moved, the field is now set during
Kubelet's initialization, after the config file is read.
Currently, there are some unit tests that are failing on Windows due to
various reasons:
- paths not properly joined (filepath.Join should be used).
- Proxy Mode IPVS not supported on Windows.
- DeadlineExceeded can occur when trying to read data from an UDP
socket. This can be used to detect whether the port was closed or not.
- In Windows, with long file name support enabled, file names can have
up to 32,767 characters. In this case, the error
windows.ERROR_FILENAME_EXCED_RANGE will be encountered instead.
- files not closed, which means that they cannot be removed / renamed.
- time.Now() is not as precise on Windows, which means that 2
consecutive calls may return the same timestamp.
- path.Base() will return the same path. filepath.Base() should be used
instead.
- path.Join() will always join the paths with a / instead of the OS
specific separator. filepath.Join() should be used instead.
Align the behavior of HTTP-based lifecycle handlers and HTTP-based
probers, converging on the probers implementation. This fixes multiple
deficiencies in the current implementation of lifecycle handlers
surrounding what functionality is available.
The functionality is gated by the features.ConsistentHTTPGetHandlers feature gate.
Some of the unit tests cannot pass on Windows due to various reasons:
- fsnotify does not have a Windows implementation.
- Proxy Mode IPVS not supported on Windows.
- Seccomp not supported on Windows.
- VolumeMode=Block is not supported on Windows.
- iSCSI volumes are mounted differently on Windows, and iscsiadm is a
Linux utility.
There is a corner case when blocking Pod termination via a lifecycle
preStop hook, for example by using this StateFulSet:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: web
spec:
selector:
matchLabels:
app: ubi
serviceName: "ubi"
replicas: 1
template:
metadata:
labels:
app: ubi
spec:
terminationGracePeriodSeconds: 1000
containers:
- name: ubi
image: ubuntu:22.04
command: ['sh', '-c', 'echo The app is running! && sleep 360000']
ports:
- containerPort: 80
name: web
lifecycle:
preStop:
exec:
command:
- /bin/sh
- -c
- 'echo aaa; trap : TERM INT; sleep infinity & wait'
```
After creation, downscaling, forced deletion and upscaling of the
replica like this:
```
> kubectl apply -f sts.yml
> kubectl scale sts web --replicas=0
> kubectl delete pod web-0 --grace-period=0 --force
> kubectl scale sts web --replicas=1
```
We will end up having two pods running by the container runtime, while
the API only reports one:
```
> kubectl get pods
NAME READY STATUS RESTARTS AGE
web-0 1/1 Running 0 92s
```
```
> sudo crictl pods
POD ID CREATED STATE NAME NAMESPACE ATTEMPT RUNTIME
e05bb7dbb7e44 12 minutes ago Ready web-0 default 0 (default)
d90088614c73b 12 minutes ago Ready web-0 default 0 (default)
```
When now running `kubectl exec -it web-0 -- ps -ef`, there is a random chance that we hit the wrong
container reporting the lifecycle command `/bin/sh -c echo aaa; trap : TERM INT; sleep infinity & wait`.
This is caused by the container lookup via its name (and no podUID) at:
02109414e8/pkg/kubelet/kubelet_pods.go (L1905-L1914)
And more specifiy by the conversion of the pod result map to a slice in `GetPods`:
02109414e8/pkg/kubelet/kuberuntime/kuberuntime_manager.go (L407-L411)
We now solve that unexpected behavior by tracking the creation time of
the pod and sorting the result based on that. This will cause to always
match the most recently created pod.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Mark remotecommand.Executor as deprecated and related modifications.
Handle crash when streamer.stream panics
Add a test to verify if stream is closed after connection being closed
Remove blank line and update waiting time to 1s to avoid test flakes in CI.
Refine the tests of StreamExecutor according to comments.
Remove the comment of context controlling the negotiation progress and misc.
Signed-off-by: arkbriar <arkbriar@gmail.com>
Because of a bug in the commit 1e7bb20c52,
podresources metrics were added, they are updated in the right
places, but they are never exported, so they cannot be consumed.
Fix trivially registering the metrics.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Several reports exist (both with device plugins and CSI) that
kubelet w/ grpc-go sends invalid Authority header and some non
grpc-go servers reject these unix domain socket client connections.
grpc-go sets the Authority header correct when the dial address
is in a format where the its address scheme can be determined.
Instead of making changes to get the all server addresses to unix://
prefixed format, set grpc.WithAuthority("localhost") client connection
override to get the same result.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Device Plugins that wish to leverage the Topology Manager can send back a populated
TopologyInfo struct as part of the device registration, along with the device IDs
and the health of the device. TopologyInfo is converted to TopologyHints and
used by TopologyManager to find the optimal/desired resource allocation for a Pod.
If a plugin sends an empty but non-nil instance of TopologyInfo for a resource,
devicemanager passes it on as an empty instance of TopologyHint which is
currently interpreted as "Hint Provider has no possible NUMA affinities
for resource" which further means that pods requesting that resource will fail.
To not block device resources that pass TopologyInfo{Nodes:[]*NUMANode{}} from being
used, interprete that as nil set of hints and not a []TopologyHint{}.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
cpu.cfs_period_us is measured in microseconds in the kernel but
provided in time.Duration by the user, that change clarifies the code
to make this evident to the reader.
Also, the minimum value for that feature is 1ms and not 1μs, and this
change alters the validation to reject values smaller than 1ms.
Track how long it takes for pod updates to propagate from detection
to successful change on API server. Will guide future improvements
in pod start and shutdown latency.
Metric is `kubelet_pod_status_sync_duration_seconds` and is ALPHA
stability. Histogram buckets are chosen based on distribution of
observed status delays in practice.
If you run "kubelet --cloud-provider X --node-ip Y", kubelet will set
an annotation on the node, but previously, if you then ran just
"kubelet --cloud-provider X" (or just "kubelet --node-ip Y"), it
wouldn't delete the stale annotation. Fix that.
* Adapt https://github.com/kubernetes/kubernetes/pull/109441 but
ensures that `search .` does not get propagated into containers'
/etc/resolv.conf. There is no reason to put `.` in a container's
search field and it causes issues for musl
These host paths have a well known location under /tmp/hostpath_pv
and are therefore safe to be labeled with the shared SELinux label.
Without this label, the mounted volumes cannot be accessed by the
container processes.
Signed-off-by: Mrunal Patel <mpatel@redhat.com>
drop bitArray implementation and use the bitmap implementation from
k8s.io/kubernetes/pkg/registry/core/service/allocator.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
and by doing so, fix a bug where the stats providers report a directory is not found after a pod's storage is removed
Signed-off-by: Peter Hunt <pehunt@redhat.com>
cpu.cfs_period_us is 100μs by default despite having an "ms" unit
for some unfortunate reason. Documentation:
https://www.kernel.org/doc/html/latest/scheduler/sched-bwc.html#management
The desired effect of that change is to match
k8s default `CPUCFSQuotaPeriod` value (100ms before that change)
with one used in k8s without the `CustomCPUCFSQuotaPeriod` flag enabled
and Linux CFS (100us, 1000x smaller than 100ms).
github.com/opencontainers/selinux/go-selinux needs OS that supports SELinux
and SELinux enabled in it to return useful data, therefore add an interface
in front of it, so we can mock its behavior in unit tests.
In future commits we will need this to set the user/group of supported
volumes of KEP 127 - Phase 1.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
it is used to allocate and keep track of the unique users ranges
assigned to each pod that runs in a user namespace.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Co-authored-by: Rodrigo Campos <rodrigoca@microsoft.com>
This change is to promote local storage capacity isolation feature to GA
At the same time, to allow rootless system disable this feature due to
unable to get root fs, this change introduced a new kubelet config
"localStorageCapacityIsolation". By default it is set to true. For
rootless systems, they can set this configuration to false to disable
the feature. Once it is set, user cannot set ephemeral-storage
request/limit because capacity and allocatable will not be set.
Change-Id: I48a52e737c6a09e9131454db6ad31247b56c000a
- PreemptionByKubeScheduler (Pod preempted by kube-scheduler)
- DeletionByTaintManager (Pod deleted by taint manager due to NoExecute taint)
- EvictionByEvictionAPI (Pod evicted by Eviction API)
- DeletionByPodGC (an orphaned Pod deleted by PodGC)PreemptedByScheduler (Pod preempted by kube-scheduler)
cpu.cfs_period_us is 100μs by default despite having an "ms" unit
for some unfortunate reason. Documentation:
https://www.kernel.org/doc/html/latest/scheduler/sched-bwc.html#management
The desired effect of that change is to match
k8s default `CPUCFSQuotaPeriod` value (100ms before that change)
with one used in k8s without the `CustomCPUCFSQuotaPeriod` flag enabled
and Linux CFS (100us, 1000x smaller than 100ms).
We now partly drop the support for seccomp annotations which is planned
for v1.25 as part of the KEP:
https://github.com/kubernetes/enhancements/issues/135
Pod security policies are not touched by this change and therefore we
have to keep the annotation key constants.
This means we only allow the usage of the annotations for backwards
compatibility reasons while the synchronization of the field to
annotation is no longer supported. Using the annotations for static pods
is also not supported any more.
Making the annotations fully non-functional will be deferred to a
future release.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
cpu.cfs_period_us is 100μs by default despite having an "ms" unit
for some unfortunate reason. Documentation:
https://www.kernel.org/doc/html/latest/scheduler/sched-bwc.html#management
The desired effect of that change is more clarity on the default value
so users would be aware that the 10ms custom value would be
not 0.1x of the default, but 100x of it.