If a pod is killed (no longer wanted) and then a subsequent create/
add/update event is seen in the pod worker, assume that a pod UID
was reused (as it can be for static pods) and have the next
SyncKnownPods after the pod terminates remove the worker history so
that the config loop can restart the static pod, and report back to
the caller that this termination was not final.
The housekeeping loop then reconciles the desired state of the Kubelet
(pods in pod manager that are not in a terminal state, i.e. admitted
pods) with the pod worker by resubmitting those pods. This adds a
small amount of latency (2s) when a pod UID is reused and the pod
is terminated and restarted.
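A minimal sketch of that reconciliation, using hypothetical stand-in types (Pod, PodWorkers, houseKeeping) rather than the kubelet's real ones: any admitted, non-terminal pod that the worker no longer knows about is resubmitted via UpdatePod.

```go
package main

import "fmt"

type Pod struct {
	UID      string
	Name     string
	Terminal bool
}

type PodWorkers struct {
	known map[string]bool // UIDs the worker currently tracks
}

// IsPodKnown reports whether the worker still has state for the UID.
func (w *PodWorkers) IsPodKnown(uid string) bool { return w.known[uid] }

// UpdatePod (re)submits a pod to the worker loop.
func (w *PodWorkers) UpdatePod(p Pod) {
	w.known[p.UID] = true
	fmt.Printf("resubmitted pod %s (%s)\n", p.Name, p.UID)
}

// houseKeeping resubmits every admitted, non-terminal pod that the
// worker has forgotten (e.g. after a UID was reused and the old
// worker history was dropped by SyncKnownPods).
func houseKeeping(desired []Pod, workers *PodWorkers) {
	for _, p := range desired {
		if p.Terminal {
			continue // terminal pods are not part of desired state
		}
		if !workers.IsPodKnown(p.UID) {
			workers.UpdatePod(p)
		}
	}
}

func main() {
	workers := &PodWorkers{known: map[string]bool{}}
	houseKeeping([]Pod{{UID: "uid-1", Name: "static-pod"}}, workers)
}
```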
A pod that has been rejected by admission will have status manager
set the phase to Failed locally, which may take some time to
propagate to the apiserver. The rejected pod will be included in
admission until the apiserver propagates the change back, which
was an unintended regression when checking pod worker state as
authoritative.
A pod that is terminal in the API may still be consuming resources
on the system, so it should still be included in admission.
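A rough sketch of the admission filter this implies, with illustrative names: a pod is dropped from resource accounting only once the pod worker reports it terminated, not merely because its API phase is terminal.

```go
package main

import "fmt"

type Pod struct {
	UID   string
	Phase string // "Running", "Failed", "Succeeded", ...
}

// podWorkerTerminated is a stand-in for the pod worker's view of
// whether all containers have stopped and resources are released.
func podWorkerTerminated(uid string, terminated map[string]bool) bool {
	return terminated[uid]
}

// podsForAdmission returns the pods that must still be counted when
// admitting a new pod.
func podsForAdmission(pods []Pod, terminated map[string]bool) []Pod {
	var counted []Pod
	for _, p := range pods {
		if podWorkerTerminated(p.UID, terminated) {
			continue // no containers, no resources held
		}
		counted = append(counted, p) // includes API-terminal pods still consuming resources
	}
	return counted
}

func main() {
	pods := []Pod{{UID: "a", Phase: "Running"}, {UID: "b", Phase: "Failed"}}
	fmt.Println(len(podsForAdmission(pods, map[string]bool{"b": false})))
}
```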
The configuration is deprecated and targeted for removal in v1.23. Test
cases have been changed as well.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
* Change uses of whitelist to allowlist in kubelet sysctl
* Rename whitelist files to allowlist in Kubelet sysctl
* Further renames of whitelist to allowlist in Kubelet
* Rename podsecuritypolicy uses of whitelist to allowlist
* Update pkg/kubelet/kubelet.go
Co-authored-by: Danielle <dani@builds.terrible.systems>
Prevent the Kubelet from incorrectly interpreting "not yet started" pods as "ready to terminate" by unifying responsibility for the pod lifecycle in the pod worker
oomwatcher.NewWatcher returns an "open /dev/kmsg: operation not permitted" error
when running with the sysctl value `kernel.dmesg_restrict=1`.
The error is negligible for KubeletInUserNamespace.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
A number of race conditions exist when pods are terminated early in
their lifecycle because components in the kubelet need to know "no
running containers" or "containers can't be started from now on" but
were relying on outdated state.
Only the pod worker knows whether containers are being started for
a given pod, which is required to know when a pod is "terminated"
(no running containers, none coming). Move that responsibility and
podKiller function into the pod workers, and have everything that
was killing the pod go into the UpdatePod loop. Split syncPod into
three phases - setup, terminate containers, and cleanup pod - and
have transitions between those methods be visible to other
components. After this change, to kill a pod you tell the pod worker
to UpdatePod({UpdateType: SyncPodKill, Pod: pod}).
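Roughly, the calling convention looks like the following sketch; the types are heavily simplified stand-ins for the real UpdatePodOptions and pod worker.

```go
package main

import "fmt"

type UpdateType int

const (
	SyncPodCreate UpdateType = iota
	SyncPodUpdate
	SyncPodSync
	SyncPodKill
)

type Pod struct{ Name string }

type UpdatePodOptions struct {
	UpdateType UpdateType
	Pod        *Pod
}

type podWorkers struct{}

// UpdatePod is the single entry point; a kill is just another update.
func (w *podWorkers) UpdatePod(opts UpdatePodOptions) {
	if opts.UpdateType == SyncPodKill {
		fmt.Printf("begin terminating %s\n", opts.Pod.Name)
	}
}

func main() {
	w := &podWorkers{}
	pod := &Pod{Name: "example"}
	// To kill a pod, tell the pod worker:
	w.UpdatePod(UpdatePodOptions{UpdateType: SyncPodKill, Pod: pod})
}
```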
Several places in the kubelet were incorrect about whether they
were handling terminating (should stop running, might have
containers) or terminated (no running containers) pods. The pod worker
exposes methods that allow other loops to know when to set up or tear
down resources based on the state of the pod - these methods remove
the possibility of race conditions by ensuring a single component is
responsible for knowing each pod's allowed state and other components
simply delegate to checking whether they are in the window by UID.
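A sketch of that "window by UID" pattern with illustrative method names (the real pod worker interface differs in detail):

```go
package main

import "fmt"

type UID string

type podLifecycle struct {
	startedContainers bool
	terminated        bool
}

type podWorkers struct {
	pods map[UID]podLifecycle
}

// CouldHaveRunningContainers reports whether any container for the pod
// may still be running; other loops use this to decide on teardown.
func (w *podWorkers) CouldHaveRunningContainers(uid UID) bool {
	p, ok := w.pods[uid]
	return ok && p.startedContainers && !p.terminated
}

// ShouldPodContentBeRemoved reports whether per-pod resources (volumes,
// cgroups, directories) can be safely removed.
func (w *podWorkers) ShouldPodContentBeRemoved(uid UID) bool {
	p, ok := w.pods[uid]
	return !ok || p.terminated
}

func main() {
	w := &podWorkers{pods: map[UID]podLifecycle{"u1": {startedContainers: true}}}
	fmt.Println(w.CouldHaveRunningContainers("u1"), w.ShouldPodContentBeRemoved("u1"))
}
```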
Removing containers no longer blocks final pod deletion in the
API server and is handled as background cleanup. Node shutdown
no longer marks pods as failed, as they can be restarted in the
next step.
See https://docs.google.com/document/d/1Pic5TPntdJnYfIpBeZndDelM-AbS4FN9H2GTLFhoJ04/edit# for details
This adds the gate `SeccompDefault` as a new alpha feature. Seccomp path
and field fallbacks are now passed to the helper functions, and unit
tests covering those code paths have been added as well.
Besides enabling the feature gate, the feature has to be enabled by the
`SeccompDefault` kubelet configuration option or its corresponding
`--seccomp-default` CLI flag.
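A minimal sketch of the fallback behavior this implies (illustrative helper, not the kubelet's actual code): an explicit container or pod profile wins, otherwise the runtime default profile is used when the feature is enabled.

```go
package main

import "fmt"

const (
	profileRuntimeDefault = "RuntimeDefault"
	profileUnconfined     = "Unconfined"
)

// effectiveSeccompProfile picks the profile for a container.
func effectiveSeccompProfile(containerProfile, podProfile string, seccompDefaultEnabled bool) string {
	if containerProfile != "" {
		return containerProfile // explicit container-level profile wins
	}
	if podProfile != "" {
		return podProfile // then the pod-level profile
	}
	if seccompDefaultEnabled {
		return profileRuntimeDefault // feature-gated fallback
	}
	return profileUnconfined // legacy default
}

func main() {
	fmt.Println(effectiveSeccompProfile("", "", true)) // RuntimeDefault
}
```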
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Apply suggestions from code review
Co-authored-by: Paulo Gomes <pjbgf@linux.com>
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
The feature gate acts as a secondary switch to disable cloud-controller loops
in KCM, the Kubelet, and KAPI.
Provide comprehensive logging information to users, so they are
guided in adopting an out-of-tree cloud provider implementation.
GetNode() is called in a lot of places, including a hot loop in
fastStatusUpdateOnce. Having a poll in it delays
the kubelet's /readyz status=200 report.
If a client is available, attempt to wait for the sync to happen
before starting the list-watch for pods at the apiserver.
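A sketch of that wait, assuming a synced-predicate style check; the helper name and timeout below are illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// waitForNodeSync polls the synced predicate for up to timeout and
// returns whether the node informer caught up in time.
func waitForNodeSync(synced func() bool, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if synced() {
			return true
		}
		time.Sleep(100 * time.Millisecond)
	}
	return synced()
}

func main() {
	start := time.Now()
	ok := waitForNodeSync(func() bool { return time.Since(start) > 300*time.Millisecond }, time.Second)
	fmt.Println("node informer synced:", ok)
	// only now start the list-watch for pods at the apiserver
}
```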
If containerd is used on Windows, we can also mount individual
files into containers (e.g. /etc/hosts), which was not possible with Docker.
This checks whether the container runtime is containerd and, if so, also
mounts the /etc/hosts file (to C:\Windows\System32\drivers\etc\hosts).
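A sketch of that check with simplified stand-in types; only the containerd case returns the extra single-file mount, since Docker on Windows cannot mount individual files.

```go
package main

import "fmt"

type Mount struct {
	HostPath      string
	ContainerPath string
}

// hostsFileMounts returns the extra mount only when the runtime is
// containerd; managedHostsPath is the kubelet-managed hosts file.
func hostsFileMounts(runtimeName, managedHostsPath string) []Mount {
	if runtimeName != "containerd" {
		return nil
	}
	return []Mount{{
		HostPath:      managedHostsPath,
		ContainerPath: `C:\Windows\System32\drivers\etc\hosts`,
	}}
}

func main() {
	fmt.Println(hostsFileMounts("containerd", "etc-hosts"))
	fmt.Println(hostsFileMounts("docker", "etc-hosts")) // []
}
```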
Implements KEP 2000, Graceful Node Shutdown:
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2000-graceful-node-shutdown
* Add new FeatureGate `GracefulNodeShutdown` to control
enabling/disabling the feature
* Add two new KubeletConfiguration options
* `ShutdownGracePeriod` and `ShutdownGracePeriodCriticalPods`
* Add new package, `nodeshutdown` that implements the Node shutdown
manager
* The node shutdown manager uses the systemd inhibit package to
create a systemd inhibitor, monitor for node shutdown events, and
gracefully terminate pods upon a node shutdown; a sketch follows below.
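A high-level sketch of the manager's flow; the inhibitor interface is a stand-in for the systemd logind inhibit D-Bus API, and all names are illustrative.

```go
package main

import (
	"fmt"
	"time"
)

type inhibitor interface {
	// InhibitShutdown takes a delay lock so logind waits before
	// powering off, up to the configured grace period.
	InhibitShutdown() error
	// MonitorShutdown delivers true when a shutdown starts.
	MonitorShutdown() <-chan bool
}

type manager struct {
	inhibit                         inhibitor
	shutdownGracePeriod             time.Duration
	shutdownGracePeriodCriticalPods time.Duration
	killPods                        func(gracePeriod time.Duration, critical bool)
}

func (m *manager) run() error {
	if err := m.inhibit.InhibitShutdown(); err != nil {
		return err
	}
	go func() {
		for started := range m.inhibit.MonitorShutdown() {
			if !started {
				continue
			}
			// Regular pods get the portion of the grace period not
			// reserved for critical pods; critical pods are stopped last.
			m.killPods(m.shutdownGracePeriod-m.shutdownGracePeriodCriticalPods, false)
			m.killPods(m.shutdownGracePeriodCriticalPods, true)
		}
	}()
	return nil
}

func main() {
	fmt.Println("sketch only; see the nodeshutdown package for the real manager")
}
```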
It covers deviceplugin & cpumanager.
One drawback is that cpuset and all the other structs, including cadvisor's,
keep the CPU as int, while a fixed-size integer is better for a
protobuf-based interface.
This patch also introduces the additional interface CPUsProvider, although
DeviceProvider could have been extended instead.
The checkpoint is not covered by a unit test.
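A sketch of the provider split, with method shapes approximated from the description above (GetCPUs returns int64 as the fixed-size integer for protobuf):

```go
package main

import "fmt"

// DevicesProvider is the existing source of per-container device
// assignments (simplified here to strings).
type DevicesProvider interface {
	GetDevices(podUID, containerName string) []string
}

// CPUsProvider exposes the CPU IDs assigned by the cpumanager,
// converted from cpuset's int to int64 for the wire format.
type CPUsProvider interface {
	GetCPUs(podUID, containerName string) []int64
}

// toFixedWidth converts cpuset-style ints to the wire representation.
func toFixedWidth(cpus []int) []int64 {
	out := make([]int64, 0, len(cpus))
	for _, c := range cpus {
		out = append(out, int64(c))
	}
	return out
}

func main() {
	fmt.Println(toFixedWidth([]int{0, 1, 4, 5}))
}
```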
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>
Discussion is ongoing about how to best handle dual-stack with clouds
and autodetected IPs, but there is at least agreement that people on
bare metal ought to be able to specify two explicit IPs on dual-stack
hosts, so allow that.
When the dual-stack feature gate is enabled, set up dual-stack
iptables rules. (When it is not enabled, preserve the traditional
behavior of setting up IPv4 rules only, unless the user explicitly
passed an IPv6 --node-ip.)
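A sketch of the bare-metal parsing this enables, assuming --node-ip may carry two comma-separated addresses of different families; the validation below is illustrative rather than the kubelet's exact code.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// parseNodeIPs accepts "ip" or "ipv4,ipv6" (in either order) and
// rejects two addresses of the same family.
func parseNodeIPs(flag string) ([]net.IP, error) {
	var ips []net.IP
	for _, s := range strings.Split(flag, ",") {
		ip := net.ParseIP(strings.TrimSpace(s))
		if ip == nil {
			return nil, fmt.Errorf("invalid IP %q", s)
		}
		ips = append(ips, ip)
	}
	if len(ips) > 2 {
		return nil, fmt.Errorf("at most two addresses are allowed")
	}
	if len(ips) == 2 && (ips[0].To4() != nil) == (ips[1].To4() != nil) {
		return nil, fmt.Errorf("dual-stack --node-ip needs one IPv4 and one IPv6 address")
	}
	return ips, nil
}

func main() {
	ips, err := parseNodeIPs("192.0.2.10, 2001:db8::10")
	fmt.Println(ips, err)
}
```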
The DockerLegacyService interface is used throughout `pkg/kubelet`.
It used to live in the `pkg/kubelet/dockershim` package. While we
would eventually like to remove it entirely, we need to give users some form
of warning.
By including the interface in
`pkg/kubelet/legacy/logs.go`, we ensure the interface is
available to `pkg/kubelet`, even when we are building with the `dockerless`
tag (i.e. not compiling the dockershim).
While the interface always exists, there will be no implementations of the
interface when building with the `dockerless` tag. The lack of
implementations should not be an issue, as we only expect `pkg/kubelet` code
to need an implementation of the `DockerLegacyService` when we are using
docker. If we are using docker but building with the `dockerless` tag, then
this will be just one of many things that break.
`pkg/kubelet/legacy` might not be the best name for the package... I'm
very open to finding a different package name or even an already
existing package.
We can remove all uses of `dockershim` from `cmd/kubelet`, by just
passing the docker options to the kubelet in their pure form, instead of
using them to create a `dockerClientConfig` (which is defined in
dockershim). We can then construct the `dockerClientConfig` only when we
actually need it.
Extract a `runDockershim` function into a file outside of `kubelet.go`.
We can use build tags to compile two separate functions... one which
actually runs dockershim and one that is a no-op.
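A sketch of that split; the file names and the options type are illustrative, and only one runDockershim is compiled in depending on the tag.

```go
// dockershim_backed.go — compiled by default (no "dockerless" tag):

//go:build !dockerless

package app

// dockerOptions carries the raw docker flags passed through from
// cmd/kubelet instead of a pre-built dockerClientConfig.
type dockerOptions struct {
	Endpoint string
}

// runDockershim constructs the dockerClientConfig only here, where it
// is actually needed, and would start the shim.
func runDockershim(opts *dockerOptions) error {
	_ = opts // build dockerClientConfig and start dockershim here
	return nil
}

// dockershim_nodocker.go — compiled with -tags dockerless:
//
//	//go:build dockerless
//
//	func runDockershim(opts *dockerOptions) error {
//		return nil // no-op: dockershim is not part of this build
//	}
```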
As far as I can tell, nothing uses this type. As a result, it doesn't
really provide any benefit, and just clutters `kubelet.go`.
There's also the risk of it falling out of date with `NewMainKubelet`,
as nothing enforces `NewMainKubelet` being of the `Builder` type.
After a pod reaches a terminal state and all containers are complete
we can delete the pod from the API server. The dispatchWork method
needs to wait for all container statuses to be available before invoking
delete. Even after the worker stops, status updates will continue to
be delivered and the sync handler will continue to sync the pods, so
dispatchWork gets multiple opportunities to see status.
The previous code assumed that a pod in Failed or Succeeded had no
running containers, but eviction or deletion of running pods could
still have running containers whose status needed to be reported.
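A sketch of that rule with a simplified status type and an illustrative function name: deletion is allowed only once every container has reported a terminal status.

```go
package main

import "fmt"

type ContainerStatus struct {
	Name       string
	Terminated bool
	ExitCode   int
}

// okToDeleteFromAPI returns true only when every container has reported
// a terminal status, so eviction or deletion of a running pod still
// surfaces real exit codes instead of a fallback value.
func okToDeleteFromAPI(phaseTerminal bool, statuses []ContainerStatus) bool {
	if !phaseTerminal {
		return false
	}
	for _, s := range statuses {
		if !s.Terminated {
			return false // dispatchWork will get another chance later
		}
	}
	return true
}

func main() {
	statuses := []ContainerStatus{{Name: "app", Terminated: true, ExitCode: 0}}
	fmt.Println(okToDeleteFromAPI(true, statuses))
}
```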
This modifies the earlier test to guarantee that the "fallback" exit
code 137 is never reported, matching the expectation that all pods
exit with a valid status for all containers (unless some exceptional
failure like eviction occurs while the test is running).