In the following code pattern, the log message will get logged with v=0 in JSON
output although conceptually it has a higher verbosity:
if klog.V(5).Enabled() {
klog.Info("hello world")
}
Having the actual verbosity in the JSON output is relevant, for example for
filtering out only the important info messages. The solution is to use
klog.V(5).Info or something similar.
Whether the outer if is necessary at all depends on how complex the parameters
are. The return value of klog.V can be captured in a variable and be used
multiple times to avoid the overhead for that function call and to avoid
repeating the verbosity level.
If a pod is killed (no longer wanted) and then a subsequent create/
add/update event is seen in the pod worker, assume that a pod UID
was reused (as it could be in static pods) and have the next
SyncKnownPods after the pod terminates remove the worker history so
that the config loop can restart the static pod, as well as return
to the caller the fact that this termination was not final.
The housekeeping loop then reconciles the desired state of the Kubelet
(pods in pod manager that are not in a terminal state, i.e. admitted
pods) with the pod worker by resubmitting those pods. This adds a
small amount of latency (2s) when a pod UID is reused and the pod
is terminated and restarted.
A pod that has been rejected by admission will have status manager
set the phase to Failed locally, which make take some time to
propagate to the apiserver. The rejected pod will be included in
admission until the apiserver propagates the change back, which
was an unintended regression when checking pod worker state as
authoritative.
A pod that is terminal in the API may still be consuming resources
on the system, so it should still be included in admission.
The configuration is deprecated and targets removal for v1.23. Tests
cases have been changed as well.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
* Change uses of whitelist to allowlist in kubelet sysctl
* Rename whitelist files to allowlist in Kubelet sysctl
* Further renames of whitelist to allowlist in Kubelet
* Rename podsecuritypolicy uses of whitelist to allowlist
* Update pkg/kubelet/kubelet.go
Co-authored-by: Danielle <dani@builds.terrible.systems>
Co-authored-by: Danielle <dani@builds.terrible.systems>
Prevent Kubelet from incorrectly interpreting "not yet started" pods as "ready to terminate pods" by unifying responsibility for pod lifecycle into pod worker
oomwatcher.NewWatcher returns "open /dev/kmsg: operation not permitted" error,
when running with sysctl value `kernel.dmesg_restrict=1`.
The error is negligible for KubeletInUserNamespace.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
A number of race conditions exist when pods are terminated early in
their lifecycle because components in the kubelet need to know "no
running containers" or "containers can't be started from now on" but
were relying on outdated state.
Only the pod worker knows whether containers are being started for
a given pod, which is required to know when a pod is "terminated"
(no running containers, none coming). Move that responsibility and
podKiller function into the pod workers, and have everything that
was killing the pod go into the UpdatePod loop. Split syncPod into
three phases - setup, terminate containers, and cleanup pod - and
have transitions between those methods be visible to other
components. After this change, to kill a pod you tell the pod worker
to UpdatePod({UpdateType: SyncPodKill, Pod: pod}).
Several places in the kubelet were incorrect about whether they
were handling terminating (should stop running, might have
containers) or terminated (no running containers) pods. The pod worker
exposes methods that allow other loops to know when to set up or tear
down resources based on the state of the pod - these methods remove
the possibility of race conditions by ensuring a single component is
responsible for knowing each pod's allowed state and other components
simply delegate to checking whether they are in the window by UID.
Removing containers now no longer blocks final pod deletion in the
API server and are handled as background cleanup. Node shutdown
no longer marks pods as failed as they can be restarted in the
next step.
See https://docs.google.com/document/d/1Pic5TPntdJnYfIpBeZndDelM-AbS4FN9H2GTLFhoJ04/edit# for details
This adds the gate `SeccompDefault` as new alpha feature. Seccomp path
and field fallbacks are now passed to the helper functions, whereas unit
tests covering those code paths have been added as well.
Beside enabling the feature gate, the feature has to be enabled by the
`SeccompDefault` kubelet configuration or its corresponding
`--seccomp-default` CLI flag.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Apply suggestions from code review
Co-authored-by: Paulo Gomes <pjbgf@linux.com>
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
FeatureGate acts as a secondary switch to disable cloud-controller loops
in KCM, Kubelet and KAPI.
Provide comprehensive logging information to users, so they will be
guided in adoption of out-of-tree cloud provider implementation.
GetNode() is called in a lot of places including a hot loop in
fastStatusUpdateOnce. Having a poll in it is delaying
the kubelet /readyz status=200 report.
If a client is available attempt to wait for the sync to happen,
before starting the list watch for pods at the apiserver.
If Containerd is used on Windows, then we can also mount individual
files into containers (e.g.: /etc/hosts), which was not possible with Docker.
Checks if the container runtime is containerd, and if it is, then also
mount /etc/hosts file (to C:\Windows\System32\drivers\etc\hosts).
Implements KEP 2000, Graceful Node Shutdown:
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2000-graceful-node-shutdown
* Add new FeatureGate `GracefulNodeShutdown` to control
enabling/disabling the feature
* Add two new KubeletConfiguration options
* `ShutdownGracePeriod` and `ShutdownGracePeriodCriticalPods`
* Add new package, `nodeshutdown` that implements the Node shutdown
manager
* The node shutdown manager uses the systemd inhibit package, to
create an system inhibitor, monitor for node shutdown events, and
gracefully terminate pods upon a node shutdown.
It covers deviceplugin & cpumanager.
It has drawback, since cpuset and all other structs including cadvisor's keep
cpu as int, but for protobuf based interface is better to have fixed
int.
This patch also introduces additional interface CPUsProvider, while
DeviceProvider might have been extended too.
Checkpoint not covered by unit test.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>
Discussion is ongoing about how to best handle dual-stack with clouds
and autodetected IPs, but there is at least agreement that people on
bare metal ought to be able to specify two explicit IPs on dual-stack
hosts, so allow that.
When the dual-stack feature gate is enabled, set up dual-stack
iptables rules. (When it is not enabled, preserve the traditional
behavior of setting up IPv4 rules only, unless the user explicitly
passed an IPv6 --node-ip.)
DockerLegacyService interface is used throughout `pkg/kubelet`.
It used to live in the `pkg/kubelet/dockershim` package. While we
would eventually like to remove it entirely, we need to give users some form
of warning.
By including the interface in
`pkg/kubelet/legacy/logs.go`, we ensure the interface is
available to `pkg/kubelet`, even when we are building with the `dockerless`
tag (i.e. not compiling the dockershim).
While the interface always exists, there will be no implementations of the
interface when building with the `dockerless` tag. The lack of
implementations should not be an issue, as we only expect `pkg/kubelet` code
to need an implementation of the `DockerLegacyService` when we are using
docker. If we are using docker, but building with the `dockerless` tag, than
this will be just one of many things that breaks.
`pkg/kubelet/legacy` might not be the best name for the package... I'm
very open to finding a different package name or even an already
existing package.
We can remove all uses of `dockershim` from `cmd/kubelet`, by just
passing the docker options to the kubelet in their pure form, instead of
using them to create a `dockerClientConfig` (which is defined in
dockershim). We can then construct the `dockerClientConfig` only when we
actually need it.
Extract a `runDockershim` function into a file outside of `kubelet.go`.
We can use build tags to compile two separate functions... one which
actually runs dockershim and one that is a no-op.
As far as I can tell, nothing uses this type. As a result, it doesn't
really provide any benefit, and just clutters `kubelet.go`.
There's also the risk of it falling out of date with `NewMainKubelet`,
as nothing enforces `NewMainKubelet` being of the `Builder` type.
After a pod reaches a terminal state and all containers are complete
we can delete the pod from the API server. The dispatchWork method
needs to wait for all container status to be available before invoking
delete. Even after the worker stops, status updates will continue to
be delivered and the sync handler will continue to sync the pods, so
dispatchWork gets multiple opportunities to see status.
The previous code assumed that a pod in Failed or Succeeded had no
running containers, but eviction or deletion of running pods could
still have running containers whose status needed to be reported.
This modifies earlier test to guarantee that the "fallback" exit
code 137 is never reported to match the expectation that all pods
exit with valid status for all containers (unless some exceptional
failure like eviction were to occur while the test is running).
The kubelet must not allow a container that was reported failed in a
restartPolicy=Never pod to be reported to the apiserver as success.
If a client deletes a restartPolicy=Never pod, the dispatchWork and
status manager race to update the container status. When dispatchWork
(specifically podIsTerminated) returns true, it means all containers
are stopped, which means status in the container is accurate. However,
the TerminatePod method then clears this status. This results in a
pod that has been reported with status.phase=Failed getting reset to
status.phase.Succeeded, which is a violation of the guarantees around
terminal phase.
Ensure the Kubelet never reports that a container succeeded when it
hasn't run or been executed by guarding the terminate pod loop from
ever reporting 0 in the absence of container status.
GetAllocateResourcesPodAdmitHandler(). It is named as such to reflect its
new function. Also remove the Topology Manager feature gate check at higher level
kubelet.go, as it is now done in GetAllocateResourcesPodAdmitHandler().
As of https://github.com/kubernetes/kubernetes/pull/72831, the minimum
docker version is 1.13.1. (and the minimum API version is 1.26). The
only time the `RuntimeAdmitHandler` returns anything other than accept
is when the Docker API version < 1.24. In other words, we can be
confident that Docker will always support sysctl.
As a result, we can delete this unnecessary and docker-specific code.
This diff contains a strict refactor; there are no behavioral changes.
Address a long standing TODO in `oom_watcher_linux_test.go` around test
coverage. We refactor our `oom.Watcher` so it takes in a struct
fulfulling the `streamer` interface (i.e. defines `StreamOoms` method).
In production, we will continue to use the `oomparser` from `cadvisor`.
However, for testing purposes, we can now create our own `fakeStreamer`,
and control how it streams `oomparser.OomInstance`. With this fake, we
can implement richer unit testing for the `oom.Watcher` itself.
Actually adding the additional unit tests will come in a later commit.
This patch removes pkg/util/mount completely, and replaces it with the
mount package now located at k8s.io/utils/mount. The code found at
k8s.io/utils/mount was moved there from pkg/util/mount, so the code is
identical, just no longer in-tree to k/k.
Kubelet and kube-proxy both had loops to ensure that their iptables
rules didn't get deleted, by repeatedly recreating them. But on
systems with lots of iptables rules (ie, thousands of services), this
can be very slow (and thus might end up holding the iptables lock for
several seconds, blocking other operations, etc).
The specific threat that they need to worry about is
firewall-management commands that flush *all* dynamic iptables rules.
So add a new iptables.Monitor() function that handles this by creating
iptables-flush canaries and only triggering a full rule reload after
noticing that someone has deleted those chains.
The firewalld monitoring code was not well tested (and not easily
testable), would never be triggered on most platforms, and was only
being taken advantage of from one place (kube-proxy), which didn't
need it anyway since it already has its own resync loop.
Since the firewalld monitoring was the only consumer of pkg/util/dbus,
we can also now delete that.
This patch moves the HostUtil functionality from the util/mount package
to the volume/util/hostutil package.
All `*NewHostUtil*` calls are changed to return concrete types instead
of interfaces.
All callers are changed to use the `*NewHostUtil*` methods instead of
directly instantiating the concrete types.
DesiredStateOfWorldPopulator should skip a volume that is not used in any
pod. "Used" means either mounted (via volumeMounts) or used as raw block
device (via volumeDevices).
Especially when block feature is disabled, a block volume must not get into
DesiredStateOfWorld, because it would be formatted and mounted there.