Clarify the error message returned when trying to use the docker runtime
on a Kubelet that was compiled without Docker.
We remove the "w/" and "w/o" abbreviations, which can be confusing, and
add slightly more detail about the actual error.
We do not want to run the `cadvisor/docker` init when we are using the
dockerless build tags. We can ensure this by isolating into a separate
file with the proper build tag constraints.
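For example, the docker-specific cAdvisor plumbing can live in a file that is
only compiled when the `dockerless` tag is absent; the file name and import
below are illustrative, not the exact cAdvisor wiring:
```go
// cadvisor_docker.go — skipped when building with -tags=dockerless
// +build !dockerless

package cadvisor

import (
	// Importing this package for its init() side effects registers
	// cAdvisor's docker container handler. With the dockerless tag, this
	// file (and the init it triggers) is simply not compiled.
	_ "github.com/google/cadvisor/container/docker/install"
)
```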
As the final step, add the `dockerless` build tag to all files in the
dockershim. Using `-tags=dockerless` in `go build`, we can compile the
Kubelet without the dockershim.
Once cadvisor no longer depends on `docker/docker`, compiling with
`-tags=dockerless` should be sufficient to compile the Kubelet without a
dependency on `docker/docker`.
DockerLegacyService interface is used throughout `pkg/kubelet`.
It used to live in the `pkg/kubelet/dockershim` package. While we
would eventually like to remove it entirely, we need to give users some form
of warning.
By including the interface in
`pkg/kubelet/legacy/logs.go`, we ensure the interface is
available to `pkg/kubelet`, even when we are building with the `dockerless`
tag (i.e. not compiling the dockershim).
While the interface always exists, there will be no implementations of the
interface when building with the `dockerless` tag. The lack of
implementations should not be an issue, as we only expect `pkg/kubelet` code
to need an implementation of the `DockerLegacyService` when we are using
docker. If we are using docker but building with the `dockerless` tag, then
this will be just one of many things that break.
`pkg/kubelet/legacy` might not be the best name for the package... I'm
very open to finding a different package name, or even to moving this into
an existing package.
We can remove all uses of `dockershim` from `cmd/kubelet` by passing the
docker options to the kubelet in their pure form, instead of
using them to create a `dockerClientConfig` (which is defined in
dockershim). We can then construct the `dockerClientConfig` only when we
actually need it.
Extract a `runDockershim` function into a file outside of `kubelet.go`.
We can use build tags to compile two separate functions... one which
actually runs dockershim and one that is a no-op.
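A rough sketch of the two-file pattern, with hypothetical file names and a
simplified signature (the real `runDockershim` takes the kubelet
configuration, docker options, and CRI socket details):
```go
// dockershim.go — compiled by default
// +build !dockerless

package app

// runDockershim starts the in-process dockershim so the Kubelet can talk to
// the docker runtime over CRI. (Signature simplified for illustration.)
func runDockershim(dockerEndpoint, remoteRuntimeEndpoint string) error {
	// ... build the docker client config, create the dockershim CRI
	// service, and start its gRPC server ...
	return nil
}
```
```go
// dockershim_nodocker.go — compiled with -tags=dockerless
// +build dockerless

package app

// With the dockerless tag, runDockershim compiles to a no-op, so nothing in
// this package pulls in the dockershim (or docker/docker) dependency tree.
func runDockershim(dockerEndpoint, remoteRuntimeEndpoint string) error {
	return nil
}
```
Building with `go build -tags=dockerless` then selects the stub, so
`cmd/kubelet` no longer compiles in the dockershim.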
Remove one of two uses of Dockershim in `cmd/kubelet`. The other is for
creating a docker client which we pass to the Kubelet... we will handle
that refactor in a separate diff.
I'm fairly confident, though need to double check, that no one is
actually using this experimental dockershim behavior. If they are, I
think we will want to find a new way to support it (that doesn't require
using the Kubelet only to launch Dockershim).
Following the changes in #87730, the Kubelet now uses hcsshim directly to
gather stats. However, unlike the `docker stats` API that was used before,
hcsshim does not keep information about exited containers.
When the Kubelet lists containers (`docker_container.go:ListContainers()`),
it sets `All: true`, retrieving non-running containers as well.
When docker stats is called with such a container ID, it returns valid JSON
with all values set to 0, and the non-running containers are filtered out
later in the process.
When hcsshim is called with such a container ID, it returns an error,
effectively stopping the stats retrieval for all containers.
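A hedged sketch of one way to avoid the failure: tolerate a per-container
error instead of aborting the whole collection. The names here
(`collectStats`, `getStats`, `Stats`) are placeholders, not the real kubelet
or hcsshim API:
```go
import "errors"

// Stats is a trimmed-down stand-in for the per-container stats object.
type Stats struct {
	MemoryUsageBytes uint64
}

// getStats is a placeholder for the hcsshim-backed stats call; it fails for
// exited containers, just like hcsshim does.
func getStats(id string) (*Stats, error) {
	if id == "exited" {
		return nil, errors.New("container is not running")
	}
	return &Stats{MemoryUsageBytes: 1024}, nil
}

// collectStats gathers stats for every container but tolerates per-container
// failures, so one exited container cannot abort the whole pass.
func collectStats(ids []string) map[string]*Stats {
	out := make(map[string]*Stats, len(ids))
	for _, id := range ids {
		s, err := getStats(id)
		if err != nil {
			continue // skip containers hcsshim knows nothing about
		}
		out[id] = s
	}
	return out
}
```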
YAML supports comments, so we can explain why we have certain rules or
certain prefixes.
For those files that weren't already commented YAML, I converted them to
YAML and took a best guess at comments based on the PRs that introduced
or updated them.
The expectation is that exclusive CPU allocations happen at pod
creation time. When a container restarts, it should not have its
exclusive CPU allocations removed, and it should not need to
re-allocate CPUs.
There are a few places in the current code that look for containers
that have exited and call CpuManager.RemoveContainer() to clean up
the container. This will end up deleting any exclusive CPU
allocations for that container, and if the container restarts within
the same pod it will end up using the default cpuset rather than
what should be exclusive CPUs.
Removing those calls and adding resource cleanup at allocation
time should get rid of the problem.
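A rough sketch of allocation-time cleanup, using the real `cpuset` package
but with a simplified state layout and a hypothetical helper name (the actual
CpuManager state API differs):
```go
import (
	"k8s.io/kubernetes/pkg/kubelet/cm/cpuset"
)

// removeStaleState releases exclusive CPU assignments for containers that no
// longer belong to any active pod. A container that is merely restarting is
// still listed under its pod, so it keeps its exclusive CPUs.
func removeStaleState(
	assignments map[string]map[string]cpuset.CPUSet, // podUID -> container -> CPUs
	active map[string]map[string]bool, // podUID -> container -> still present
	defaultSet cpuset.CPUSet,
) cpuset.CPUSet {
	for podUID, containers := range assignments {
		for name, cpus := range containers {
			if active[podUID][name] {
				continue // restarted container keeps its exclusive CPUs
			}
			// Genuinely removed container: return its CPUs to the
			// shared (default) set.
			defaultSet = defaultSet.Union(cpus)
			delete(containers, name)
		}
		if len(containers) == 0 {
			delete(assignments, podUID)
		}
	}
	return defaultSet
}
```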
Signed-off-by: Chris Friesen <chris.friesen@windriver.com>
With the old strategy, it was possible for an init container to end up
running without some of its CPUs being exclusive if it requested more
guaranteed CPUs than the sum of all guaranteed CPUs requested by app
containers. Unfortunately, this case was not caught by our unit tests
because they didn't validate the state of the defaultCPUSet to ensure
there was no overlap with CPUs assigned to containers. This patch
updates the strategy to reuse the CPUs assigned to init containers
for app containers, while avoiding this edge case. It also
updates the unit tests to now catch this type of error in the future.
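A sketch of the reuse bookkeeping this implies, with simplified types and a
hypothetical `update` helper; the real static policy keeps an equivalent
per-pod set and consults it when allocating:
```go
import (
	"k8s.io/kubernetes/pkg/kubelet/cm/cpuset"
)

// cpusToReuse tracks, per pod, exclusive CPUs handed to init containers that
// may be reused by the app containers that run after them.
type cpusToReuse map[string]cpuset.CPUSet // podUID -> reusable CPUs

func (r cpusToReuse) update(podUID string, isInitContainer bool, assigned cpuset.CPUSet) {
	if isInitContainer {
		// Init containers run to completion before app containers start,
		// so their exclusive CPUs can be handed on within the same pod.
		r[podUID] = r[podUID].Union(assigned)
		return
	}
	// Once an app container takes CPUs, remove them from the reusable set so
	// they can never end up both exclusively assigned and in the default set.
	r[podUID] = r[podUID].Difference(assigned)
}
```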
The cpumanager file-based state backend has been obsolete for a few
releases, since the cpumanager moved to the common checkpointmanager
infrastructure.
The old test checking compatibility to/from the old format is
also no longer needed, because the checkpoint format is stable
(see
https://github.com/kubernetes/kubernetes/tree/master/pkg/kubelet/checkpointmanager).
Signed-off-by: Francesco Romani <fromani@redhat.com>
* move well-known kubelet cloud provider annotations to k8s.io/cloud-provider
Signed-off-by: andrewsykim <kim.andrewsy@gmail.com>
* cloud provider: rename AnnotationProvidedIPAddr to AnnotationAlphaProvidedIPAddr to indicate alpha status
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
As far as I can tell, nothing uses this type. As a result, it doesn't
really provide any benefit, and just clutters `kubelet.go`.
There's also the risk of it falling out of date with `NewMainKubelet`,
as nothing enforces that `NewMainKubelet` conforms to the `Builder` type.
No need to use summary to create statsFunc for localStorageEviction.
Just use vals from makeSignalObservations.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
When the kubelet is restarted, it will now remove the resources for huge
page sizes that are no longer supported. This is required when:
- the node disables huge pages
- the default huge page size changes on older versions of Linux
  (which then only support the newly set default)
- a software update changes which sizes are supported (e.g. by changing
  boot parameters)
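A hedged sketch of the cleanup, with a hypothetical helper name; one way to
express it is to zero out any `hugepages-*` resource whose page size the
machine no longer reports:
```go
import (
	"strings"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// zeroUnsupportedHugePages sets to zero every hugepages-* resource whose page
// size is no longer supported, so stale capacity from a previous boot
// disappears from the node status. `supported` holds the resource names for
// the currently detected page sizes (e.g. "hugepages-2Mi").
func zeroUnsupportedHugePages(capacity v1.ResourceList, supported map[v1.ResourceName]bool) {
	for name := range capacity {
		if !strings.HasPrefix(string(name), v1.ResourceHugePagesPrefix) || supported[name] {
			continue
		}
		capacity[name] = *resource.NewQuantity(0, resource.BinarySI)
	}
}
```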
The docker folks added a NumCPU implementation for Windows that supports
hot-plugging of CPUs. The implementation uses GetProcessAffinityMask to
check which CPUs are active as well.
3707a76921
The golang "runtime" package has also been using GetProcessAffinityMask
since 1.6 beta1:
6410e67a1e
So we don't seem to need sysinfo.NumCPU from docker/docker.
(Note that this PR is part of an effort to move away from dependencies on
docker/docker.)
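A sketch of the same approach in Go, loading GetProcessAffinityMask from
kernel32 directly (file name, package name, and fallback behavior are
illustrative):
```go
// numcpu_windows.go
// +build windows

package sysinfo

import (
	"math/bits"
	"unsafe"

	"golang.org/x/sys/windows"
)

var (
	kernel32                   = windows.NewLazySystemDLL("kernel32.dll")
	procGetProcessAffinityMask = kernel32.NewProc("GetProcessAffinityMask")
)

// numCPU counts only the CPUs the current process is allowed to run on, so
// hot-plugged or deactivated CPUs are reflected; this is the same idea
// docker's sysinfo.NumCPU uses on Windows.
func numCPU() int {
	var processMask, systemMask uintptr
	ret, _, _ := procGetProcessAffinityMask.Call(
		uintptr(windows.CurrentProcess()),
		uintptr(unsafe.Pointer(&processMask)),
		uintptr(unsafe.Pointer(&systemMask)),
	)
	if ret == 0 {
		return 0 // caller should fall back to runtime.NumCPU()
	}
	return bits.OnesCount(uint(processMask))
}
```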
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
containerMap is used in the CPU Manager to store information about all
containers on the node. It provides a mapping from (pod, container) ->
containerID for all containers in a pod.
It is reusable by other components in pkg/kubelet/cm that need to track
changes to all containers on the node.
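A sketch of the shape of such a type; the real `containermap` package's
method set and signatures may differ:
```go
// containerRef identifies the container a containerID belongs to.
type containerRef struct {
	podUID        string
	containerName string
}

// ContainerMap maps containerID -> (podUID, containerName), which also gives
// a reverse lookup from (pod, container) to containerID.
type ContainerMap map[string]containerRef

func (cm ContainerMap) Add(podUID, containerName, containerID string) {
	cm[containerID] = containerRef{podUID: podUID, containerName: containerName}
}

func (cm ContainerMap) Remove(containerID string) {
	delete(cm, containerID)
}

// GetContainerID returns the containerID for a (pod, container) pair.
func (cm ContainerMap) GetContainerID(podUID, containerName string) (string, bool) {
	for id, ref := range cm {
		if ref.podUID == podUID && ref.containerName == containerName {
			return id, true
		}
	}
	return "", false
}
```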
Signed-off-by: Byonggon Chun <bg.chun@samsung.com>
Do a conversion from the cgroups v1 limits to cgroups v2.
E.g. cpu.shares on cgroups v1 has a range of [2-262144], while the
equivalent on cgroups v2 is cpu.weight, which uses a range of [1-10000].
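As a sketch, the linear mapping commonly used by OCI runtimes for this
conversion looks like this (2 maps to 1, 262144 maps to 10000):
```go
// cpuSharesToCPUWeight converts a cgroups v1 cpu.shares value in [2, 262144]
// to a cgroups v2 cpu.weight value in [1, 10000].
func cpuSharesToCPUWeight(shares uint64) uint64 {
	if shares == 0 {
		return 0 // nothing requested, nothing to convert
	}
	if shares < 2 {
		shares = 2 // clamp to the bottom of the v1 range
	}
	return 1 + ((shares-2)*9999)/262142
}
```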
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
On Windows, the podAdmitHandler returned by the GetAllocateResourcesPodAdmitHandler() func
and registered by the Kubelet is nil.
We implement a noopWindowsResourceAllocator that admits any pod on Windows,
in order to be consistent with the original implementation.
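A minimal sketch, assuming the `lifecycle.PodAdmitHandler` interface (a
single `Admit` method returning a `PodAdmitResult`):
```go
import (
	"k8s.io/kubernetes/pkg/kubelet/lifecycle"
)

// noopWindowsResourceAllocator admits every pod. Windows has no CPU/device
// allocation to perform here, but the Kubelet still expects a non-nil admit
// handler to be registered.
type noopWindowsResourceAllocator struct{}

func (a *noopWindowsResourceAllocator) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
	return lifecycle.PodAdmitResult{Admit: true}
}
```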
When we clobber PodIP we should also overwrite PodIPs, and not rely
on the apiserver to fix it for us - this caused the Kubelet status
manager to report a long stream of the following warnings when
it tried to reconcile a host-network pod:
```
I0309 19:41:05.283623 1326 status_manager.go:846] Pod status is inconsistent with cached status for pod "machine-config-daemon-jvwz4_openshift-machine-config-operator(61176279-f752-4e1c-ac8a-b48f0a68d54a)", a reconciliation should be triggered:
&v1.PodStatus{
... // 5 identical fields
HostIP: "10.0.32.2",
PodIP: "10.0.32.2",
- PodIPs: []v1.PodIP{{IP: "10.0.32.2"}},
+ PodIPs: []v1.PodIP{},
StartTime: s"2020-03-09 19:41:05 +0000 UTC",
InitContainerStatuses: nil,
... // 3 identical fields
}
```
With the changes to the apiserver, this only happens once, but it is
still a bug.
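A hedged sketch of the idea, with a hypothetical helper name; the point is
simply that both fields are written together:
```go
import (
	v1 "k8s.io/api/core/v1"
	kubecontainer "k8s.io/kubernetes/pkg/kubelet/container"
)

// overrideHostNetworkPodIPs keeps PodIP and PodIPs in sync when the Kubelet
// clobbers the pod IP with the node IP for a host-network pod.
func overrideHostNetworkPodIPs(pod *v1.Pod, status *v1.PodStatus, hostIP string) {
	if !kubecontainer.IsHostNetworkPod(pod) || hostIP == "" {
		return
	}
	status.PodIP = hostIP
	// Also overwrite PodIPs; otherwise the cached status and the apiserver's
	// view disagree and the status manager keeps logging reconcile warnings.
	status.PodIPs = []v1.PodIP{{IP: hostIP}}
}
```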
Most of these could have been refactored automatically, but it wouldn't
have been any prettier: the unsophisticated tooling left lots of unnecessary
struct -> pointer -> struct transitions.
HPA needs metrics for exited init containers before it will
take action. By setting memory and CPU usage to zero for any
containers that cAdvisor didn't provide statistics for, we
are assured that HPA will be able to correctly calculate
pod resource usage.
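A sketch with simplified stand-in types (not the real summary API structs):
```go
// containerUsage is a trimmed-down stand-in for per-container stats.
type containerUsage struct {
	Name            string
	UsageNanoCores  uint64
	WorkingSetBytes uint64
}

// fillMissingStats ensures every container named in the pod spec has a stats
// entry, reporting explicit zeros when cAdvisor had nothing (e.g. exited init
// containers), so consumers like HPA can still sum pod usage.
func fillMissingStats(specContainers []string, reported map[string]containerUsage) []containerUsage {
	out := make([]containerUsage, 0, len(specContainers))
	for _, name := range specContainers {
		if u, ok := reported[name]; ok {
			out = append(out, u)
			continue
		}
		out = append(out, containerUsage{Name: name}) // zero CPU and memory
	}
	return out
}
```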
The status manager syncBatch() method processes the current state
of the cache, which should include all entries in the channel. Flush
the channel before we call syncBatch() to avoid unnecessary work and
to unblock pod workers when the node is congested.
Discovered while investigating long shutdown intervals on the node
where the status channel stayed full for tens of seconds.
Add a for loop around the select statement to avoid unnecessary
invocations of the wait.Forever closure each time.
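A sketch of the resulting loop shape, with a trimmed-down manager type (the
real channel carries status entries, not empty structs):
```go
import "time"

type manager struct {
	podStatusChannel chan struct{}
	syncTicker       <-chan time.Time
}

// syncLoop wraps the select in a for loop (so the wait.Forever closure is not
// re-entered for every update) and drains the channel before each batch.
func (m *manager) syncLoop(syncBatch func()) {
	for {
		select {
		case <-m.podStatusChannel:
			// An update arrived; drain the rest below and run one batch.
		case <-m.syncTicker:
		}
		// syncBatch works from the status cache, which already reflects the
		// queued entries, so processing them one by one is wasted work and
		// keeps pod workers blocked on a full channel.
		for drained := false; !drained; {
			select {
			case <-m.podStatusChannel:
			default:
				drained = true
			}
		}
		syncBatch()
	}
}
```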
When constructing the API status of a pod, if the pod is marked for
deletion no containers should be started. Previously, if a container
inside of a terminating pod failed to start due to a container
runtime error (that populates reasonCache) the reasonCache would
remain populated (it is only updated by syncPod for non-terminating
pods) and the delete action on the pod would be delayed until the
reasonCache entry expired due to other pods.
This dramatically reduces the amount of time the Kubelet waits to
delete pods that are terminating and encountered a container runtime
error.
After a pod reaches a terminal state and all containers are complete
we can delete the pod from the API server. The dispatchWork method
needs to wait for all container status to be available before invoking
delete. Even after the worker stops, status updates will continue to
be delivered and the sync handler will continue to sync the pods, so
dispatchWork gets multiple opportunities to see status.
The previous code assumed that a pod in Failed or Succeeded had no
running containers, but an evicted or deleted pod could still have
running containers whose status needs to be reported.
This modifies an earlier test to guarantee that the "fallback" exit
code 137 is never reported, matching the expectation that all pods
exit with valid status for all containers (unless some exceptional
failure like eviction occurs while the test is running).
The kubelet must not allow a container that was reported failed in a
restartPolicy=Never pod to be reported to the apiserver as success.
If a client deletes a restartPolicy=Never pod, the dispatchWork and
status manager race to update the container status. When dispatchWork
(specifically podIsTerminated) returns true, it means all containers
are stopped, which means the container status is accurate. However,
the TerminatePod method then clears this status. This results in a
pod that has been reported with status.phase=Failed getting reset to
status.phase=Succeeded, which is a violation of the guarantees around
terminal phase.
Ensure the Kubelet never reports that a container succeeded when it
never ran, by guarding the terminate pod loop against ever reporting
exit code 0 in the absence of container status.
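A hedged sketch of the guard, with a hypothetical helper and an illustrative
reason string:
```go
import v1 "k8s.io/api/core/v1"

// terminatedStateFor only reports the exit code we actually observed; when
// there is no container status at all, it falls back to a non-zero exit code
// (137 by convention) rather than implying success with 0.
func terminatedStateFor(observed *v1.ContainerStateTerminated) v1.ContainerState {
	if observed != nil {
		return v1.ContainerState{Terminated: observed}
	}
	return v1.ContainerState{Terminated: &v1.ContainerStateTerminated{
		Reason:   "ContainerStatusUnknown", // illustrative reason string
		ExitCode: 137,                      // never 0 for a container we never saw run
	}}
}
```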