YAML has comments, so we can explain why we have certain rules or
certain prefixes.
For those files that weren't already commented YAML, I converted them to
YAML and took a best guess at the comments based on the PRs that introduced
or updated them.
The expectation is that exclusive CPU allocations happen at pod
creation time. When a container restarts, it should not have its
exclusive CPU allocations removed, and it should not need to
re-allocate CPUs.
There are a few places in the current code that look for containers
that have exited and call CpuManager.RemoveContainer() to clean up
the container. This will end up deleting any exclusive CPU
allocations for that container, and if the container restarts within
the same pod it will end up using the default cpuset rather than
what should be exclusive CPUs.
Removing those calls and adding resource cleanup at allocation
time should get rid of the problem.
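As a rough illustration of that approach, here is a minimal, self-contained sketch (all names are illustrative, not the kubelet's actual cpumanager API) of reconciling assignments against the set of active pods at allocation time, so that a restarting container in a live pod keeps its exclusive CPUs:

```go
// Illustrative sketch only; not the kubelet's actual cpumanager types.
package cpumanager

type cpuSet map[int]struct{}

type policyState struct {
	// assignments maps podUID -> containerName -> exclusive CPUs.
	assignments map[string]map[string]cpuSet
	defaultSet  cpuSet // shared pool used by non-exclusive containers
}

// removeStaleState returns the exclusive CPUs of containers whose pod is
// no longer active to the shared pool. Containers restarting inside an
// active pod keep their allocations untouched.
func (s *policyState) removeStaleState(activePods map[string]bool) {
	for podUID, containers := range s.assignments {
		if activePods[podUID] {
			continue
		}
		for _, cpus := range containers {
			for cpu := range cpus {
				s.defaultSet[cpu] = struct{}{}
			}
		}
		delete(s.assignments, podUID)
	}
}
```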
Signed-off-by: Chris Friesen <chris.friesen@windriver.com>
With the old strategy, it was possible for an init container to end up
running without some of its CPUs being exclusive if it requested more
guaranteed CPUs than the sum of all guaranteed CPUs requested by app
containers. Unfortunately, this case was not caught by our unit tests
because they didn't validate the state of the defaultCPUSet to ensure
there was no overlap with CPUs assigned to containers. This patch
updates the strategy to reuse the CPUs assigned to init containers
for app containers, while avoiding this edge case. It also
updates the unit tests to now catch this type of error in the future.
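A minimal sketch of the reuse bookkeeping described above (names are illustrative; the real policy uses cpuset.CPUSet and additional state): CPUs handed to init containers are recorded per pod and preferred for app containers, and any CPU taken by an app container is dropped from the reuse pool so it can never be double-counted:

```go
// Illustrative sketch only; not the actual static policy implementation.
package cpumanager

type cpuIDSet map[int]bool

type reuseTracker struct {
	// cpusToReuse maps podUID -> CPUs freed by init containers that may
	// be handed to app containers of the same pod.
	cpusToReuse map[string]cpuIDSet
}

// updateCPUsToReuse records init-container CPUs as reusable and removes
// CPUs taken by app containers from the reuse pool.
func (t *reuseTracker) updateCPUsToReuse(podUID string, isInitContainer bool, assigned cpuIDSet) {
	if t.cpusToReuse[podUID] == nil {
		t.cpusToReuse[podUID] = cpuIDSet{}
	}
	for cpu := range assigned {
		if isInitContainer {
			t.cpusToReuse[podUID][cpu] = true
		} else {
			delete(t.cpusToReuse[podUID], cpu)
		}
	}
}
```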
The cpumanager's file-based state backend has been obsolete for a few
releases, the cpumanager having moved to the checkpointmanager common
infrastructure.
The old test checking compatibility to/from the old format is
also no longer needed, because the checkpoint format is stable
(see
https://github.com/kubernetes/kubernetes/tree/master/pkg/kubelet/checkpointmanager).
Signed-off-by: Francesco Romani <fromani@redhat.com>
* move well-known kubelet cloud provider annotations to k8s.io/cloud-provider
Signed-off-by: andrewsykim <kim.andrewsy@gmail.com>
* cloud provider: rename AnnotationProvidedIPAddr to AnnotationAlphaProvidedIPAddr to indicate alpha status
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
As far as I can tell, nothing uses this type. As a result, it doesn't
really provide any benefit, and just clutters `kubelet.go`.
There's also the risk of it falling out of date with `NewMainKubelet`,
as nothing enforces `NewMainKubelet` being of the `Builder` type.
No need to use summary to create statsFunc for localStorageEviction.
Just use vals from makeSignalObservations.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
When the kubelet is restarted, it will now remove the resources for huge
page sizes that are no longer supported. This is required when:
- the node disables huge pages
- the default huge page size changes on older versions of Linux
  (because the node will then only support the newly set default)
- software updates change which sizes are supported (e.g. by changing
  boot parameters)
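A minimal sketch of the pruning, assuming the standard k8s.io/api/core/v1 types; how the supported set is re-discovered on startup is elided:

```go
package nodestatus // illustrative package name

import (
	"strings"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// pruneUnsupportedHugePages zeroes out any hugepages-<size> resource that
// the machine no longer reports as supported.
func pruneUnsupportedHugePages(node *v1.Node, supported map[v1.ResourceName]bool) {
	zero := *resource.NewQuantity(0, resource.BinarySI)
	for name := range node.Status.Capacity {
		if strings.HasPrefix(string(name), v1.ResourceHugePagesPrefix) && !supported[name] {
			node.Status.Capacity[name] = zero
			node.Status.Allocatable[name] = zero
		}
	}
}
```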
The docker folks added a NumCPU implementation for Windows that
supports hot-plugging of CPUs. The implementation uses
GetProcessAffinityMask to check which CPUs are
active as well.
3707a76921
The golang "runtime" package has also bene using GetProcessAffinityMask
since 1.6 beta1:
6410e67a1e
So we don't seem to need the sysinfo.NumCPU from docker/docker.
(Note that this PR is an effort to get away from dependencies on
docker/docker.)
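For reference, a minimal sketch of counting usable CPUs from the process affinity mask (using golang.org/x/sys/windows; this illustrates the technique, it is not the code from either project):

```go
//go:build windows

package main

import (
	"fmt"

	"golang.org/x/sys/windows"
)

func numCPU() int {
	var processMask, systemMask uintptr
	// Each set bit in the process affinity mask is one logical CPU this
	// process may run on, which also reflects hot-plugged CPUs.
	err := windows.GetProcessAffinityMask(windows.CurrentProcess(), &processMask, &systemMask)
	if err != nil {
		return 1 // fall back conservatively
	}
	n := 0
	for m := processMask; m != 0; m &= m - 1 { // clear lowest set bit
		n++
	}
	return n
}

func main() {
	fmt.Println("usable CPUs:", numCPU())
}
```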
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
containerMap is used by the CPU Manager to store information about all containers on the node.
containerMap provides a mapping from (pod, container) -> containerID for all containers in a pod.
It is reusable by other components in pkg/kubelet/cm that need to track changes to all containers on the node.
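A minimal sketch of the shape of containerMap (the real type lives in pkg/kubelet/cm and its exact method set may differ):

```go
// Illustrative sketch only.
package cm

import "fmt"

type containerEntry struct {
	podUID        string
	containerName string
}

// ContainerMap maps containerID -> (podUID, containerName) and supports
// reverse lookups from (pod, container) to containerID.
type ContainerMap map[string]containerEntry

func NewContainerMap() ContainerMap { return make(ContainerMap) }

func (cm ContainerMap) Add(podUID, containerName, containerID string) {
	cm[containerID] = containerEntry{podUID, containerName}
}

func (cm ContainerMap) RemoveByContainerID(containerID string) {
	delete(cm, containerID)
}

func (cm ContainerMap) GetContainerID(podUID, containerName string) (string, error) {
	for id, e := range cm {
		if e.podUID == podUID && e.containerName == containerName {
			return id, nil
		}
	}
	return "", fmt.Errorf("container %q in pod %q not found", containerName, podUID)
}
```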
Signed-off-by: Byonggon Chun <bg.chun@samsung.com>
Do a conversion from the cgroups v1 limits to cgroups v2.
E.g. cpu.shares on cgroups v1 has a range of [2-262144], while the
equivalent on cgroups v2 is cpu.weight, which uses a range of [1-10000].
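A minimal sketch of the linear mapping between the two ranges (matching the formula runc uses for this conversion):

```go
package cm

// cpuSharesToCPUWeight converts cgroups v1 cpu.shares [2..262144] to
// cgroups v2 cpu.weight [1..10000] with a linear map: 2 -> 1, 262144 -> 10000.
func cpuSharesToCPUWeight(cpuShares uint64) uint64 {
	if cpuShares == 0 {
		return 0 // no shares requested -> no weight set
	}
	return 1 + ((cpuShares-2)*9999)/262142
}
```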
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
On Windows, the podAdmitHandler returned by the GetAllocateResourcesPodAdmitHandler() func
and registered by the kubelet is nil.
We implement a noopWindowsResourceAllocator that admits any pod on Windows in order
to be consistent with the original implementation.
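A minimal sketch of that allocator, assuming the kubelet's pkg/kubelet/lifecycle types:

```go
package cm

import "k8s.io/kubernetes/pkg/kubelet/lifecycle"

type noopWindowsResourceAllocator struct{}

// Admit unconditionally accepts the pod: on Windows there is no exclusive
// CPU/device allocation to attempt, so admission always succeeds.
func (noopWindowsResourceAllocator) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
	return lifecycle.PodAdmitResult{Admit: true}
}
```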
When we clobber PodIP we should also overwrite PodIPs and not rely
on the apiserver to fix it for us - this caused the kubelet status
manager to report a long stream of the following warnings when
it tried to reconcile a host-network pod:
```
I0309 19:41:05.283623 1326 status_manager.go:846] Pod status is inconsistent with cached status for pod "machine-config-daemon-jvwz4_openshift-machine-config-operator(61176279-f752-4e1c-ac8a-b48f0a68d54a)", a reconciliation should be triggered:
&v1.PodStatus{
... // 5 identical fields
HostIP: "10.0.32.2",
PodIP: "10.0.32.2",
- PodIPs: []v1.PodIP{{IP: "10.0.32.2"}},
+ PodIPs: []v1.PodIP{},
StartTime: s"2020-03-09 19:41:05 +0000 UTC",
InitContainerStatuses: nil,
... // 3 identical fields
}
```
With the changes to the apiserver, this only happens once, but it is
still a bug.
Most of these could have been refactored automatically, but it would
have been uglier. The unsophisticated tooling left lots of unnecessary
struct -> pointer -> struct transitions.
HPA needs metrics for exited init containers before it will
take action. By setting memory and CPU usage to zero for any
containers that cAdvisor didn't provide statistics for, we
are assured that HPA will be able to correctly calculate
pod resource usage.
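A minimal sketch of the zero-filling, assuming the kubelet stats API types; the function name is illustrative:

```go
package stats

import statsapi "k8s.io/kubernetes/pkg/kubelet/apis/stats/v1alpha1"

// zeroMissingUsage fills in zero CPU/memory usage for a container that
// cAdvisor returned no data for (e.g. an exited init container), keeping
// pod-level sums well defined for consumers like the HPA.
func zeroMissingUsage(cs *statsapi.ContainerStats) {
	zero := uint64(0)
	if cs.CPU == nil {
		cs.CPU = &statsapi.CPUStats{UsageNanoCores: &zero}
	}
	if cs.Memory == nil {
		cs.Memory = &statsapi.MemoryStats{WorkingSetBytes: &zero}
	}
}
```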
The status manager syncBatch() method processes the current state
of the cache, which should include all entries in the channel. Flush
the channel before we sync a batch to avoid unnecessary work and
to unblock pod workers when the node is congested.
Discovered while investigating long shutdown intervals on the node
where the status channel stayed full for tens of seconds.
Add a for loop around the select statement to avoid unnecessary
invocations of the wait.Forever closure each time.
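A minimal sketch of the drain-then-sync shape (channel and handler names are assumptions, not the status manager's exact code):

```go
package status

type podStatusUpdate struct{ /* fields elided */ }

// drainAndSync folds everything currently queued on updates into cached
// state (via handle) and then performs a single batch sync, so congested
// nodes don't pay one sync per queued update.
func drainAndSync(updates <-chan podStatusUpdate, handle func(podStatusUpdate), syncBatch func()) {
	for {
		select {
		case u := <-updates:
			handle(u) // record into the cache the batch will read
		default:
			syncBatch() // channel drained; sync once
			return
		}
	}
}
```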
When constructing the API status of a pod, if the pod is marked for
deletion no containers should be started. Previously, if a container
inside of a terminating pod failed to start due to a container
runtime error (that populates reasonCache) the reasonCache would
remain populated (it is only updated by syncPod for non-terminating
pods) and the delete action on the pod would be delayed until the
reasonCache entry expired due to other pods.
This dramatically reduces the amount of time the Kubelet waits to
delete pods that are terminating and encountered a container runtime
error.
After a pod reaches a terminal state and all containers are complete
we can delete the pod from the API server. The dispatchWork method
needs to wait for all container status to be available before invoking
delete. Even after the worker stops, status updates will continue to
be delivered and the sync handler will continue to sync the pods, so
dispatchWork gets multiple opportunities to see status.
The previous code assumed that a pod in Failed or Succeeded had no
running containers, but eviction or deletion of running pods could
still have running containers whose status needed to be reported.
This modifies earlier test to guarantee that the "fallback" exit
code 137 is never reported to match the expectation that all pods
exit with valid status for all containers (unless some exceptional
failure like eviction were to occur while the test is running).
The kubelet must not allow a container that was reported failed in a
restartPolicy=Never pod to be reported to the apiserver as success.
If a client deletes a restartPolicy=Never pod, the dispatchWork and
status manager race to update the container status. When dispatchWork
(specifically podIsTerminated) returns true, it means all containers
are stopped, which means status in the container is accurate. However,
the TerminatePod method then clears this status. This results in a
pod that has been reported with status.phase=Failed getting reset to
status.phase=Succeeded, which is a violation of the guarantees around
terminal phase.
Ensure the Kubelet never reports that a container succeeded when it
hasn't run or been executed by guarding the terminate pod loop from
ever reporting 0 in the absence of container status.
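A minimal sketch of such a guard (types from k8s.io/api/core/v1; the exact reason/message strings and the fallback exit code are assumptions):

```go
package status

import v1 "k8s.io/api/core/v1"

// guardTerminalStatuses ensures no container is reported as having exited
// successfully when we never observed it run: any container with neither
// Running nor Terminated state gets an explicit non-zero exit.
func guardTerminalStatuses(statuses []v1.ContainerStatus) {
	for i := range statuses {
		if statuses[i].State.Terminated == nil && statuses[i].State.Running == nil {
			statuses[i].State = v1.ContainerState{
				Terminated: &v1.ContainerStateTerminated{
					Reason:   "ContainerStatusUnknown",
					Message:  "The container could not be located when the pod was terminated",
					ExitCode: 137, // assumed fallback; never report 0 without evidence
				},
			}
		}
	}
}
```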
The pod admit handler is now returned by GetAllocateResourcesPodAdmitHandler(). It is named as such to reflect its
new function. Also remove the Topology Manager feature gate check at the higher level in
kubelet.go, as it is now done in GetAllocateResourcesPodAdmitHandler().
- allocatePodResources logic altered to allow for container by container
device allocation.
- New type PodReusableDevices
- New field in devicemanager devicesToReuse
- Where previously we called manager.AddContainer(), we now call both
manager.Allocate() and manager.AddContainer().
- Some test cases now have two expected errors. One each
from Allocate() and AddContainer(). Existing outcomes are unchanged.
GetTopologyPodAdmitHandler() now returns a lifecycle.PodAdmitHandler
type instead of the TopologyManager directly. The handler it returns
is generally responsible for attempting to allocate any resources that
require a pod admission check. When the TopologyManager feature gate
is on, this comes directly from the TopologyManager. When it is off,
we simply attempt the allocations ourselves and fail the admission
on an unexpected error. The higher level kubelet.go feature gate
check will be removed in an upcoming PR.
In an e2e run, out of 1857 pod status updates executed by the
Kubelet 453 (25%) were no-ops - they only contained the UID of
the pod and no status changes. If the patch is a no-op we can
avoid invoking the server and continue.
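A minimal sketch of the no-op check; the exact UID-only patch shape is an assumption based on the description above:

```go
package status

import (
	"bytes"
	"fmt"
)

// shouldSendPatch reports whether the prepared patch contains anything
// beyond the pod UID; a UID-only patch is a no-op and can be skipped.
func shouldSendPatch(patchBytes []byte, uid string) bool {
	uidOnly := []byte(fmt.Sprintf(`{"metadata":{"uid":%q}}`, uid))
	return !bytes.Equal(patchBytes, uidOnly)
}
```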
In https://github.com/kubernetes/kubernetes/pull/88372, we added the
ability to inject errors to the `FakeImageService`. Use this ability to
test the error paths executed by the `kubeGenericRuntimeManager` when
underlying `ImageService` calls fail.
I don't foresee this change having a huge impact, but it should set a
good precedent for test coverage, and should the failure case behavior
become more "interesting" or risky in the future, we already will have
the scaffolding in place with which we can expand the tests.
This implementation allows a Pod to request multiple hugepage resources
of different sizes and mount hugepage volumes using the storage medium
HugePages-<size>, e.g.
```yaml
spec:
  containers:
  - resources:
      requests:
        hugepages-2Mi: 2Mi
        hugepages-1Gi: 2Gi
    volumeMounts:
    - mountPath: /hugepages-2Mi
      name: hugepage-2mi
    - mountPath: /hugepages-1Gi
      name: hugepage-1gi
    ...
  volumes:
  - name: hugepage-2mi
    emptyDir:
      medium: HugePages-2Mi
  - name: hugepage-1gi
    emptyDir:
      medium: HugePages-1Gi
```
NOTE: This is an alpha feature.
Feature gate HugePageStorageMediumSize must be enabled for it to work.
The pod, container, and emptyDir volumes can all trigger evictions
when their limits are breached. To ensure that administrators can
alert on these types of evictions, update kubelet_evictions to include
the following signal types:
* ephemeralcontainerfs.limit - container ephemeral storage breaches its limit
* ephemeralpodfs.limit - pod ephemeral storage breaches its limit
* emptydirfs.limit - pod emptyDir storage breaches its limit
I think the TODO here may have actually been unnecessary. There isn't a
ton of interest around merging
https://github.com/kubernetes/kubernetes/pull/87425, which contains a
fix. Delete the TODO so we don't devote time to working on this area in
the future.
Filesystem mismatch is a special event. It could indicate either
that the user has asked for an incorrect filesystem or that there is an error
from which the mount operation cannot recover on retry.
Co-Authored-By: Jordan Liggitt <jordan@liggitt.net>
This change will not work on its own. Higher level code needs to make
sure and call Allocate() before AddContainer is called. This is already
being done in cases when the TopologyManager feature gate is enabled (in
the PodAdmitHandler of the TopologyManager). However, we need to make
sure we add proper logic to call it in cases when the TopologyManager
feature gate is disabled.
Having this interface allows us to perform a tight loop of:
```
for each container {
    containerHints = {}
    for each provider {
        containerHints[provider] = provider.GatherHints(container)
    }
    containerHints.MergeAndPublish()
    for each provider {
        provider.Allocate(container)
    }
}
```
With this in place we can now be sure that the hints gathered in one
iteration of the loop always consider the allocations made in the
previous.
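A minimal sketch of the interface shape implied by that loop (close to the TopologyManager's HintProvider, though the names here are not guaranteed to match upstream exactly):

```go
package topologymanager

import v1 "k8s.io/api/core/v1"

type TopologyHint struct{ /* NUMA affinity fields elided */ }

// HintProvider is implemented by resource managers (CPU manager, device
// manager, ...) so hints can be gathered and allocations performed for
// one container per loop iteration.
type HintProvider interface {
	// GetTopologyHints returns hints per resource name for this container.
	GetTopologyHints(pod *v1.Pod, container *v1.Container) map[string][]TopologyHint
	// Allocate claims resources for the container, honoring the merged hint.
	Allocate(pod *v1.Pod, container *v1.Container) error
}
```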
Instead of having a single call for Allocate(), we now split this into two
functions Allocate() and UpdatePluginResources().
The semantics are split across them as follows:
```go
// Allocate configures and assigns devices to a pod. From the requested
// device resources, Allocate will communicate with the owning device
// plugin to allow setup procedures to take place, and for the device
// plugin to provide runtime settings to use the device (environment
// variables, mount points and device files).
Allocate(pod *v1.Pod) error

// UpdatePluginResources updates node resources based on devices already
// allocated to pods. The node object is provided for the device manager to
// update the node capacity to reflect the currently available devices.
UpdatePluginResources(
    node *schedulernodeinfo.NodeInfo,
    attrs *lifecycle.PodAdmitAttributes) error
```
As we move to a model in which the TopologyManager is able to ensure
aligned allocations from the CPUManager, devicemanager, and any
other TopologyManager HintProviders in the same synchronous loop, we will
need to be able to call Allocate() independently from an
UpdatePluginResources(). This commit makes that possible.
The types were different so the diff output is not useful, both
should be pointers:
```
Feb 05 19:44:40 ci-ln-6k7l4-w-c-w9wbb.c.openshift-gce-devel-ci.internal hyperkube[2737]: I0205 19:44:40.222259 2737 status_manager.go:642] Pod status is inconsistent with cached status for pod "prometheus-k8s-1_openshift-monitoring(0e9137b8-3bd2-4353-b7f5-672749106dc1)", a reconciliation should be triggered:
Feb 05 19:44:40 ci-ln-6k7l4-w-c-w9wbb.c.openshift-gce-devel-ci.internal hyperkube[2737]: interface{}(
Feb 05 19:44:40 ci-ln-6k7l4-w-c-w9wbb.c.openshift-gce-devel-ci.internal hyperkube[2737]: - s"&PodStatus{Phase:Running,Conditions:[]PodCondition{PodCondition{Type:Initialized,Status:True,LastProbeTime:0001-01-01 00:00:00 +0000 UTC,LastTransitionTime:2020-02-05 19:13:30 +0000 UTC,Reason:,Message:,},PodCondit>
Feb 05 19:44:40 ci-ln-6k7l4-w-c-w9wbb.c.openshift-gce-devel-ci.internal hyperkube[2737]: + v1.PodStatus{
Feb 05 19:44:40 ci-ln-6k7l4-w-c-w9wbb.c.openshift-gce-devel-ci.internal hyperkube[2737]: + Phase: "Running",
Feb 05 19:44:40 ci-ln-6k7l4-w-c-w9wbb.c.openshift-gce-devel-ci.internal hyperkube[2737]: + Conditions: []v1.PodCondition{
```
Previously, the various Merge() policies of the TopologyManager all
returned their own lifecycle.PodAdmitResult result. However, for
consistency in any failed admits, this is better handled in the
top-level TopologyManager, with each policy only returning a boolean
about whether or not it would like to admit the pod. This
commit changes the semantics to match this logic.
Previously, we unconditionally removed *all* topology hints from a pod
whenever just one container was being removed. This commit makes it so
we only remove the hints for the single container being removed, and
then conditionally remove the pod from podTopologyHints[podUID] when
no containers are left in it.
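A minimal sketch of the scoped removal (field shapes are assumptions):

```go
package topologymanager

type TopologyHint struct{ /* fields elided */ }

type manager struct {
	// podTopologyHints maps podUID -> containerName -> merged hint.
	podTopologyHints map[string]map[string]TopologyHint
}

// removeContainerHints deletes hints for one container only, and drops
// the pod entry once no containers remain in it.
func (m *manager) removeContainerHints(podUID, containerName string) {
	delete(m.podTopologyHints[podUID], containerName)
	if len(m.podTopologyHints[podUID]) == 0 {
		delete(m.podTopologyHints, podUID)
	}
}
```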
- Add a unit test for updating the container hugepage limit.
- Add a warning message about ignoring case.
- Update error handling for the hugepage size requirements.
Signed-off-by: sewon.oh <sewon.oh@samsung.com>
As of https://github.com/kubernetes/kubernetes/pull/72831, the minimum
docker version is 1.13.1 (and the minimum API version is 1.26). The
only time the `RuntimeAdmitHandler` returns anything other than accept
is when the Docker API version < 1.24. In other words, we can be
confident that Docker will always support sysctl.
As a result, we can delete this unnecessary and docker-specific code.
- Move remaining logic from mergeProvidersHints into the generic top-level
  mergeFilteredHints function.
- Add numaNodes as a parameter in order to make it generic.
- Move the single-NUMA-node-specific check into the single-numa-node Merge
  function.
- Move the initial 'filtering' functionality into the generic function
  filterProvidersHints in the top-level policy.go.
- Call the new function from the top-level Merge function.
- Rename some variables/parameters to reflect the changes.