Rename GetTopologyPodAdmitHandler() to GetAllocateResourcesPodAdmitHandler(). It is named as such to reflect its
new function. Also remove the Topology Manager feature gate check at the higher level in
kubelet.go, as it is now done in GetAllocateResourcesPodAdmitHandler().
- allocatePodResources logic altered to allow for container-by-container
device allocation.
- New type PodReusableDevices (sketched below).
- New field devicesToReuse in the devicemanager.
- Where previously we called manager.AddContainer(), we now call both
manager.Allocate() and manager.AddContainer().
- Some test cases now have two expected errors: one each
from Allocate() and AddContainer(). Existing outcomes are unchanged.
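A minimal sketch of the new shape, using simplified stand-in types rather than the real devicemanager API (the real code works with *v1.Pod/*v1.Container, a richer set type, and different call sites for AddContainer):

```
package dmsketch

import "fmt"

// deviceSet is a simplified stand-in for the set type used by the real manager.
type deviceSet map[string]struct{}

// PodReusableDevices maps pod UID -> container name -> devices released by
// earlier (init) containers that later containers in the same pod may reuse.
type PodReusableDevices map[string]map[string]deviceSet

// manager is a stripped-down stand-in for the devicemanager.
type manager struct {
	devicesToReuse PodReusableDevices
}

// Allocate assigns devices to a single container, consulting devicesToReuse.
func (m *manager) Allocate(podUID, container string) error {
	if m.devicesToReuse == nil {
		m.devicesToReuse = PodReusableDevices{}
	}
	if m.devicesToReuse[podUID] == nil {
		m.devicesToReuse[podUID] = map[string]deviceSet{}
	}
	// The real implementation talks to the owning device plugin here.
	return nil
}

// AddContainer records the container's device assignment after allocation.
func (m *manager) AddContainer(podUID, container string) error { return nil }

// allocatePodResources now walks the pod container by container, calling
// Allocate() and then AddContainer() for each one.
func (m *manager) allocatePodResources(podUID string, containers []string) error {
	for _, c := range containers {
		if err := m.Allocate(podUID, c); err != nil {
			return fmt.Errorf("allocate %q: %w", c, err)
		}
		if err := m.AddContainer(podUID, c); err != nil {
			return fmt.Errorf("add container %q: %w", c, err)
		}
	}
	return nil
}
```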
GetTopologyPodAdmitHandler() now returns a lifecycle.PodAdmitHandler
type instead of the TopologyManager directly. The handler it returns
is generally responsible for attempting to allocate any resources that
require a pod admission check. When the TopologyManager feature gate
is on, this comes directly from the TopologyManager. When it is off,
we simply attempt the allocations ourselves and fail the admission
on an unexpected error. The higher level kubelet.go feature gate
check will be removed in an upcoming PR.
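As a rough sketch of the handler's shape when the feature gate is off, with simplified stand-ins for lifecycle.PodAdmitHandler and the real managers (names here are illustrative, not the actual kubelet code):

```
package kubeletsketch

import "fmt"

// Simplified stand-ins for the kubelet's pod and admit-result types.
type Pod struct{ Name string }

type PodAdmitResult struct {
	Admit   bool
	Reason  string
	Message string
}

type PodAdmitHandler interface {
	Admit(pod *Pod) PodAdmitResult
}

// resourceAllocator attempts the allocations itself and fails admission on
// any unexpected error; it stands in for the gate-off path described above.
type resourceAllocator struct {
	allocate func(pod *Pod) error // e.g. CPU and device allocation
}

func (a *resourceAllocator) Admit(pod *Pod) PodAdmitResult {
	if err := a.allocate(pod); err != nil {
		return PodAdmitResult{
			Admit:   false,
			Reason:  "UnexpectedAdmissionError",
			Message: fmt.Sprintf("allocation failed: %v", err),
		}
	}
	return PodAdmitResult{Admit: true}
}

// getAllocateResourcesPodAdmitHandler returns the TopologyManager's handler
// when the feature gate is on, and the fallback allocator otherwise.
func getAllocateResourcesPodAdmitHandler(topologyManagerEnabled bool, tm PodAdmitHandler, alloc func(pod *Pod) error) PodAdmitHandler {
	if topologyManagerEnabled {
		return tm
	}
	return &resourceAllocator{allocate: alloc}
}
```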
In an e2e run, out of 1857 pod status updates executed by the
Kubelet, 453 (25%) were no-ops: they only contained the UID of
the pod and no status changes. If the patch is a no-op we can
avoid invoking the server and continue.
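A minimal sketch of the skip, assuming the status manager builds a two-way merge patch whose no-op form carries only the pod UID (isNoOpStatusPatch is an illustrative name, not the real helper):

```
package statussketch

import (
	"bytes"
	"fmt"
)

// isNoOpStatusPatch reports whether a generated status patch carries nothing
// but the pod UID, in which case the PATCH request to the API server can be
// skipped entirely.
func isNoOpStatusPatch(patchBytes []byte, podUID string) bool {
	noOp := []byte(fmt.Sprintf(`{"metadata":{"uid":%q}}`, podUID))
	return bytes.Equal(patchBytes, noOp)
}
```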
In https://github.com/kubernetes/kubernetes/pull/88372, we added the
ability to inject errors to the `FakeImageService`. Use this ability to
test the error paths executed by the `kubeGenericRuntimeManager` when
underlying `ImageService` calls fail.
I don't foresee this change having a huge impact, but it should set a
good precedent for test coverage, and should the failure-case behavior
become more "interesting" or risky in the future, we will already have
the scaffolding in place with which to expand the tests.
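A small hedged sketch of the testing pattern; fakeImageService and pullImage are simplified stand-ins, not the real CRI FakeImageService or kubeGenericRuntimeManager:

```
package imagetest

import (
	"errors"
	"testing"
)

// fakeImageService lets tests inject an error, mirroring the idea added to
// the real FakeImageService in the linked PR.
type fakeImageService struct{ injectedErr error }

func (f *fakeImageService) PullImage(image string) (string, error) {
	if f.injectedErr != nil {
		return "", f.injectedErr
	}
	return "sha256:deadbeef", nil
}

// pullImage models the runtime-manager wrapper whose error path we want covered.
func pullImage(svc *fakeImageService, image string) (string, error) {
	ref, err := svc.PullImage(image)
	if err != nil {
		return "", err
	}
	return ref, nil
}

func TestPullImageError(t *testing.T) {
	svc := &fakeImageService{injectedErr: errors.New("registry unavailable")}
	if _, err := pullImage(svc, "busybox"); err == nil {
		t.Fatal("expected error from injected ImageService failure")
	}
}
```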
This implementation allows a Pod to request multiple hugepage resources
of different sizes and to mount hugepage volumes using the storage medium
HugePages-<size>, e.g.
spec:
  containers:
    resources:
      requests:
        hugepages-2Mi: 2Mi
        hugepages-1Gi: 2Gi
    volumeMounts:
    - mountPath: /hugepages-2Mi
      name: hugepage-2mi
    - mountPath: /hugepages-1Gi
      name: hugepage-1gi
  ...
  volumes:
  - name: hugepage-2mi
    emptyDir:
      medium: HugePages-2Mi
  - name: hugepage-1gi
    emptyDir:
      medium: HugePages-1Gi
NOTE: This is an alpha feature.
Feature gate HugePageStorageMediumSize must be enabled for it to work.
Pod, container, and emptyDir volumes can all trigger evictions
when their limits are breached. To ensure that administrators can
alert on these types of evictions, update kubelet_evictions to include
the following signal types:
* ephemeralcontainerfs.limit - container ephemeral storage breaches its limit
* ephemeralpodfs.limit - pod ephemeral storage breaches its limit
* emptydirfs.limit - pod emptyDir storage breaches its limit
Filesystem mismatch is a special event. It could indicate
either that the user has asked for an incorrect filesystem or that there is an error
from which the mount operation cannot recover on retry.
Co-Authored-By: Jordan Liggitt <jordan@liggitt.net>
This change will not work on its own. Higher-level code needs to make
sure to call Allocate() before AddContainer() is called. This is already
being done in cases when the TopologyManager feature gate is enabled (in
the PodAdmitHandler of the TopologyManager). However, we need to make
sure we add proper logic to call it in cases when the TopologyManager
feature gate is disabled.
Having this interface allows us to perform a tight loop of:
for each container {
    containerHints = {}
    for each provider {
        containerHints[provider] = provider.GatherHints(container)
    }
    containerHints.MergeAndPublish()
    for each provider {
        provider.Allocate(container)
    }
}
With this in place we can now be sure that the hints gathered in one
iteration of the loop always consider the allocations made in the
previous.
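A sketch of what such an interface and loop might look like, with simplified stand-ins for the pod, container, and hint types (the real HintProviders work with *v1.Pod and *v1.Container):

```
package topologysketch

// Simplified stand-ins for v1.Pod, v1.Container, and the topology hint type.
type Pod struct{ Containers []Container }
type Container struct{ Name string }
type TopologyHint struct{ Preferred bool }

// HintProvider separates hint gathering from allocation, so each iteration's
// hints reflect the allocations already made for earlier containers.
type HintProvider interface {
	GetTopologyHints(pod *Pod, container *Container) map[string][]TopologyHint
	Allocate(pod *Pod, container *Container) error
}

func admitPod(pod *Pod, providers []HintProvider, merge func(map[string][]TopologyHint) TopologyHint) error {
	for i := range pod.Containers {
		c := &pod.Containers[i]
		// Gather hints from every provider for this container.
		hints := map[string][]TopologyHint{}
		for _, p := range providers {
			for res, h := range p.GetTopologyHints(pod, c) {
				hints[res] = h
			}
		}
		_ = merge(hints) // merged hint is published before allocating
		// Allocate for this container before moving on to the next.
		for _, p := range providers {
			if err := p.Allocate(pod, c); err != nil {
				return err
			}
		}
	}
	return nil
}
```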
Instead of having a single call to Allocate(), we now split this into two
functions: Allocate() and UpdatePluginResources().
The semantics are split across them as follows:
// Allocate configures and assigns devices to a pod. From the requested
// device resources, Allocate will communicate with the owning device
// plugin to allow setup procedures to take place, and for the device
// plugin to provide runtime settings to use the device (environment
// variables, mount points and device files).
Allocate(pod *v1.Pod) error

// UpdatePluginResources updates node resources based on devices already
// allocated to pods. The node object is provided for the device manager to
// update the node capacity to reflect the currently available devices.
UpdatePluginResources(
    node *schedulernodeinfo.NodeInfo,
    attrs *lifecycle.PodAdmitAttributes) error
As we move to a model in which the TopologyManager is able to ensure
aligned allocations from the CPUManager, devicemanager, and any
other TopologyManager HintProviders in the same synchronous loop, we will
need to be able to call Allocate() independently from
UpdatePluginResources(). This commit makes that possible.
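A short sketch of how the decoupled calls might be sequenced during pod admission (interface and types simplified; the real signatures are the ones quoted above):

```
package dmsplitsketch

// Simplified stand-ins for the real kubelet types.
type Pod struct{ Name string }
type NodeInfo struct{}
type PodAdmitAttributes struct{ Pod *Pod }

type Manager interface {
	Allocate(pod *Pod) error
	UpdatePluginResources(node *NodeInfo, attrs *PodAdmitAttributes) error
}

// admitPod shows Allocate() being invoked on its own (so the TopologyManager
// can drive it in the same synchronous loop as other HintProviders), with
// UpdatePluginResources() called separately to refresh node capacity.
func admitPod(m Manager, pod *Pod, node *NodeInfo, attrs *PodAdmitAttributes) error {
	if err := m.Allocate(pod); err != nil {
		return err
	}
	return m.UpdatePluginResources(node, attrs)
}
```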
The types were different, so the diff output is not useful; both
should be pointers:
```
Feb 05 19:44:40 ci-ln-6k7l4-w-c-w9wbb.c.openshift-gce-devel-ci.internal hyperkube[2737]: I0205 19:44:40.222259 2737 status_manager.go:642] Pod status is inconsistent with cached status for pod "prometheus-k8s-1_openshift-monitoring(0e9137b8-3bd2-4353-b7f5-672749106dc1)", a reconciliation should be triggered:
Feb 05 19:44:40 ci-ln-6k7l4-w-c-w9wbb.c.openshift-gce-devel-ci.internal hyperkube[2737]: interface{}(
Feb 05 19:44:40 ci-ln-6k7l4-w-c-w9wbb.c.openshift-gce-devel-ci.internal hyperkube[2737]: - s"&PodStatus{Phase:Running,Conditions:[]PodCondition{PodCondition{Type:Initialized,Status:True,LastProbeTime:0001-01-01 00:00:00 +0000 UTC,LastTransitionTime:2020-02-05 19:13:30 +0000 UTC,Reason:,Message:,},PodCondit>
Feb 05 19:44:40 ci-ln-6k7l4-w-c-w9wbb.c.openshift-gce-devel-ci.internal hyperkube[2737]: + v1.PodStatus{
Feb 05 19:44:40 ci-ln-6k7l4-w-c-w9wbb.c.openshift-gce-devel-ci.internal hyperkube[2737]: + Phase: "Running",
Feb 05 19:44:40 ci-ln-6k7l4-w-c-w9wbb.c.openshift-gce-devel-ci.internal hyperkube[2737]: + Conditions: []v1.PodCondition{
```
Previously, the various Merge() policies of the TopologyManager all
returned their own lifecycle.PodAdmitResult. However, for
consistency in any failed admits, this is better handled in the
top-level TopologyManager, with each policy only returning a boolean
about whether or not it would like to admit the pod. This
commit changes the semantics to match this logic.
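A sketch of the new division of responsibility, with simplified types (the real code builds lifecycle.PodAdmitResult in the TopologyManager's admit path):

```
package policysketch

import "fmt"

type TopologyHint struct{ Preferred bool }

// Policy sketch: each policy now only decides whether to admit; the
// PodAdmitResult is constructed once, in the top-level manager.
type Policy interface {
	Merge(providersHints []map[string][]TopologyHint) (TopologyHint, bool)
}

type PodAdmitResult struct {
	Admit   bool
	Reason  string
	Message string
}

// admitResultFromPolicy turns the policy's boolean into a single, consistent
// admit/deny result at the top level.
func admitResultFromPolicy(p Policy, hints []map[string][]TopologyHint) PodAdmitResult {
	bestHint, admit := p.Merge(hints)
	if !admit {
		return PodAdmitResult{
			Admit:   false,
			Reason:  "TopologyAffinityError",
			Message: fmt.Sprintf("resources cannot be allocated with topology locality (best hint: %v)", bestHint),
		}
	}
	return PodAdmitResult{Admit: true}
}
```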
Previously, we unconditionally removed *all* topology hints from a pod
whenever just one container was being removed. This commit makes it so
we only remove the hints for the single container being removed, and
then conditionally remove the pod from podTopologyHints[podUID] when
no containers are left in it.
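A minimal sketch of the described removal behavior, using a simplified map type (names assumed, not the exact manager code):

```
package hintsketch

type TopologyHint struct{ Preferred bool }

// podTopologyHints maps pod UID -> container name -> merged hint.
type podTopologyHints map[string]map[string]TopologyHint

// removeContainerHints deletes only the hints of the container being removed,
// and drops the pod entry once no containers remain.
func (h podTopologyHints) removeContainerHints(podUID, containerName string) {
	delete(h[podUID], containerName)
	if len(h[podUID]) == 0 {
		delete(h, podUID)
	}
}
```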
Unit test for updating container hugepage limit
Add warning message about ignoring case.
Update error handling about hugepage size requirements
Signed-off-by: sewon.oh <sewon.oh@samsung.com>
As of https://github.com/kubernetes/kubernetes/pull/72831, the minimum
docker version is 1.13.1 (and the minimum API version is 1.26). The
only time the `RuntimeAdmitHandler` returns anything other than accept
is when the Docker API version is < 1.24. In other words, we can be
confident that Docker will always support sysctl.
As a result, we can delete this unnecessary and docker-specific code.
- Move remaining logic from mergeProvidersHints to generic top level
mergeFilteredHints function.
- Add numaNodes as a parameter in order to make it generic.
- Move single NUMA node specific check to single-numa-node Merge
function.
- Move initial 'filtering' functionality to a generic filterProvidersHints
function in the top-level policy.go.
- Call new function from top level Merge function.
- Rename some variables/parameters to reflect changes.
A recent change made it so that the CPUManager receives a list of
initial containers that exist on the system at startup. This list can be
non-empty, for example, after a kubelet restart.
This commit ensures that the CPUManager's containerMap structure is
initialized with the containers from this list.
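A minimal sketch of seeding such a structure from the initial container list; containerMap, containerInfo, and runtimeContainer are simplified stand-ins for the real types:

```
package cpumanagersketch

// containerInfo and containerMap are simplified stand-ins for the CPUManager's
// containerMap, which maps container IDs back to (pod UID, container name).
type containerInfo struct {
	podUID        string
	containerName string
}

type containerMap map[string]containerInfo

type runtimeContainer struct {
	ID            string
	PodUID        string
	ContainerName string
}

// buildInitialContainerMap seeds the map from the containers that already
// exist on the node at startup (e.g. after a kubelet restart), so later
// reconciliation and removal can find them.
func buildInitialContainerMap(initial []runtimeContainer) containerMap {
	m := make(containerMap, len(initial))
	for _, c := range initial {
		m[c.ID] = containerInfo{podUID: c.PodUID, containerName: c.ContainerName}
	}
	return m
}
```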
Previously, `pkg/kubelet/qos` contained two different docker-specific
OOM constants. One set the oom adj for sandbox docker containers and the
other set the oom adj for the docker daemon. Move both to be closer to
their actual usages in `dockershim`.
This change addresses a TODO and leads us towards the overall goal of
making `pkg/kubelet` runtime agnostic except for `dockershim`.
The logic has been updated to match the logic of the best-effort policy
except in two places:
1) The hint filtering function has been updated to allow "don't care"
hints encoded with a `nil` affinity mask to pass through the filter, in
addition to hints that have just a single NUMA bit set.
2) After calculating the `bestHint` we transform "don't care" affinities
encoded as having all NUMA bits set in their affinity masks into "don't
care" affinities encoded as `nil`.
- Initialize best Hint to TopologyHint{}
- Update checks.
- Move generic unit test case into policy-specific tests and update the
expected outcome to reflect changes.
- Restructure function
- Remove bug fix for catching {nil true} - To be fixed in later commit
- Restore unit tests to original state for testing filterHints
This is to keep consistency with the other policies.
This change may be made across all policies in a future PR, but it is removed
from the scope of this PR for now.
- Best Effort Policy: Return a hint with nil affinity, as opposed to
defaultAffinity, when the provider has no preference for NUMA affinity or no
possible NUMA affinities.
- Single NUMA Node Policy: Remove defaultHint from mergeProvidersHints.
Instead return an appropriate TopologyHint where required.
- Update unit tests to reflect changes. Some test cases moved into
individual policy test functions due to differing returned affinities
per policy.
- Remove getHintMatch method.
- Replace with simplified versions of mergePermutation and
iterateAllProviderTopologyHints methods - as used in best-effort.
- Remove getHintMatch unit tests.
- Update filterHints test to reflect changes in previous commit.
- Some common test cases achieve differing expected results based on
policy due to independent merge strategies. These cases are moved into
individual policy based test functions.
- Only append valid preferred-true hints to filtered.
- Return true if allResourceHints consist only of
nil-affinity/preferred-true hints ({nil true}), and update the defaultHint
preference accordingly.
Explanation taken from original commit:
- Change the current method of finding the best hint.
Instead of going over all permutations, sort the hints and find
the narrowest hint common to all resources.
- Break out early when merging to a preferred hint is not possible
- Remove need to pass policy and numaNodes as arguments
- Remove PolicySingleNUMANode special case check in policy_best_effort
- Add mergeProviderHints base to policy_single_numa_node for upcoming
commit
This check is redundant since we protect this call with a call to
`m.sourcesReady.AllReady()` earlier on. Moreover, having this check in
place means that we will leave some stale state around in cases where
there are actually no active pods in the system and this loop hasn't
cleaned them up yet. This can happen, for example, if a pod exits while
the kubelet is down for some reason. We see this exact case being
triggered in our e2e tests, where a test has been failing since October
when this change was first introduced.