The package says:
> the libcontainer SELinux package is only built for Linux, so it is
> necessary to have a NOP wrapper which is built for non-Linux platforms
This is no longer true: Kubernetes now imports
github.com/opencontainers/selinux/go-selinux, which has proper
multi-platform support (i.e. it is a no-op on non-Linux platforms).
Remove the whole package and call go-selinux directly.
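For illustration, a minimal sketch of calling go-selinux directly
(`selinux.GetEnabled` is part of its public API; everything else here is
just scaffolding):

```go
package main

import (
	"fmt"

	"github.com/opencontainers/selinux/go-selinux"
)

func main() {
	// go-selinux compiles on all platforms and simply reports false
	// on non-Linux builds, so no NOP wrapper package is needed.
	fmt.Println("SELinux enabled:", selinux.GetEnabled())
}
```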
Before this fix, hint permutations such as:
permutation: [{11 true} {0101 true}]
Could result in merged hints of:
mergedHint: {01 true}
This was possible because both hints in the permutation contain a "preferred"
allocation (i.e. the full set of NUMA nodes set in the affinity bitmask are
*required* to satisfy the allocation). Given this, the simplified logic we had
simply kept the merged hint as preferred as well.
However, what we really want is to ensure that the merged hint is only
preferred if *true* alignment of all resources is possible (i.e. if all hints
in the permutation are preferred AND their affinities are exactly equal).
The only exception to this is if *no* topology information is provided by a
given hint provider. In this case, we assume alignment doesn't matter and only
consider the resources that actually have hints provided for them.
This changes the semantics of permutations of the form:
permutation: [{111 true} {011 true}]
To now result in the merged hint of:
mergedHint: {011 false}
Instead of:
mergedHint: {011 true}
This is arguably how it should always have been though (because a hint should
not be preferred if true alignment isn't possible), and two tests have had to
change to accommodate these new semantics.
This commit changes the merge function to implement the updated logic, adds a
test to verify it is functioning correctly, and updates the two tests mentioned
above to adjust to the new semantics.
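As an illustration, a minimal sketch of the updated preference rule,
assuming a simplified hint type with string affinities; names are
illustrative, not the actual kubelet topology manager code:

```go
package main

import "fmt"

// TopologyHint is a simplified stand-in for the real topology manager
// type; affinities are plain strings here instead of bitmasks.
type TopologyHint struct {
	Affinity  string
	Preferred bool
}

// mergedPreferred returns true only if every hint in the permutation is
// preferred AND all affinities are exactly equal, i.e. true alignment of
// all resources is possible. Hints with no topology information (empty
// affinity in this sketch) are skipped, per the exception above.
func mergedPreferred(permutation []TopologyHint) bool {
	var first *TopologyHint
	for i := range permutation {
		h := &permutation[i]
		if h.Affinity == "" {
			continue // provider supplied no topology information
		}
		if !h.Preferred {
			return false
		}
		if first == nil {
			first = h
		} else if h.Affinity != first.Affinity {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(mergedPreferred([]TopologyHint{{"111", true}, {"011", true}})) // false
	fmt.Println(mergedPreferred([]TopologyHint{{"011", true}, {"011", true}})) // true
}
```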
Signed-off-by: Kevin Klues <kklues@nvidia.com>
The remote runtime implementation now supports the `verbose` fields,
which are required for consumers like cri-tools to enable multi CRI
version support.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
TestStart was previously flaky. In approx 100_000 local runs, it failed
about 70% of the time, and has been mentioned as a flaky unit test in
the past.
This flake was due to a race condition between the logic as written and the
Go scheduler. UpdateThreshold calls `notifier.Start(events)` in a new
goroutine, which is not guaranteed to run immediately.
This meant that if `m.Start()` was called before `notifier.Start()`, the
test would fail, as the notifier would not have been started before the
4 events were processed and the lock released.
Here, we update the test to more closely match the intended application
behaviour, and have events passed to the channel when `Start` is called
on the notifier.
This ensures that `Start` gets called and additionally validates
that the correct channel is provided to the notifier.
Stop was never called previously, as it only gets called on a subsequent
call to UpdateThreshold. `AnyTimes()` hid that this did not occur.
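A minimal, self-contained sketch of the synchronization idea (names are
illustrative, not the kubelet's or gomock's API): the goroutine signals
through a channel once it has actually started, so the caller cannot race
ahead of it.

```go
package main

import "fmt"

// startNotifier stands in for UpdateThreshold launching the notifier in
// a new goroutine; all names here are illustrative. The started channel
// is the synchronization the original test was missing: nothing else
// guarantees the goroutine runs before the caller proceeds.
func startNotifier(events chan<- string, started chan<- struct{}) {
	go func() {
		close(started) // prove the notifier actually started
		events <- "memory-threshold-crossed"
	}()
}

func main() {
	events := make(chan string, 1)
	started := make(chan struct{})
	startNotifier(events, started)
	<-started // block until the goroutine is scheduled, avoiding the flake
	fmt.Println(<-events)
}
```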
cAdvisor has code to expose OOM metrics since 0.40.0, but this was not
included in Kubelet so far. This commit enables it.
Signed-off-by: Jorik Jonker <jorik.jonker@eu.equinix.com>
- Allow a podWorker to start if it is blocked by a pod that has been
terminated before starting
- When a pod can't start AND has already been terminated, exit cleanly
- Add a unit test that exercises race conditions in pod workers
If CRI returns a container that has been created but is not running,
it is not safe to assume it is terminal, as our connection to CRI
may have failed. Instead, created is treated as waiting, as in
"waiting for this container to start". Either syncPod or
syncTerminatingPod is responsible for handling this state.
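A minimal sketch of that state mapping, with illustrative types rather
than the real CRI/kubecontainer ones:

```go
package main

import "fmt"

// Illustrative types only: a created-but-not-running container maps to a
// waiting state ("waiting for this container to start"), never to a
// terminal one, since the CRI connection may simply have failed mid-start.
type containerState int

const (
	stateCreated containerState = iota
	stateRunning
	stateExited
)

func toPodWorkerState(s containerState) string {
	switch s {
	case stateCreated:
		return "waiting" // syncPod or syncTerminatingPod takes it from here
	case stateRunning:
		return "running"
	default:
		return "terminated"
	}
}

func main() {
	fmt.Println(toPodWorkerState(stateCreated)) // waiting
}
```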
Host-network pod IPs are obtained from the reported kubelet node IPs.
Historically, host-network pod IPs have been immutable once set, but when
dual-stack support was added, we didn't consider that the secondary IP
address may not be present at the same time as the primary node IP.
If a secondary IP address is added to a node after the host-network pod
IPs are set, we can add the secondary host-network pod IP address while
maintaining the current behavior of not updating the existing pod IPs on
host-network pods.
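A minimal sketch of the resulting behavior (illustrative, not kubelet
code): existing pod IPs stay untouched, and a node IP of a missing
address family is appended once it appears.

```go
package main

import (
	"fmt"
	"net"
)

// mergeHostNetworkPodIPs keeps the current host-network pod IPs as-is and
// appends a node IP whose family (IPv4/IPv6) the pod doesn't have yet.
func mergeHostNetworkPodIPs(podIPs, nodeIPs []string) []string {
	hasFamily := map[bool]bool{} // key: isIPv6
	for _, ip := range podIPs {
		hasFamily[net.ParseIP(ip).To4() == nil] = true
	}
	for _, ip := range nodeIPs {
		if v6 := net.ParseIP(ip).To4() == nil; !hasFamily[v6] {
			podIPs = append(podIPs, ip)
			hasFamily[v6] = true
		}
	}
	return podIPs
}

func main() {
	// The secondary (IPv6) node IP showed up after the pod IPs were set.
	fmt.Println(mergeHostNetworkPodIPs([]string{"10.0.0.5"}, []string{"10.0.0.5", "fd00::5"}))
	// [10.0.0.5 fd00::5]
}
```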
In the following code pattern, the log message will get logged with v=0 in JSON
output although conceptually it has a higher verbosity:
    if klog.V(5).Enabled() {
        klog.Info("hello world")
    }
Having the actual verbosity in the JSON output is relevant, for example for
filtering out only the important info messages. The solution is to use
klog.V(5).Info or something similar.
Whether the outer if is necessary at all depends on how complex the parameters
are. The return value of klog.V can be captured in a variable and be used
multiple times to avoid the overhead for that function call and to avoid
repeating the verbosity level.
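Putting both recommendations together, the pattern looks like this
(klog's `V`, `Enabled`, and `Info` are the real API; the message is just
an example):

```go
package main

import "k8s.io/klog/v2"

func main() {
	defer klog.Flush()
	// Capture the Verbose value once: this avoids repeating the level,
	// skips a second V() call, and -- crucially -- records v=5 in the
	// structured/JSON output. (Run with -v=5 or higher to see it emit.)
	if klogV := klog.V(5); klogV.Enabled() {
		klogV.Info("hello world")
	}
}
```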
Revert to previous behavior in 1.21/1.20 of setting pod phase to failed
during graceful node shutdown.
Setting pods to failed phase will ensure that external controllers that
manage pods, like deployments, will create new pods to replace those that
are shut down. Many customers have taken a dependency on this behavior,
and it was a breaking change in 1.22, so this change reverts back to the
previous behavior.
Signed-off-by: David Porter <david@porter.me>
We should not touch the dockershim ahead of removal and therefore
default to `v1alpha2` CRI instead of `v1`.
Partially reverts changes from https://github.com/kubernetes/kubernetes/pull/106501
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Without this fix, the algorithm may decide to allocate "remainder" CPUs from a
NUMA node that has no more CPUs to allocate. Moreover, it was only considering
allocation of remainder CPUs from NUMA nodes such that each NUMA node in the
remainderSet could only allocate 1 (i.e. 'cpuGroupSize') more CPUs. With these
two issues in play, one could end up with an accounting error where not enough
CPUs were allocated by the time the algorithm runs to completion.
The updated algorithm will now omit any NUMA nodes that have 0 CPUs left from
the set of NUMA nodes considered for allocating remainder CPUs. Additionally,
we now consider *all* combinations of nodes from the remainder set of size
1..len(remainderSet). This allows us to find a better solution if allocating
CPUs from a smaller set leads to a more balanced allocation. Finally, we loop
through all NUMA nodes 1-by-1 in the remainderSet until all remainder CPUs
have been accounted for and allocated. This ensures that we will not hit an
accounting error later on, because we explicitly remove CPUs from the
remainder set until there are none left.
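A simplified sketch of the two fixes (illustrative names, not the actual
cpumanager code): NUMA nodes with no free CPUs are omitted, and remainder
CPUs are handed out node-by-node until every one of them is accounted for.

```go
package main

import "fmt"

// allocateRemainder skips NUMA nodes with 0 CPUs left, then loops through
// the remaining candidates 1-by-1 until all remainder CPUs are allocated,
// so no accounting error is possible later on.
func allocateRemainder(free map[int]int, remainderSet []int, remainder int) map[int]int {
	alloc := map[int]int{}

	// Omit NUMA nodes that have no CPUs left to allocate.
	var candidates []int
	for _, n := range remainderSet {
		if free[n] > 0 {
			candidates = append(candidates, n)
		}
	}

	for remainder > 0 {
		progress := false
		for _, n := range candidates {
			if remainder == 0 {
				break
			}
			if alloc[n] < free[n] {
				alloc[n]++
				remainder--
				progress = true
			}
		}
		if !progress {
			break // not enough capacity; the caller must surface an error
		}
	}
	return alloc
}

func main() {
	// Node 2 has no CPUs left and is skipped entirely.
	fmt.Println(allocateRemainder(map[int]int{0: 2, 1: 1, 2: 0}, []int{0, 1, 2}, 3))
	// map[0:2 1:1]
}
```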
A follow-on commit adds a set of unit tests that will fail before these
changes, but succeed after them.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Previously the algorithm was too restrictive because it tried to calculate the
minimum based on the number of *available* NUMA nodes and the number of
*available* CPUs on those NUMA nodes. Since there was no (easy) way to tell how
many CPUs an individual NUMA node happened to have, the average across them was
used. Using this value however, could result in thinking you need more NUMA
nodes to possibly satisfy a request than you actually do.
By using the *total* number of NUMA nodes and CPUs per NUMA node, we can get
the true minimum number of nodes required to satisfy a request. For a given
"current" allocation this may not be the true minimum, but it's better to
start with fewer and move up than to start with too many and miss out on a
better option.
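A sketch of the calculation, assuming a uniform topology where every NUMA
node has the same CPU count (names are illustrative):

```go
package main

import "fmt"

// minNUMANodes sketches the updated calculation: using the *total* CPUs
// per NUMA node, the true minimum number of nodes needed for a request is
// a simple ceiling division -- no averaging over "available" CPUs.
func minNUMANodes(cpusPerNUMANode, requested int) int {
	return (requested + cpusPerNUMANode - 1) / cpusPerNUMANode
}

func main() {
	fmt.Println(minNUMANodes(20, 48)) // 3 nodes at minimum
}
```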
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Now that the algorithm for balancing CPU distributions across NUMA nodes is
correct, this test actually behaves differently for the "packed" vs.
"distributed" allocation algorithms (as it should).
In the "packed" case we need to ensure that CPUs are allocated such that they
are packed onto cores. Since one CPU is already allocated from a core on NUMA
node 0, we want the next CPU to be its hyperthreaded pair (even though the
first available CPU id is on Socket 1).
In the "distributed" case, however, we want to ensure CPUs are allocated such
that we have a balanced distribution of CPUs across all NUMA nodes. This
points to allocating from Socket 1 if the only CPU allocated so far came
from Socket 0.
To allow CPU allocations to be packed onto full cores, one can allocate them
from the "distributed" algorithm with a 'cpuGroupSize' equal to the number of
hyperthreads per core (in this case 2). We added an explicit test case for this,
demonstrating that we get the same result as the "packed" algorithm does, even
though the "distributed" algorithm is in use.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
This fixes two related tests to better test our "balanced" distribution algorithm.
The first test originally provided an input with the following number of CPUs
available on each NUMA node:
Node 0: 16
Node 1: 20
Node 2: 20
Node 3: 20
It then attempted to distribute 48 CPUs across them with an expectation that
each of the first 3 NUMA nodes would have 16 CPUs taken from them (leaving Node
0 with no more CPUs in the end).
This would have resulted in the following number of CPUs remaining on each node:
Node 0: 0
Node 1: 4
Node 2: 4
Node 3: 20
Which results in a standard deviation of 7.6811
However, a more balanced solution would actually be to pull 16 CPUs from NUMA
nodes 1, 2, and 3, and leave Node 0 untouched, i.e.:
Node 0: 16
Node 1: 4
Node 2: 4
Node 3: 4
Which results in a standard deviation of 5.1961524227066
To fix this test we changed the original number of available CPUs to start
with 4 fewer CPUs on NUMA node 3, and 2 more CPUs on NUMA node 0, i.e.:
Node 0: 18
Node 1: 20
Node 2: 20
Node 3: 16
So that we end up with a result of:
Node 0: 2
Node 1: 4
Node 2: 4
Node 3: 16
Which pulls the CPUs from where we want and results in a standard deviation of 5.5452
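The standard deviations quoted here are population standard deviations
over the per-node leftover CPU counts; a quick sketch to check the
numbers (not the kubelet's actual helper):

```go
package main

import (
	"fmt"
	"math"
)

// stddev computes the population standard deviation used in the
// comparisons above.
func stddev(xs []float64) float64 {
	var mean, sumSq float64
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	for _, x := range xs {
		sumSq += (x - mean) * (x - mean)
	}
	return math.Sqrt(sumSq / float64(len(xs)))
}

func main() {
	fmt.Println(stddev([]float64{0, 4, 4, 20})) // ≈ 7.6811
	fmt.Println(stddev([]float64{16, 4, 4, 4})) // ≈ 5.1962
	fmt.Println(stddev([]float64{2, 4, 4, 16})) // ≈ 5.5453
}
```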
For the second test, we simply reverse the number of CPUs available for Nodes 0
and 3 as:
Node 0: 16
Node 1: 20
Node 2: 20
Node 3: 18
Which forces the allocation to happen just as it did for the first test, except
now on NUMA nodes 1, 2, and 3 instead of NUMA nodes 0, 1, and 2.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Previously these would return lists that were too long because we appended to
pre-initialized lists with a specific size.
Since the primary place these functions are used is in the mean and standard
deviation calculations for the NUMA distribution algorithm, it meant that the
results of these calculations were often incorrect.
As a result, some of the unit tests we have are actually incorrect (because the
results we expect do not actually produce the best balanced
distribution of CPUs across all NUMA nodes for the input provided).
These tests will be patched up in subsequent commits.
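A minimal sketch of the underlying Go pitfall (a generic example, not the
actual helper functions):

```go
package main

import "fmt"

// Appending to a pre-sized slice pads the front with zeros, so any mean
// or standard deviation computed over the result is skewed.
func main() {
	bad := make([]int, 3)      // [0 0 0]
	bad = append(bad, 1, 2, 3) // [0 0 0 1 2 3] -- twice as long as intended
	good := make([]int, 0, 3)  // length 0, capacity 3
	good = append(good, 1, 2, 3)
	fmt.Println(bad, good) // [0 0 0 1 2 3] [1 2 3]
}
```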
Signed-off-by: Kevin Klues <kklues@nvidia.com>
This patch makes the CRI `v1` API the new project-wide default version.
To allow backwards compatibility, a fallback to `v1alpha2` has been added
as well. This fallback can be determined automatically by the kubelet.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
On systems where the calculated CPU shares result in a value above the
maximum value in Linux, containers getting that value are unable to start.
This occurs on systems with 300+ CPU cores where containers are given such
a value.
This issue was already fixed for the pod and QoS control groups by the
similar cm.MilliCPUToShares, which also has tests verifying the behavior.
Since this code already has a dependency on kubelet/cm, let's reuse that
code instead.
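For context, a sketch of the clamping that the reused helper performs;
the constants reflect the Linux cgroup v1 bounds for cpu.shares, and the
function body approximates cm.MilliCPUToShares rather than copying it
verbatim:

```go
package main

import "fmt"

const (
	minShares     = 2      // kernel minimum for cpu.shares
	maxShares     = 262144 // kernel maximum for cpu.shares
	sharesPerCPU  = 1024
	milliCPUToCPU = 1000
)

func milliCPUToShares(milliCPU int64) uint64 {
	if milliCPU == 0 {
		return minShares
	}
	shares := (milliCPU * sharesPerCPU) / milliCPUToCPU
	if shares < minShares {
		return minShares
	}
	if shares > maxShares {
		return maxShares // this cap is what fixes the 300+ core case
	}
	return uint64(shares)
}

func main() {
	fmt.Println(milliCPUToShares(400 * 1000)) // 262144, not 409600
}
```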
The current memory notifier relies on reading `cgroup.event_control`,
which is unsupported on cgroup v2. For now, let's disable the feature on
cgroup v2.
Once kubernetes#104613 and kubernetes#104693
merge, we'll have an OS field in the pod spec. The kubelet should start
rejecting pods where pod.Spec.OS doesn't match the node's OS (determined
via runtime.GOOS).
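A minimal sketch of the check (the `pod.Spec.OS` field comes from the
cited PRs; the helper below is hypothetical):

```go
package main

import (
	"fmt"
	"runtime"
)

// admitPod sketches the described rejection; podOS stands in for
// pod.Spec.OS.Name, and the surrounding admission plumbing is omitted.
func admitPod(podOS string) error {
	if podOS != "" && podOS != runtime.GOOS {
		return fmt.Errorf("pod OS %q does not match node OS %q", podOS, runtime.GOOS)
	}
	return nil
}

func main() {
	fmt.Println(admitPod("windows")) // rejected on a linux node
}
```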
These three options are the ones from logs.AddFlags which are not deprecated.
Therefore it makes sense to make them available also via the configuration file
support in the one command which currently supports that (kubelet).
Long-term, all commands should use LoggingConfiguration, either with a
configuration file (as in kubelet) or via flags (kube-scheduler,
kube-apiserver, kube-controller-manager).
Short-term, both approaches have to be supported. As the majority of the
commands only use logs.AddFlags, that function by default continues to register
the flags and only leaves that to Options.AddFlags when explicitly requested.
A drive-by bug fix is done for log flushing: the periodic flushing called
klog.Flush and therefore missed explicit flushing of the newer logr
backend. This bug was never present in any released version of Kubernetes,
and therefore the fix is not submitted in a separate PR.
This feature graduated to GA in v1.11 and will always be enabled, so
there is no longer a need to check whether it is enabled.
Signed-off-by: Konstantin Misyutin <konstantin.misyutin@huawei.com>
* De-share the Handler struct in core API
An upcoming PR adds a handler that only applies on one of these paths.
Having fields that don't work seems bad.
This never should have been shared. Lifecycle hooks are like a "write"
while probes are more like a "read". HTTPGet and TCPSocket don't really
make sense as lifecycle hooks (but I can't take that back). When we add
gRPC, it is EXPLICITLY a health check (defined by gRPC) not an arbitrary
RPC - so a probe makes sense but a hook does not.
In the future I can also see adding lifecycle hooks that don't make
sense as probes. E.g. 'sleep' is a common lifecycle request. The only
option is `exec`, which requires having a sleep binary in your image.
* Run update scripts
This is partially to allow the kube alpha tests to pass until CRI
implementations have support, but also to handle this error situation a
bit more elegantly.
Signed-off-by: Peter Hunt <pehunt@redhat.com>
This commit adds an initial implementation of translating from the new CRI fields
to the /stats/summary PodStats object.
Signed-off-by: Peter Hunt <pehunt@redhat.com>
The commit a8b8995ef2
changed the content of the data kubelet writes in the checkpoint.
Unfortunately, the checkpoint restore code was not updated,
so if we upgrade the kubelet from pre-1.20 to 1.20+, the
device manager can no longer restore its state correctly.
The only trace of this misbehaviour is this line in the
kubelet logs:
```
W0615 07:31:49.744770 4852 manager.go:244] Continue after failing to read checkpoint file. Device allocation info may NOT be up-to-date. Err: json: cannot unmarshal array into Go struct field PodDevicesEntry.Data.PodDeviceEntries.DeviceIDs of type checkpoint.DevicesPerNUMA
```
If we hit this bug, the device allocation info is
indeed NOT up-to-date until the device plugins register
themselves again. This can take up to a few minutes, depending
on the specific device plugin.
While the device manager state is inconsistent:
1. the kubelet will NOT update the device availability to zero, so
the scheduler will send pods towards the inconsistent kubelet.
2. at pod admission time, the device manager allocation will not
trigger, so pods will be admitted without devices actually
being allocated to them.
To fix these issues, we add support to the device manager to
read pre-1.20 checkpoint data. We retroactively call this
format "v1".
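A minimal sketch of the versioned fallback idea, with illustrative types
standing in for the real checkpoint structs:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Illustrative types only: pre-1.20 ("v1") checkpoints stored DeviceIDs
// as a flat list, while the current format keys them by NUMA node --
// which is exactly why old data fails to unmarshal into the new struct.
type entryV1 struct {
	DeviceIDs []string
}

type entryV2 struct {
	DeviceIDs map[int64][]string
}

// decodeEntry tries the current layout first and falls back to v1, so a
// kubelet upgraded from pre-1.20 can still restore its state instead of
// running with inconsistent device accounting.
func decodeEntry(data []byte) (entryV2, error) {
	var v2 entryV2
	if err := json.Unmarshal(data, &v2); err == nil {
		return v2, nil
	}
	var v1 entryV1
	if err := json.Unmarshal(data, &v1); err != nil {
		return entryV2{}, err
	}
	// NUMA affinity is unknown in v1 data; park everything on node 0.
	return entryV2{DeviceIDs: map[int64][]string{0: v1.DeviceIDs}}, nil
}

func main() {
	out, err := decodeEntry([]byte(`{"DeviceIDs":["dev-a","dev-b"]}`)) // v1 layout
	fmt.Println(out, err)
}
```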
Signed-off-by: Francesco Romani <fromani@redhat.com>
Separate the CPU/memory request/limit -> Linux resource conversion into
its own function for better reuse.
Elsewhere in the kuberuntime package, we will want to leverage this
requests/limits-to-Linux-resources type conversion.
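A sketch of what the extracted helper computes, with illustrative types;
the share/quota math follows the standard Kubernetes milli-CPU
conversions rather than quoting the real function:

```go
package main

import "fmt"

// linuxResources mirrors the shape of the runtime's resource object:
// milli-CPU requests map to cpu.shares, milli-CPU limits to a cfs quota
// over the default 100ms period, and the memory limit is in bytes.
type linuxResources struct {
	CPUShares int64
	CPUQuota  int64
	CPUPeriod int64
	MemLimit  int64
}

func calculateLinuxResources(cpuReqMilli, cpuLimMilli, memLimBytes int64) linuxResources {
	const period = 100000 // cfs period in microseconds
	r := linuxResources{CPUPeriod: period, MemLimit: memLimBytes}
	r.CPUShares = cpuReqMilli * 1024 / 1000
	if cpuLimMilli > 0 {
		r.CPUQuota = cpuLimMilli * period / 1000
	}
	return r
}

func main() {
	fmt.Printf("%+v\n", calculateLinuxResources(500, 1000, 256<<20))
	// {CPUShares:512 CPUQuota:100000 CPUPeriod:100000 MemLimit:268435456}
}
```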
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
This parameter ensures that CPUs are always allocated in groups of size
'cpuGroupSize'. This is important, for example, to ensure that all CPUs (i.e.
hyperthreads) from the same core are handed out together.
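A minimal sketch of what the parameter enforces, with a hypothetical
helper (not the actual allocator):

```go
package main

import "fmt"

// alignToGroup rounds a CPU count down to a multiple of cpuGroupSize, so
// with cpuGroupSize=2 hyperthread siblings from the same core are always
// handed out together.
func alignToGroup(cpus, cpuGroupSize int) int {
	return (cpus / cpuGroupSize) * cpuGroupSize
}

func main() {
	fmt.Println(alignToGroup(5, 2)) // 4 -- the odd CPU is left for another node
}
```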
Signed-off-by: Kevin Klues <kklues@nvidia.com>
As part of this, pull out all of the existing "TakeByTopology" tests and have
them be called by the original TestTakeByTopologyNUMAPacked() as well as the
new TestTakeByTopologyNUMADistributed() test. In a subsequent commit, we will
add some tests that should differ between these two algorithms.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
The first implements the original algorithm, which packs CPUs onto NUMA
nodes if more than one NUMA node is required to satisfy the allocation.
The second distributes CPUs across NUMA nodes if they can't all fit into
one.
The "distributing" algorithm is currently a noop and just returns an error of
"unimplemented". A subsequent commit will add the logic to implement this
algorithm according to KEP 2902:
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2902-cpumanager-distribute-cpus-policy-option
Signed-off-by: Kevin Klues <kklues@nvidia.com>
This batch of tests adds a fake topology on which each NUMA node
has multiple sockets. We haven't yet found a real hardware topology in
the wild like this, but we need one to fully exercise the code.
So, until we find such a hardware topology, we add a fake one by flipping
the NUMA/socket config of the existing dual Xeon Gold 6320.
Signed-off-by: Francesco Romani <fromani@redhat.com>
This batch of tests adds a real topology on which each physical socket
has multiple NUMA zones, taken from a real dual Xeon Gold 6320.
Signed-off-by: Francesco Romani <fromani@redhat.com>
The existing unit tests were performing subtests without
actually using the full features of the testing package
(https://pkg.go.dev/testing#hdr-Subtests_and_Sub_benchmarks).
Update them with fairly minimal changes. The patch is deceptively
large because we need to move the code inside a new block.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Use `cmp.Diff` from the go-cmp package in the unit tests, moving away
from `reflect.DeepEqual`. This gives us a clearer picture of the
differences when the tests fail.
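The typical pattern looks like this (`cmp.Diff` is go-cmp's real API; the
helper wrapper is just an example):

```go
package example

import (
	"testing"

	"github.com/google/go-cmp/cmp"
)

// assertEqual shows the pattern: unlike reflect.DeepEqual, a failing
// cmp.Diff prints exactly which fields differ (-want +got).
func assertEqual(t *testing.T, want, got interface{}) {
	t.Helper()
	if diff := cmp.Diff(want, got); diff != "" {
		t.Errorf("unexpected result (-want +got):\n%s", diff)
	}
}
```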
Signed-off-by: Francesco Romani <fromani@redhat.com>
This allows us to get rid of the check for determining which one is higher all
throughout the code. Now we just check once and instantiate an interface of the
appropriate type that makes sure the ordering in the hierarchy is preserved
through the appropriate calls.
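A minimal sketch of the pattern, with stub types standing in for the real
accumulator; the interface and implementation names are illustrative:

```go
package main

import "fmt"

// Stubs standing in for the real accumulator; the point is the pattern.
type cpuAccumulator struct{}

func (a *cpuAccumulator) takeFullNUMANodes() { fmt.Println("take full NUMA nodes") }
func (a *cpuAccumulator) takeFullSockets()   { fmt.Println("take full sockets") }

// Decide once which level is higher, then call through the interface so
// the hierarchy ordering is preserved everywhere else in the code.
type numaOrSocketsFirst interface {
	takeFullFirstLevel()
	takeFullSecondLevel()
}

type numaFirst struct{ acc *cpuAccumulator }

func (n numaFirst) takeFullFirstLevel()  { n.acc.takeFullNUMANodes() }
func (n numaFirst) takeFullSecondLevel() { n.acc.takeFullSockets() }

type socketsFirst struct{ acc *cpuAccumulator }

func (s socketsFirst) takeFullFirstLevel()  { s.acc.takeFullSockets() }
func (s socketsFirst) takeFullSecondLevel() { s.acc.takeFullNUMANodes() }

func main() {
	acc := &cpuAccumulator{}
	var order numaOrSocketsFirst = numaFirst{acc} // or socketsFirst{acc}
	order.takeFullFirstLevel()
	order.takeFullSecondLevel()
}
```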
Signed-off-by: Kevin Klues <kklues@nvidia.com>
The feature gate gets locked to "true", with the goal to remove it in two
releases.
All code now can assume that the feature is enabled. Tests for "feature
disabled" are no longer needed and get removed.
Some code wasn't using the new helper functions yet. That gets changed while
touching those lines.
This implements the replacement of klog output to different files per level
with optionally splitting JSON output into two streams: one for info messages
on stdout, one for error messages on stderr. The info messages can get buffered
to increase performance. Because stdout and stderr might be merged by the
consumer, the info stream gets flushed before writing an error, to ensure that
the order of messages is preserved.
This also ensures that the following code pattern doesn't leak info messages:
    klog.ErrorS(err, ...)
    os.Exit(1)
Commands explicitly have to flush before exiting via logs.FlushLogs. Most
already do. But buffered info messages can still get lost during an unexpected
program termination, therefore buffering is off by default.
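A sketch of the safe exit pattern using the flushing helper named above
(`logs.FlushLogs` and `klog.ErrorS` are real; the `fatal` wrapper is
hypothetical):

```go
package main

import (
	"errors"
	"os"

	"k8s.io/component-base/logs"
	"k8s.io/klog/v2"
)

// fatal flushes buffered logs (including the newer logr backend, via
// logs.FlushLogs) before terminating, so the klog.ErrorS + os.Exit
// pattern no longer loses buffered info messages.
func fatal(err error) {
	klog.ErrorS(err, "command failed")
	logs.FlushLogs()
	os.Exit(1)
}

func main() {
	fatal(errors.New("example"))
}
```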
The new options get added to the v1alpha1 LoggingConfiguration with new command
line flags. Because it is an alpha field, changing it inside the v1beta kubelet
config should be okay as long as the fields are clearly marked as alpha.
The name concatenation and ownership check were originally considered small
enough to not warrant dedicated functions, but the intent of the code is more
readable with them.
When adding the ephemeral volume feature, the special case for
PersistentVolumeClaim volume sources in kubelet's host path and node
limits checks was overlooked. An ephemeral volume source is another
way of referencing a claim and has to be treated the same way.
* Use utilpointer to get a pointer
* Add tests for kubelet default configs
* Change copyright year from 2015 to 2021
* Run gofmt
* Add all negative and all positive test cases
We graduate the `CPUManagerPolicyOptions` feature to beta
in the 1.23 cycle, and we add new experimental feature gates
to guard new options which are planned for the 1.23 and
following cycles.
We introduce additional feature gates called `CPUManagerPolicyAlphaOptions`
and `CPUManagerPolicyBetaOptions`. The basic idea is to avoid the
cumbersome process of adding a feature gate for each option, and to have
feature gates which track the maturity level of _groups_ of options.
Besides this change, the graduation process, and the process in general,
for adding new policy options is still unchanged.
The `full-pcpus-only` option added in the 1.22 cycle is intentionally
moved into the beta policy options.
For more details:
- KEP: https://github.com/kubernetes/enhancements/pull/2933
- sig-arch discussion:
https://groups.google.com/u/1/g/kubernetes-sig-architecture/c/Nxsc7pfe5rw
Signed-off-by: Francesco Romani <fromani@redhat.com>
GetAllocatableDevices, which is needed to support the podresources
API, doesn't take the device health into account when computing
its output.
In this PR we address this gap and add unit tests along the way
to prevent regressions. This gives us a good initial coverage.
E2E tests to cover this case are much harder to write, because
we would need to inject faults to trigger the unhealthy status.
We will evaluate adding these tests in later PRs.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Remove the VolumeSubpath feature gate.
The feature gate convention has been updated since this was introduced to
indicate that feature gates "are intended to be deprecated and removed
after a feature becomes GA or is dropped."
If a pod is killed (no longer wanted) and then a subsequent create/
add/update event is seen in the pod worker, assume that a pod UID
was reused (as it could be in static pods) and have the next
SyncKnownPods after the pod terminates remove the worker history so
that the config loop can restart the static pod, as well as return
to the caller the fact that this termination was not final.
The housekeeping loop then reconciles the desired state of the Kubelet
(pods in pod manager that are not in a terminal state, i.e. admitted
pods) with the pod worker by resubmitting those pods. This adds a
small amount of latency (2s) when a pod UID is reused and the pod
is terminated and restarted.
A pod that has been rejected by admission will have status manager
set the phase to Failed locally, which may take some time to
propagate to the apiserver. The rejected pod will be included in
admission until the apiserver propagates the change back, which
was an unintended regression when checking pod worker state as
authoritative.
A pod that is terminal in the API may still be consuming resources
on the system, so it should still be included in admission.
Prevent pods whose resources can be satisfied by a single NUMA node from
being started across multiple NUMA nodes.
The code returned before it updated the minimal number of NUMA nodes that
can satisfy the container requests.
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
The configuration is deprecated and targets removal in v1.23. Test
cases have been changed as well.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Fixes two issues with how the pod worker refactor calculated the
pods that admission could see (GetActivePods() and
filterOutTerminatedPods()):
First, completed pods must be filtered from the "desired" state
for admission, which arguably should be happening earlier in
config. Exclude the two terminal pod states from GetActivePods().
Second, the previous check introduced with the pod worker lifecycle
ownership changes was subtly wrong for the admission use case.
Admission has to include pods that haven't yet hit the pod worker,
which CouldHaveRunningContainers was filtering out (because the
pod worker hasn't seen them). Introduce a weaker check -
IsPodKnownTerminated() - that returns true only if the pod is in
a known terminated state (no running containers AND known to pod
worker). This weaker check may only be called from components that
need admitted pods, not other kubelet subsystems.
This commit does not fix the long standing bug that force deleted
pods are omitted from admission checks, which must be fixed by
having GetActivePods() also include pods "still terminating".
If a device plugin returns a device without topology, keep it internally
as NUMA node -1. This helps at the podresources level to not export NUMA
topology; otherwise the topology is exported with NUMA node id 0,
which is not accurate.
It's impossible to uncover this bug just by tracing json.Marshal(resp)
in the podresources client, because the NUMANode field ID has the JSON
property omitempty, so when ID=0 it is shown as an empty NUMANode.
To reproduce it, it is better to iterate over the devices and just
trace dev.Topology.Nodes[0].ID.
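A small demonstration of why the omitempty tag hides the bug
(illustrative struct mirroring the podresources field):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NUMANode mirrors the podresources field with its omitempty tag: a zero
// ID disappears from the JSON entirely, so a device wrongly pinned to
// NUMA node 0 looks the same as one with no topology at all.
type NUMANode struct {
	ID int64 `json:"ID,omitempty"`
}

func main() {
	b, _ := json.Marshal(NUMANode{ID: 0})
	fmt.Println(string(b)) // {} -- the spurious "NUMA node 0" is invisible
	b, _ = json.Marshal(NUMANode{ID: 1})
	fmt.Println(string(b)) // {"ID":1}
}
```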
Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>
This is a knob added by runc 1.0.2 specifically for Kubernetes,
which tells the runc/libcontainer/cgroups/systemd v1 manager not to
freeze the cgroup in Set().
We set this knob here because this code is only used for pod
(rather than container) management, and in this place we create or
update the pod cgroup with no device limits set, so we can skip the
freeze.
If this knob is not set, libcontainer's cgroup v1 manager tries to
figure out whether the freeze is needed or not, but it's a somewhat
expensive check to perform, thus the knob is a shortcut.
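A sketch of where the knob gets set when building the pod cgroup config;
the exact field placement in libcontainer's configs package is an
assumption based on the description above:

```go
package main

import "github.com/opencontainers/runc/libcontainer/configs"

// newPodCgroupConfig sketches the idea: pod-level cgroups carry no device
// limits, so the v1 manager's freeze-on-Set can be skipped outright.
func newPodCgroupConfig() *configs.Cgroup {
	return &configs.Cgroup{
		Resources: &configs.Resources{
			SkipFreezeOnSet: true, // runc 1.0.2 knob; avoids the expensive check
		},
	}
}

func main() { _ = newPodCgroupConfig() }
```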
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>