The types were different, so the diff output is not useful; both
should be pointers:
```
Feb 05 19:44:40 ci-ln-6k7l4-w-c-w9wbb.c.openshift-gce-devel-ci.internal hyperkube[2737]: I0205 19:44:40.222259 2737 status_manager.go:642] Pod status is inconsistent with cached status for pod "prometheus-k8s-1_openshift-monitoring(0e9137b8-3bd2-4353-b7f5-672749106dc1)", a reconciliation should be triggered:
Feb 05 19:44:40 ci-ln-6k7l4-w-c-w9wbb.c.openshift-gce-devel-ci.internal hyperkube[2737]: interface{}(
Feb 05 19:44:40 ci-ln-6k7l4-w-c-w9wbb.c.openshift-gce-devel-ci.internal hyperkube[2737]: - s"&PodStatus{Phase:Running,Conditions:[]PodCondition{PodCondition{Type:Initialized,Status:True,LastProbeTime:0001-01-01 00:00:00 +0000 UTC,LastTransitionTime:2020-02-05 19:13:30 +0000 UTC,Reason:,Message:,},PodCondit>
Feb 05 19:44:40 ci-ln-6k7l4-w-c-w9wbb.c.openshift-gce-devel-ci.internal hyperkube[2737]: + v1.PodStatus{
Feb 05 19:44:40 ci-ln-6k7l4-w-c-w9wbb.c.openshift-gce-devel-ci.internal hyperkube[2737]: + Phase: "Running",
Feb 05 19:44:40 ci-ln-6k7l4-w-c-w9wbb.c.openshift-gce-devel-ci.internal hyperkube[2737]: + Conditions: []v1.PodCondition{
```
Previously, the various Merge() policies of the TopologyManager each
returned their own lifecycle.PodAdmitResult result. However, for
consistency in any failed admits, this is better handled in the
top-level TopologyManager, with each policy only returning a boolean
indicating whether or not it would admit the pod. This commit changes
the semantics to match this logic.
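A minimal sketch of the shape this implies, using placeholder types rather
than the real topologymanager and lifecycle APIs; the reason string is
illustrative only:
```go
package sketch

import "fmt"

// Placeholder types; the real TopologyManager uses richer TopologyHint
// and lifecycle.PodAdmitResult types than these.
type TopologyHint struct {
	Preferred bool
}

type PodAdmitResult struct {
	Admit   bool
	Reason  string
	Message string
}

// Each policy now only merges hints and answers "would I admit this pod?".
type Policy interface {
	Merge(providersHints [][]TopologyHint) (TopologyHint, bool)
}

// The top-level manager owns the translation of that boolean into a
// single, consistently formatted admit result.
func admitPod(p Policy, hints [][]TopologyHint) PodAdmitResult {
	bestHint, admit := p.Merge(hints)
	if !admit {
		return PodAdmitResult{
			Admit:   false,
			Reason:  "TopologyAffinityError", // illustrative reason string
			Message: fmt.Sprintf("rejected by topology policy, best hint: %+v", bestHint),
		}
	}
	return PodAdmitResult{Admit: true}
}
```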
Previously, we unconditionally removed *all* topology hints for a pod
whenever just one container was being removed. This commit makes it so
we only remove the hints for the single container being removed, and
then conditionally remove the pod from podTopologyHints[podUID] once
no containers are left in it.
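A minimal sketch of the intended bookkeeping, using a simplified map type
rather than the real TopologyManager state:
```go
package sketch

// Simplified stand-in; the real code stores per-resource TopologyHints
// rather than empty structs.
type TopologyHint struct{}

type manager struct {
	// podUID -> containerName -> hint
	podTopologyHints map[string]map[string]TopologyHint
}

// removeContainer drops only the hints for the container being removed,
// and deletes the pod entry once no containers are left under it.
func (m *manager) removeContainer(podUID, containerName string) {
	delete(m.podTopologyHints[podUID], containerName)
	if len(m.podTopologyHints[podUID]) == 0 {
		delete(m.podTopologyHints, podUID)
	}
}
```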
Add a unit test for updating the container hugepage limit.
Add a warning message about ignoring case.
Update error handling for hugepage size requirements.
Signed-off-by: sewon.oh <sewon.oh@samsung.com>
As of https://github.com/kubernetes/kubernetes/pull/72831, the minimum
docker version is 1.13.1 (and the minimum API version is 1.26). The
only time the `RuntimeAdmitHandler` returns anything other than accept
is when the Docker API version < 1.24. In other words, we can be
confident that Docker will always support sysctl.
As a result, we can delete this unnecessary and docker-specific code.
- Move the remaining logic from mergeProvidersHints into a generic, top-level
mergeFilteredHints function.
- Add numaNodes as a parameter in order to make the function generic.
- Move the single-NUMA-node-specific check into the single-numa-node Merge
function.
- Move the initial 'filtering' functionality into a generic
filterProvidersHints function in the top-level policy.go.
- Call the new function from the top-level Merge function.
- Rename some variables/parameters to reflect these changes.
A recent change made it so that the CPUManager receives a list of
initial containers that exist on the system at startup. This list can be
non-empty, for example, after a kubelet restart.
This commit ensures that the CPUManager's containerMap structure is
initialized with the containers from this list.
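A minimal sketch of that initialization, with a simplified stand-in for the
containerMap type (not the real containermap package):
```go
package sketch

// Simplified stand-in for the CPUManager's containerMap bookkeeping:
// containerID -> (podUID, containerName).
type containerMap map[string]struct{ podUID, containerName string }

type manager struct {
	containerMap containerMap
}

// On startup, seed the manager's containerMap with the containers that
// already exist on the node (e.g. after a kubelet restart), so later
// lookups by containerID resolve to (podUID, containerName).
func (m *manager) start(initialContainers containerMap) {
	if m.containerMap == nil {
		m.containerMap = containerMap{}
	}
	for id, entry := range initialContainers {
		m.containerMap[id] = entry
	}
}
```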
Previously, `pkg/kubelet/qos` contained two different docker-specific
OOM constants. One set the oom adj for sandbox docker containers and the
other set the oom adj for the docker daemon. Move both to be closer to
their actual usages in `dockershim`.
This change addresses a TODO and leads us towards the overall goal of
making `pkg/kubelet` runtime agnostic except for `dockershim`.
The logic has been updated to match the logic of the best-effort policy
except in two places:
1) The hint filtering function has been updated to allow "don't care"
hints encoded with a `nil` affinity mask to pass through the filter, in
addition to hints that have just a single NUMA bit set.
2) After calculating the `bestHint` we transform "don't care" affinities
encoded as having all NUMA bits set in their affinity masks into "don't
care" affinities encoded as `nil`.
- Initialize bestHint to TopologyHint{}
- Update checks.
- Move the generic unit test case into policy-specific tests and update the
expected outcome to reflect these changes.
- Restructure function
- Remove bug fix for catching {nil true} - To be fixed in later commit
- Restore unit tests to original state for testing filterHints
This is to keep consistency with the other policies.
This change may be made across all policies in a future PR, but it is out of
scope for this PR for now.
- Best Effort Policy: Return a hint with nil affinity, as opposed to
defaultAffinity, when the provider has no preference for NUMA affinity or no
possible NUMA affinities.
- Single NUMA Node Policy: Remove defaultHint from mergeProvidersHints.
Instead return appropriate TopologyHint where required.
- Update unit tests to reflect changes. Some test cases moved into
individual policy test functions due to differing returned affinities
per policy.
- Remove getHintMatch method.
- Replace with simplified versions of mergePermutation and
iterateAllProviderTopologyHints methods - as used in best-effort.
- Remove getHintMatch unit tests.
- Update filterHints test to reflect changes in previous commit.
- Some common test cases achieve differing expected results based on
policy due to independent merge strategies. These cases are moved into
individual policy-based test functions.
- Only append valid preferred-true hints to filtered
- Return true if allResourceHints only consist of
nil-affinity/preferred-true hints: {nil true}, update defaultHint
preference accordingly.
Explanation taken from original commit:
- Change the current method of finding the best hint.
Instead of going over all permutations, sort the hints and find
the narrowest hint common to all resources.
- Break out early when merging to a preferred hint is not possible
- Remove need to pass policy and numaNodes as arguments
- Remove PolicySingleNUMANode special case check in policy_best_effort
- Add mergeProviderHints base to policy_single_numa_node for upcoming
commit
This check is redundant since we protect this call with a call to
`m.sourcesReady.AllReady()` earlier on. Moreover, having this check in
place means that we will leave some stale state around in cases where
there are actually no active pods in the system and this loop hasn't
cleaned them up yet. This can happen, for example, if a pod exits while
the kubelet is down for some reason. We see this exact case being
triggered in our e2e tests, where a test has been failing since October
when this change was first introduced.
Add unit tests for OomWatcher that actually test the logic defined in
the `Start` method. As a result of an earlier refactor, it's now trivial
to mock the OOMInstance events which the `oom_watcher` is supposed to be
watching.
Clean up code in PLEG which was only necessary for the `rkt` runtime.
Rkt is no longer a built-in runtime and docker(shim) uses the CRI, so
it's safe to remove this code entirely.
This diff removes the last mentions of `rkt` in the kubelet.
This diff contains a strict refactor; there are no behavioral changes.
Address a long-standing TODO in `oom_watcher_linux_test.go` around test
coverage. We refactor our `oom.Watcher` so it takes in a struct
fulfilling the `streamer` interface (i.e. one that defines a `StreamOoms` method).
In production, we will continue to use the `oomparser` from `cadvisor`.
However, for testing purposes, we can now create our own `fakeStreamer`,
and control how it streams `oomparser.OomInstance`. With this fake, we
can implement richer unit testing for the `oom.Watcher` itself.
Actually adding the additional unit tests will come in a later commit.
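A minimal sketch of the fake, with simplified stand-ins for cadvisor's
oomparser types:
```go
package sketch

// Simplified stand-in for cadvisor's oomparser.OomInstance.
type OomInstance struct {
	Pid           int
	ProcessName   string
	ContainerName string
}

// streamer is the narrow interface the watcher depends on, so tests can
// substitute a fake for the real oomparser.
type streamer interface {
	StreamOoms(chan *OomInstance)
}

// fakeStreamer replays a canned list of OOM events into the channel,
// mimicking a live parser.
type fakeStreamer struct {
	oomInstancesToStream []*OomInstance
}

var _ streamer = &fakeStreamer{}

func (f *fakeStreamer) StreamOoms(out chan *OomInstance) {
	for _, inst := range f.oomInstancesToStream {
		out <- inst
	}
}
```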
Part of efforts to clean up mentions of rkt in kubelet.
rkt was removed entirely in 1.11, in favor of using `rktlet` and CRI
instead. It should no longer be listed at all as a runtime.
This commit is part of a larger effort to clean up references to `rkt`
in the kubelet.
Previously, this comment hard-coded which integrations required
the cadvisor stats provider. The comment has grown stale
(i.e. referenced rkt and did not reference cri-o).
Update the comment to instead point to the code which determines which
integrations need the cadvisor stats provider.
The `FakeDockerClient` had a number of methods defined on it which were
not being called anywhere. The majority were of the form `Assert...`.
In the spirit of removing dead code, remove the methods which aren't
being called.
Expose the measurement that kubelet uses to judge that "PLEG is
unhealthy". If we can observe the measurement growing then we can
alert before the node goes unhealthy.
Note that the existing metrics PLEGRelistInterval and
PLEGRelistDuration are poor for this, because when relist() gets
stuck they are never updated.
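One plausible shape for such a measurement, using the plain Prometheus client
rather than the kubelet's metrics wrappers; the metric name here is
illustrative, not the one the kubelet registers:
```go
package sketch

import "github.com/prometheus/client_golang/prometheus"

// Illustrative gauge; the actual metric name and registration in the
// kubelet differ.
var plegLastSeen = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "pleg_last_seen_seconds",
	Help: "Timestamp of the last time PLEG was observed to be active.",
})

func init() {
	prometheus.MustRegister(plegLastSeen)
}

// Updating the gauge on every relist means it keeps advancing only while
// relist is healthy; if relist() gets stuck, the gauge stops moving and
// alerting can fire on its age before the node is marked unhealthy.
func onRelist() {
	plegLastSeen.SetToCurrentTime()
}
```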
Signed-off-by: Bryan Boreham <bryan@weave.works>
The `recorder.PastEventf` method wasn't actually working as advertised.
It was supposed to accept a timestamp, which would be used when
generating the event. However, as the
[source code](https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/record/event.go#L316)
shows, this `timestamp` was never actually used.
In other words, `PastEventf` is identical to `Eventf`.
We have two options: one would be to fix `PastEventf` so that it works
as advertised. The other would be to delete `PastEventf` and only
support `Eventf`.
Ultimately, I could only find one use of `PastEventf` in the code base,
so I propose we just delete `PastEventf` and convert all uses to
`Eventf`.
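A sketch of what such a conversion looks like; the object, reason, and message
here are placeholders, not the actual call site:
```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// emitEvent shows the shape of the conversion. The event reason and
// message are placeholders.
func emitEvent(recorder record.EventRecorder, obj runtime.Object, err error) {
	// Before (the metav1.Time argument was silently ignored):
	//   recorder.PastEventf(obj, metav1.Now(), v1.EventTypeWarning, "SomeReason", "operation failed: %v", err)
	// After: identical behavior, without the misleading timestamp parameter.
	recorder.Eventf(obj, v1.EventTypeWarning, "SomeReason", "operation failed: %v", err)
}
```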
This change is to prevent problems when we remove the V1->V2 migration
code in the future. Without this, the checksums of all checkpoints would
be hashed with the name CPUManagerCheckpointV2 embedded inside of them,
which is undesirable. We want the checkpoints to be hashed with the name
CPUManagerCheckpoint instead.
The updated CPUManager from PR #84462 implements logic to migrate the
CPUManager checkpoint file from an old format to a new one. To do so, it
defines the following types:
```
type CPUManagerCheckpoint = CPUManagerCheckpointV2
type CPUManagerCheckpointV1 struct { ... }
type CPUManagerCheckpointV2 struct { ... }
```
This replaces the old definition of just:
```
type CPUManagerCheckpoint struct { ... }
```
Code was put in place to ensure proper migration from checkpoints in V1
format to checkpoints in V2 format. However (and this is a big however),
all of the unit tests were performed on V1 checkpoints that were
generated using the type name `CPUManagerCheckpointV1` and not the
original type name of `CPUManagerCheckpoint`. As such, the checksum in
the checkpoint file uses the `CPUManagerCheckpointV1` type to calculate
its checksum and not the original type name of `CPUManagerCheckpoint`.
This causes problems in the real world since all pre-1.18 checkpoint
files will have been generated with the original type name of
`CPUManagerCheckpoint`. When verifying the checksum of the checkpoint
file across an upgrade to 1.18, the checksum is calculated assuming
a type name of `CPUManagerCheckpointV1` (which is incorrect) and the
file is seen to be corrupt.
This patch ensures that all V1 checksums are verified against a type
name of `CPUManagerCheckpoint` instead of `CPUManagerCheckpointV1`.
It also locks the algorithm used to calculate the checksum in place,
since it will never change in the future (for pre-1.18 checkpoint
files at least).
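A simplified, self-contained illustration of why the type name matters: the
real code uses its own checksum package, but the mechanism is similar in that
the checksum is computed over a dump of the object that embeds its Go type
name.
```go
package main

import (
	"fmt"
	"hash/fnv"
)

type CPUManagerCheckpoint struct {
	PolicyName string
	Entries    map[string]string
}

// Same fields, different type name.
type CPUManagerCheckpointV1 struct {
	PolicyName string
	Entries    map[string]string
}

// checksum is a simplified stand-in for the checkpoint checksum: hash a
// printed dump of the object, which embeds the Go type name.
func checksum(obj interface{}) uint32 {
	h := fnv.New32a()
	fmt.Fprintf(h, "%#v", obj)
	return h.Sum32()
}

func main() {
	old := CPUManagerCheckpoint{PolicyName: "static"}
	renamed := CPUManagerCheckpointV1{PolicyName: "static"}
	// The two checksums differ purely because of the type name embedded
	// in the dump, even though the field values are identical.
	fmt.Println(checksum(old), checksum(renamed))
}
```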
The information associated with these containers is used to migrate the
CPUManager state from its old format to its new one (i.e. keyed off of
podUID and containerName instead of containerID).
For now, we just pass 'nil' as the set of 'initialContainers' for
migrating from the old state semantics to the new ones. In a subsequent
commit we will pull this information from higher layers so that we can
pass it down at this stage properly.
Previously, the state was keyed off of containerID instead of podUID and
containerName. Unfortunately, this is no longer possible as we move to a
model where we allocate CPUs to containers at pod admission time rather
than at container start time.
This patch is the first step towards full migration to the new
semantics. Only the unit tests in cpumanager/state are passing. In
subsequent commits we will update the CPUManager itself to use these new
semantics.
This patch also includes code to do migration from the old checkpoint format
to the new one, assuming the existence of a ContainerMap with the proper
mapping of (containerID)->(podUID, containerName). A subsequent commit
will update code in higher layers to make sure that this ContainerMap is
made available to this state logic.
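A minimal sketch of that migration, using simplified stand-ins for the
ContainerMap and the two state formats:
```go
package sketch

// Simplified stand-ins: the real code uses the containermap package and
// the CPUManager's checkpoint types.
type containerMap map[string]struct{ podUID, containerName string }

// Old state: containerID -> assigned CPU set (represented as a string here).
type stateV1 map[string]string

// New state: podUID -> containerName -> assigned CPU set.
type stateV2 map[string]map[string]string

// migrate rekeys the old state using the (containerID) -> (podUID,
// containerName) mapping supplied by higher layers.
func migrate(old stateV1, cm containerMap) stateV2 {
	next := stateV2{}
	for containerID, cpus := range old {
		entry, ok := cm[containerID]
		if !ok {
			// Without a mapping we cannot rekey this entry; the real
			// code surfaces this as a migration error.
			continue
		}
		if next[entry.podUID] == nil {
			next[entry.podUID] = map[string]string{}
		}
		next[entry.podUID][entry.containerName] = cpus
	}
	return next
}
```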
The trailing for loop removed in this PR could include both active and
inactive references. This modification guarantees that at most one
reference per container is returned.
For Windows, the CPU requests (Shares, Count, and Maximum) are mutually
exclusive; however, Kubernetes sends them all anyway in the pod spec.
When using dockershim this is not an issue, as Docker checks for this specific
situation here: 1bd184a4c2/daemon/daemon_windows.go (L87-L106)
However, when using CRI-Containerd these pods fail to spawn with an error from
hcsshim. This PR intends to filter these values before they are sent to the
CRI rather than relying on the runtime to do it.
Related to: https://github.com/kubernetes/kubernetes/issues/84804
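A hedged sketch of the kind of filtering described, with a struct that mirrors
the mutually exclusive Windows CPU fields; the precedence order chosen here is
illustrative, not necessarily what the PR implements:
```go
package sketch

// Mirrors the mutually exclusive Windows CPU fields in the CRI's
// WindowsContainerResources message (simplified).
type windowsCPUResources struct {
	CPUShares  int64
	CPUCount   int64
	CPUMaximum int64
}

// filterCPUResources keeps only one of the mutually exclusive CPU
// settings before handing the config to the runtime. The precedence used
// here is illustrative; the important part is that at most one field
// survives.
func filterCPUResources(r windowsCPUResources) windowsCPUResources {
	switch {
	case r.CPUCount > 0:
		return windowsCPUResources{CPUCount: r.CPUCount}
	case r.CPUMaximum > 0:
		return windowsCPUResources{CPUMaximum: r.CPUMaximum}
	default:
		return windowsCPUResources{CPUShares: r.CPUShares}
	}
}
```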