These three options are the ones from logs.AddFlags that are not deprecated.
Therefore it makes sense to also make them available via the configuration file
support in the one command which currently supports that (kubelet).
Long-term, all commands should use LoggingConfiguration, either with a
configuration file (as in kubelet) or via flags (kube-scheduler,
kube-apiserver, kube-controller-manager).
Short-term, both approaches have to be supported. As the majority of the
commands only use logs.AddFlags, that function by default continues to register
the flags and only leaves that to Options.AddFlags when explicitly requested.
A drive-by bug fix is done for log flushing: the periodic flushing called
klog.Flush and therefore missed explicit flushing of the newer logr
backend. This bug was never present in any Kubernetes release, so the
fix is not submitted in a separate PR.
This feature graduated to GA in v1.11 and will always be
enabled, so there is no longer a need to check whether it is enabled.
Signed-off-by: Konstantin Misyutin <konstantin.misyutin@huawei.com>
* De-share the Handler struct in core API
An upcoming PR adds a handler that only applies on one of these paths.
Having fields that don't work seems bad.
This never should have been shared. Lifecycle hooks are like a "write"
while probes are more like a "read". HTTPGet and TCPSocket don't really
make sense as lifecycle hooks (but I can't take that back). When we add
gRPC, it is EXPLICITLY a health check (defined by gRPC) not an arbitrary
RPC - so a probe makes sense but a hook does not.
In the future I can also see adding lifecycle hooks that don't make
sense as probes. E.g. 'sleep' is a common lifecycle request. The only
option is `exec`, which requires having a sleep binary in your image.
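As a rough sketch of where this split lands (simplified stand-in types for illustration; the real definitions live in the core v1 API):

```go
type ExecAction struct{ Command []string }
type HTTPGetAction struct{ Path, Host string }
type TCPSocketAction struct{ Host string }
type GRPCAction struct {
	Port    int32
	Service *string
}

// ProbeHandler covers "read"-style checks, including the gRPC health check.
type ProbeHandler struct {
	Exec      *ExecAction
	HTTPGet   *HTTPGetAction
	TCPSocket *TCPSocketAction
	GRPC      *GRPCAction
}

// LifecycleHandler covers "write"-style hooks; HTTPGet and TCPSocket stay
// only because they can't be taken back, and gRPC is deliberately absent.
type LifecycleHandler struct {
	Exec      *ExecAction
	HTTPGet   *HTTPGetAction
	TCPSocket *TCPSocketAction
}
```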
* Run update scripts
This is partially to allow the kube alpha tests to pass until CRI implementations have support, but also to handle this error situation a bit more elegantly.
Signed-off-by: Peter Hunt <pehunt@redhat.com>
This commit adds an initial implementation of translating from the new CRI fields
to the /stats/summary PodStats object
Signed-off-by: Peter Hunt <pehunt@redhat.com>
The commit a8b8995ef2
changed the content of the data kubelet writes in the checkpoint.
Unfortunately, the checkpoint restore code was not updated,
so if we upgrade kubelet from pre-1.20 to 1.20+, the
device manager can no longer restore its state correctly.
The only trace of this misbehaviour is this line in the
kubelet logs:
```
W0615 07:31:49.744770 4852 manager.go:244] Continue after failing to read checkpoint file. Device allocation info may NOT be up-to-date. Err: json: cannot unmarshal array into Go struct field PodDevicesEntry.Data.PodDeviceEntries.DeviceIDs of type checkpoint.DevicesPerNUMA
```
If we hit this bug, the device allocation info is
indeed NOT up-to-date until the device plugins register
themselves again. This can take up to a few minutes, depending
on the specific device plugin.
While the device manager state is inconsistent:
1. the kubelet will NOT update the device availability to zero, so
the scheduler will send pods towards the inconsistent kubelet.
2. at pod admission time, the device manager allocation will not
trigger, so pods will be admitted without devices actually
being allocated to them.
To fix these issues, we add support to the device manager to
read pre-1.20 checkpoint data. We retroactively call this
format "v1".
Signed-off-by: Francesco Romani <fromani@redhat.com>
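A minimal sketch of the tolerant-loading idea (type and function names assumed for illustration, not the actual device manager code): try the current format first, then fall back to the flat pre-1.20 "v1" layout.

```go
import "encoding/json"

type podDevicesEntryV1 struct {
	PodUID, ContainerName, ResourceName string
	DeviceIDs                           []string // v1: flat list, no NUMA info
}

type podDevicesEntry struct {
	PodUID, ContainerName, ResourceName string
	DeviceIDs                           map[int64][]string // NUMA node -> devices
}

func decodeEntries(blob []byte) ([]podDevicesEntry, error) {
	var current []podDevicesEntry
	if err := json.Unmarshal(blob, &current); err == nil {
		return current, nil
	}
	// Current-format decoding failed: retry with the pre-1.20 layout.
	var v1 []podDevicesEntryV1
	if err := json.Unmarshal(blob, &v1); err != nil {
		return nil, err
	}
	// v1 data carries no NUMA information; a placeholder node (-1 here)
	// keeps the devices usable until the plugins re-register.
	out := make([]podDevicesEntry, 0, len(v1))
	for _, e := range v1 {
		out = append(out, podDevicesEntry{
			PodUID:        e.PodUID,
			ContainerName: e.ContainerName,
			ResourceName:  e.ResourceName,
			DeviceIDs:     map[int64][]string{-1: e.DeviceIDs},
		})
	}
	return out, nil
}
```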
Separate the CPU/memory request/limit -> Linux resource conversion into its
own function for better reuse.
Elsewhere in the kuberuntime pkg, we will want to leverage this
requests/limits-to-LinuxResources conversion.
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
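A simplified sketch of such a factored-out helper (standalone types; the standard cgroup conversion rules are assumed here for illustration):

```go
type linuxResources struct {
	CPUShares          int64
	CPUQuota           int64 // microseconds per period; 0 means unlimited
	CPUPeriod          int64 // microseconds
	MemoryLimitInBytes int64
}

const quotaPeriod = 100000 // the standard 100ms CFS period, in microseconds

// calculateLinuxResources maps CPU/memory requests and limits (millicores
// and bytes) onto cgroup values, reusable by all kuberuntime callers.
func calculateLinuxResources(cpuRequestMilli, cpuLimitMilli, memLimitBytes int64) linuxResources {
	r := linuxResources{CPUPeriod: quotaPeriod, MemoryLimitInBytes: memLimitBytes}
	// cpu.shares: 1024 shares per core, with the kernel minimum of 2.
	r.CPUShares = cpuRequestMilli * 1024 / 1000
	if r.CPUShares < 2 {
		r.CPUShares = 2
	}
	if cpuLimitMilli > 0 {
		r.CPUQuota = cpuLimitMilli * quotaPeriod / 1000
	}
	return r
}
```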
This parameter ensures that CPUs are always allocated in groups of size
'cpuGroupSize'. This is important, for example, to ensure that all CPUs (i.e.
hyperthreads) from the same core are handed out together.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
As part of this, pull out all of the existing "TakeByTopology" tests and have
them be called by the original TestTakeByTopologyNUMAPacked() as well as the
new TestTakeByTopologyNUMADistributed() test. In a subsequent commit, we will
add some tests that should differ between these two algorithms.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
The first implements the original algorithm, which packs CPUs onto NUMA nodes
if more than one NUMA node is required to satisfy the allocation. The second
distributes CPUs across NUMA nodes if they can't all fit into one.
The "distributing" algorithm is currently a noop and just returns an error of
"unimplemented". A subsequent commit will add the logic to implement this
algorithm according to KEP 2902:
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2902-cpumanager-distribute-cpus-policy-option
Signed-off-by: Kevin Klues <kklues@nvidia.com>
This batch of tests adds a fake topology in which each NUMA node
has multiple sockets. We have not yet found a real HW topology like this
in the wild, but we need one to fully exercise the code.
So, until we find one, we add a fake topology by flipping
the NUMA/socket config of the existing dual Xeon Gold 6320.
Signed-off-by: Francesco Romani <fromani@redhat.com>
This batch of tests adds a real topology in which each physical socket
has multiple NUMA zones, taken from a real dual Xeon Gold 6320.
Signed-off-by: Francesco Romani <fromani@redhat.com>
The existing unit tests were performing subtests without
actually using the full features of the testing package
(https://pkg.go.dev/testing#hdr-Subtests_and_Sub_benchmarks).
Update them with fairly minimal changes. The patch is deceptively
large because we need to move the code inside a new block.
Signed-off-by: Francesco Romani <fromani@redhat.com>
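For reference, the shape the updated tests take (a generic illustration of t.Run, not the actual test bodies):

```go
import "testing"

func TestTakeByTopology(t *testing.T) {
	testCases := []struct {
		name    string
		request int
		want    int
	}{
		{"single CPU", 1, 1},
		{"full core", 2, 2},
	}
	for _, tc := range testCases {
		// Each case runs as a named subtest: failures are reported per case
		// and individual cases can be selected with `go test -run`.
		t.Run(tc.name, func(t *testing.T) {
			if tc.request != tc.want {
				t.Errorf("got %d, want %d", tc.request, tc.want)
			}
		})
	}
}
```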
Use `cmp.Diff` from the go-cmp package in the unit tests, moving away from
`reflect.DeepEqual`. This gives us a clearer picture of the differences
when the tests fail.
Signed-off-by: Francesco Romani <fromani@redhat.com>
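The pattern the tests move to looks roughly like this (a generic example, not the exact assertions):

```go
import (
	"testing"

	"github.com/google/go-cmp/cmp"
)

func assertEqual(t *testing.T, want, got interface{}) {
	t.Helper()
	// On mismatch, cmp.Diff prints a readable -want/+got diff of only the
	// differing fields, instead of reflect.DeepEqual's bare true/false.
	if diff := cmp.Diff(want, got); diff != "" {
		t.Errorf("unexpected result (-want +got):\n%s", diff)
	}
}
```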
This allows us to get rid of the check for determining which one is higher all
throughout the code. Now we just check once and instantiate an interface of the
appropriate type that makes sure the ordering in the hierarchy is preserved
through the appropriate calls.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
The feature gate gets locked to "true", with the goal of removing it in two
releases.
All code now can assume that the feature is enabled. Tests for "feature
disabled" are no longer needed and get removed.
Some code wasn't using the new helper functions yet. That gets changed while
touching those lines.
This implements the replacement of klog output to different files per level
with optionally splitting JSON output into two streams: one for info messages
on stdout, one for error messages on stderr. The info messages can get buffered
to increase performance. Because stdout and stderr might be merged by the
consumer, the info stream gets flushed before writing an error, to ensure that
the order of messages is preserved.
This also ensures that the following code pattern doesn't leak info messages:
```
klog.ErrorS(err, ...)
os.Exit(1)
```
Commands explicitly have to flush via logs.FlushLogs before exiting. Most
already do. But buffered info messages can still get lost during an unexpected
program termination, therefore buffering is off by default.
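A minimal sketch of the splitting idea (type and method names assumed; the real implementation is part of the JSON logging setup):

```go
import (
	"bufio"
	"io"
	"sync"
)

type splitStream struct {
	mu   sync.Mutex
	info *bufio.Writer // stdout; buffering here is what boosts performance
	err  io.Writer     // stderr; always unbuffered
}

func (s *splitStream) writeInfo(p []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.info.Write(p)
}

func (s *splitStream) writeError(p []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()
	// Flush buffered info first so message order survives even when the
	// consumer merges stdout and stderr back together.
	s.info.Flush()
	s.err.Write(p)
}
```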
The new options get added to the v1alpha1 LoggingConfiguration with new command
line flags. Because it is an alpha field, changing it inside the v1beta kubelet
config should be okay as long as the fields are clearly marked as alpha.
The name concatenation and ownership check were originally considered small
enough to not warrant dedicated functions, but the intent of the code is more
readable with them.
When adding the ephemeral volume feature, the special case for
PersistentVolumeClaim volume sources in kubelet's host path and node
limits checks was overlooked. An ephemeral volume source is another
way of referencing a claim and has to be treated the same way.
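A sketch of the rule with simplified types (the real check operates on the core API volume source): both sources resolve to a claim name and must be treated alike.

```go
type pvcSource struct{ ClaimName string }
type ephemeralSource struct{}

type volume struct {
	Name                  string
	PersistentVolumeClaim *pvcSource
	Ephemeral             *ephemeralSource
}

// claimName resolves the PVC a volume refers to; an ephemeral volume's
// generated claim follows the <pod name>-<volume name> naming convention.
func claimName(podName string, vol volume) (string, bool) {
	switch {
	case vol.PersistentVolumeClaim != nil:
		return vol.PersistentVolumeClaim.ClaimName, true
	case vol.Ephemeral != nil:
		return podName + "-" + vol.Name, true
	}
	return "", false
}
```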
* Use utilpointer to get a pointer
* Add tests for kubelet default configs
* Change copyright year from 2015 to 2021
* Run gofmt
* Add all negative and all positive test cases
We graduate the `CPUManagerPolicyOptions` feature to beta
in the 1.23 cycle, and we add new experimental feature gates
to guard new options which are planned in the 1.23 and in the
following cycles.
We introduce additional feature gates called `CPUManagerPolicyAlphaOptions`
and `CPUManagerPolicyBetaOptions`. The basic idea is to avoid the
cumbersome process of adding a feature gate for each option, and to have
feature gates which track the maturity level of _groups_ of options.
Besides this change, the graduation process, and the process in general,
for adding new policy options is still unchanged.
The `full-pcpus-only` option added in the 1.22 cycle is intentionally
moved into the beta policy options.
For more details:
- KEP: https://github.com/kubernetes/enhancements/pull/2933
- sig-arch discussion:
https://groups.google.com/u/1/g/kubernetes-sig-architecture/c/Nxsc7pfe5rw
Signed-off-by: Francesco Romani <fromani@redhat.com>
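With the group gates, the validation reduces to a per-maturity check, roughly like this sketch (option grouping and helper name assumed for illustration):

```go
import "fmt"

var (
	alphaOptions = map[string]bool{"distribute-cpus-across-numa": true}
	betaOptions  = map[string]bool{"full-pcpus-only": true}
)

// checkPolicyOption validates a requested option against the maturity-level
// feature gates, instead of requiring one dedicated gate per option.
func checkPolicyOption(name string, alphaGateEnabled, betaGateEnabled bool) error {
	switch {
	case alphaOptions[name] && !alphaGateEnabled:
		return fmt.Errorf("option %q requires the CPUManagerPolicyAlphaOptions feature gate", name)
	case betaOptions[name] && !betaGateEnabled:
		return fmt.Errorf("option %q requires the CPUManagerPolicyBetaOptions feature gate", name)
	case !alphaOptions[name] && !betaOptions[name]:
		return fmt.Errorf("unknown CPU Manager policy option %q", name)
	}
	return nil
}
```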
GetAllocatableDevices, which is needed to support the podresources
API, doesn't take the device health into account when computing
its output.
In this PR we address this gap and add unit tests along the way
to prevent regressions. This gives us good initial coverage;
E2E tests to cover this case are much harder to write, because
we would need to inject faults to trigger the unhealthy status.
We will evaluate adding these tests in later PRs.
Signed-off-by: Francesco Romani <fromani@redhat.com>
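The fix boils down to a health filter, sketched here with simplified types:

```go
const healthy = "Healthy" // health value reported by the device plugin API

type device struct {
	ID     string
	Health string
}

// allocatableDevices only counts devices currently reporting as healthy;
// unhealthy devices must not show up as allocatable via podresources.
func allocatableDevices(all map[string][]device) map[string][]string {
	out := make(map[string][]string)
	for resourceName, devs := range all {
		for _, d := range devs {
			if d.Health == healthy {
				out[resourceName] = append(out[resourceName], d.ID)
			}
		}
	}
	return out
}
```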
Remove the VolumeSubpath feature gate.
Feature gate convention has been updated since this was introduced to
indicate that they "are intended to be deprecated and removed after a
feature becomes GA or is dropped".
If a pod is killed (no longer wanted) and then a subsequent create/
add/update event is seen in the pod worker, assume that a pod UID
was reused (as it could be in static pods) and have the next
SyncKnownPods after the pod terminates remove the worker history so
that the config loop can restart the static pod, as well as return
to the caller the fact that this termination was not final.
The housekeeping loop then reconciles the desired state of the Kubelet
(pods in pod manager that are not in a terminal state, i.e. admitted
pods) with the pod worker by resubmitting those pods. This adds a
small amount of latency (2s) when a pod UID is reused and the pod
is terminated and restarted.
A pod that has been rejected by admission will have status manager
set the phase to Failed locally, which may take some time to
propagate to the apiserver. The rejected pod will be included in
admission until the apiserver propagates the change back, which
was an unintended regression when checking pod worker state as
authoritative.
A pod that is terminal in the API may still be consuming resources
on the system, so it should still be included in admission.
Prevent starting pods with resources satisfied by a single NUMA node on multiple NUMA nodes.
The code returned before it updated the minimum number of NUMA nodes that can satisfy the container
requests.
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
The configuration is deprecated and targets removal for v1.23. Test
cases have been changed as well.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Fixes two issues with how the pod worker refactor calculated the
pods that admission could see (GetActivePods() and
filterOutTerminatedPods()).
First, completed pods must be filtered from the "desired" state
for admission, which arguably should be happening earlier in
config. Exclude the two terminal pod states from GetActivePods().
Second, the previous check introduced with the pod worker lifecycle
ownership changes was subtly wrong for the admission use case.
Admission has to include pods that haven't yet hit the pod worker,
which CouldHaveRunningContainers was filtering out (because the
pod worker hasn't seen them). Introduce a weaker check -
IsPodKnownTerminated() - that returns true only if the pod is in
a known terminated state (no running containers AND known to pod
worker). This weaker check may only be called from components that
need admitted pods, not other kubelet subsystems.
This commit does not fix the long-standing bug that force-deleted
pods are omitted from admission checks, which must be fixed by
having GetActivePods() also include pods "still terminating".
If a device plugin returns a device without topology, keep it internally
as NUMA node -1; this helps at the podresources level to not export NUMA
topology. Otherwise the topology is exported with NUMA node id 0,
which is not accurate.
It's impossible to reveal this bug just by tracing json.Marshal(resp)
in the podresources client, because the NUMANodes field ID has the json
property omitempty; in this case, when ID=0 it is shown as an empty NUMANode.
To reproduce it, it is better to iterate over the devices and just
trace dev.Topology.Nodes[0].ID.
Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>
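A sketch of the convention with simplified types (the real code sits in the device manager):

```go
const nodeWithoutTopology = -1

type topologyNode struct{ ID int64 }
type topologyInfo struct{ Nodes []topologyNode }

// numaNodeOf keeps "no topology reported" distinguishable from "NUMA node 0":
// -1 tells the podresources layer to omit NUMA info for this device.
func numaNodeOf(topology *topologyInfo) int64 {
	if topology == nil || len(topology.Nodes) == 0 {
		return nodeWithoutTopology
	}
	return topology.Nodes[0].ID
}
```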
This is a knob added by runc 1.0.2 specifically for kubernetes,
which tells runc/libcontainer/cgroups/systemd v1 manager to not
freeze the cgroup in Set().
We set this knob here because this code is only used for pod (rather
than container) management, and in this place we create or
update the pod cgroup with no device limits set, so we can skip the
freeze.
If this knob is not set, libcontainer's cgroup v1 manager tries to
figure out whether the freeze is needed or not, but it's a somewhat
expensive check to perform, thus the knob is a shortcut.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
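A hedged sketch of where the knob sits (field placement on libcontainer's Resources struct assumed from the description of runc 1.0.2):

```go
import "github.com/opencontainers/runc/libcontainer/configs"

// A pod-level cgroup never carries device rules, so the v1 systemd manager's
// freeze/thaw cycle in Set() is never needed; the knob makes that explicit.
func podCgroupResources(memLimitBytes int64, cpuShares uint64) *configs.Resources {
	return &configs.Resources{
		Memory:          memLimitBytes,
		CpuShares:       cpuShares,
		SkipFreezeOnSet: true,
	}
}
```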
* Change uses of whitelist to allowlist in kubelet sysctl
* Rename whitelist files to allowlist in Kubelet sysctl
* Further renames of whitelist to allowlist in Kubelet
* Rename podsecuritypolicy uses of whitelist to allowlist
* Update pkg/kubelet/kubelet.go
Co-authored-by: Danielle <dani@builds.terrible.systems>
The Kubelet always clears reason and message in generateAPIPodStatus
even when the phase is unchanged. It is reasonable that we preserve
the previous values when the phase does not change, and clear them
when the phase does change.
When a pod is evicted, this ensures that the eviction message and
reason are propagated even in the face of subsequent updates. It also
preserves the message and reason if components beyond the Kubelet
choose to set that value.
To preserve the value we need to know the old phase, which requires
a change to convertStatusToAPIStatus so that both methods have
access to it.
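The preservation rule itself is small; a sketch with a simplified status type:

```go
type podStatus struct {
	Phase, Reason, Message string
}

// preserveReasonAndMessage keeps the old reason/message while the phase is
// unchanged, and lets a phase change clear or replace them.
func preserveReasonAndMessage(old podStatus, updated *podStatus) {
	if old.Phase != updated.Phase {
		return
	}
	if updated.Reason == "" {
		updated.Reason = old.Reason
	}
	if updated.Message == "" {
		updated.Message = old.Message
	}
}
```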
If a pod is already in terminated and the housekeeping loop sees an
out of date cache entry for a running container, the pod worker
should ignore that running pod termination request. Once the worker
completes, a subsequent housekeeping invocation will then invoke
terminating because the worker is no longer processing any pod
with that UID.
This does leave the possibility of syncTerminatedPod being blocked
if a container in the pod is started after killPod successfully
completes but before syncTerminatedPod can exit successfully,
perhaps because the terminated flow (detach volumes) is blocked on
that running container. A future change will address that issue.
Prevent Kubelet from incorrectly interpreting "not yet started" pods as "ready to terminate pods" by unifying responsibility for pod lifecycle into pod worker
Kubelet should mark the volume mount in the actual state of the world even if
volume expansion fails, so that the reconciler can tear down the volume when
needed. To avoid pods starting to use it, mark the volume as uncertain instead
of mounted.
Will add unit test after the logic is reviewed.
Change-Id: I5aebfa11ec93235a87af8f17bea7f7b1570b603d
Consume in the static policy the cpu manager policy options from
the cpumanager instance.
Validate in the none policy if any option is given, and fail if so -
this is almost surely a configuration mistake.
Add new cpumanager.Options type to hold the options and translate from
user arguments to flags.
Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
Introduce a new `admission` subpackage to factor out the responsibility
of creating `PodAdmitResult` objects. This enables resource managers
to report specific errors in Allocate() and to bubble them up
in the relevant fields of the `PodAdmitResult`.
To demonstrate the approach we refactor TopologyAffinityError as a
proper error.
Co-authored-by: Kevin Klues <kklues@nvidia.com>
Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
The CPUManagerPolicyOptions received from the kubelet config/command line args
is propagated to the Container Manager.
We defer the consumption of the options to a later patch(set).
Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
Files generated after running `make generated_files`.
Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
In this patch we enhance the kubelet configuration to support
cpuManagerPolicyOptions.
In order to introduce SMT-awareness in the CPU Manager, we introduce a
new Kubelet flag called `cpumanager-policy-options` to allow the user
to modify the behaviour of the static policy to strictly guarantee
allocation of whole cores.
Co-authored-by: Francesco Romani <fromani@redhat.com>
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
oomwatcher.NewWatcher returns "open /dev/kmsg: operation not permitted" error,
when running with sysctl value `kernel.dmesg_restrict=1`.
The error is negligible for KubeletInUserNamespace.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Errors during setting the following sysctl values are ignored:
- vm.overcommit_memory
- vm.panic_on_oom
- kernel.panic
- kernel.panic_on_oops
- kernel.keys.root_maxkeys
- kernel.keys.root_maxbytes
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Runtimes may return an arbitrary number of Pod IPs; however, Kubernetes
only takes into consideration the first one of each IP family.
The order of the IPs is the one defined by the Kubelet:
- by default, prefer IPv4
- if NodeIPs are defined, match the family of the first NodeIP
PodIP is always the first IP of PodIPs.
The downward API must expose the same IPs, in the same order, as the
pod.Status API object.
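A sketch of the selection rule (crude string-based family check, for illustration only):

```go
import "strings"

// orderPodIPs keeps the first IP of each family, led by the preferred family
// (IPv4 by default, or the family of the first node IP when NodeIPs are set);
// PodIP is then simply PodIPs[0].
func orderPodIPs(ips []string, preferIPv6 bool) []string {
	var v4, v6 string
	for _, ip := range ips {
		isV6 := strings.Contains(ip, ":")
		if isV6 && v6 == "" {
			v6 = ip
		} else if !isV6 && v4 == "" {
			v4 = ip
		}
	}
	ordered := []string{v4, v6}
	if preferIPv6 {
		ordered = []string{v6, v4}
	}
	out := make([]string, 0, 2)
	for _, ip := range ordered {
		if ip != "" {
			out = append(out, ip)
		}
	}
	return out
}
```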
A number of race conditions exist when pods are terminated early in
their lifecycle because components in the kubelet need to know "no
running containers" or "containers can't be started from now on" but
were relying on outdated state.
Only the pod worker knows whether containers are being started for
a given pod, which is required to know when a pod is "terminated"
(no running containers, none coming). Move that responsibility and
podKiller function into the pod workers, and have everything that
was killing the pod go into the UpdatePod loop. Split syncPod into
three phases - setup, terminate containers, and cleanup pod - and
have transitions between those methods be visible to other
components. After this change, to kill a pod you tell the pod worker
to UpdatePod({UpdateType: SyncPodKill, Pod: pod}).
Several places in the kubelet were incorrect about whether they
were handling terminating (should stop running, might have
containers) or terminated (no running containers) pods. The pod worker
exposes methods that allow other loops to know when to set up or tear
down resources based on the state of the pod - these methods remove
the possibility of race conditions by ensuring a single component is
responsible for knowing each pod's allowed state and other components
simply delegate to checking whether they are in the window by UID.
Removing containers no longer blocks final pod deletion in the
API server and is handled as background cleanup. Node shutdown
no longer marks pods as failed as they can be restarted in the
next step.
See https://docs.google.com/document/d/1Pic5TPntdJnYfIpBeZndDelM-AbS4FN9H2GTLFhoJ04/edit# for details
- provide tests for static policy allocation, when init containers
requested more memory than the app containers
- provide tests for static policy allocation, when init containers
requested less memory than the app containers
- provide tests to verify that init containers are removed from the state
file once the app container has started
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
Remove init containers from the state file once the app container has started;
this releases the memory allocated for the init containers and can increase
the density of containers on the NUMA node in cases when the memory allocated
for init containers is bigger than the memory allocated for app containers.
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
The idea is that during the allocation phase:
- during calls to `Allocate` and `GetTopologyHints`, we take into account the init containers' reusable memory,
which means that we re-use the memory and update container memory blocks accordingly.
For example, for a pod with two init containers that requested 1Gi and 2Gi,
and an app container that requested 4Gi, we can re-use 2Gi of memory.
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
This change updates the CSR API to add a new, optional field called
expirationSeconds. This field is a request to the signer for the
maximum duration the client wishes the cert to have. The signer is
free to ignore this request based on its own internal policy. The
signers built into KCM will honor this field if it is not set to a
value greater than --cluster-signing-duration. The minimum allowed
value for this field is 600 seconds (ten minutes).
This change will help enforce safer durations for certificates in
the Kube ecosystem and will help related projects such as
cert-manager with their migration to the Kube CSR API.
Future enhancements may update the Kubelet to take advantage of this
field when it is configured in a way that can tolerate shorter
certificate lifespans with regular rotation.
Signed-off-by: Monis Khan <mok@vmware.com>
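The field shape this describes, sketched against a trimmed-down spec (the real type lives in the certificates API group):

```go
type CertificateSigningRequestSpec struct {
	// ExpirationSeconds is the requested certificate duration. Signers may
	// ignore it based on their own policy; the KCM built-in signers honor it
	// when it does not exceed --cluster-signing-duration. Minimum: 600.
	ExpirationSeconds *int32 `json:"expirationSeconds,omitempty"`
	// ... remaining fields elided
}
```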
Update mounter interface in volume manager's ActualStateOfWorld every time.
Otherwise kubelet uses the first mounter it gets, which may not have the
latest information.
This fixes the setup of CSI volumes, which store information about SELinux
support in their `mounter` interface implementation. With each MountVolume()
retry, a new mounter is instantiated, and only the final mounter that succeeds
has the right info about whether the volume supports SELinux, so it can later
return the right attributes on the GetAttributes() call.
This adds the gate `SeccompDefault` as a new alpha feature. Seccomp path
and field fallbacks are now passed to the helper functions, and unit
tests covering those code paths have been added as well.
Besides enabling the feature gate, the feature has to be enabled by the
`SeccompDefault` kubelet configuration option or its corresponding
`--seccomp-default` CLI flag.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Apply suggestions from code review
Co-authored-by: Paulo Gomes <pjbgf@linux.com>
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
desiredStateOfWorldPopulator.findAndRemoveDeletedPods() should remove
volumes from DSW when a pod is deleted on the API server and the volume is
uncertain in ASW.
podVolumesExist() should consider also uncertain volumes (where kubelet
does not know if a volume was fully unmounted) when checking for pod's
volumes. Added GetPossiblyMountedVolumesForPod for that.
Adding uncertain mounts to GetMountedVolumesForPod would potentially break
other callers (e.g. `verifyVolumesMountedFunc`).
If the cpumanager feature gate is disabled, the corresponding field
of the containerManager will be nil.
A couple of functions don't check for this occurrence and happily
dereference the pointer unconditionally, leading to possible segfaults.
The relevant functions were introduced to support the podresources API,
so to trigger this segfault all of the following are needed:
- the cpumanager feature gate has to be disabled explicitly
- any podresources API must be called
Worth pointing out that when the new functions were introduced (around
Kubernetes 1.20) the default feature gate for cpumanager was already set
to true, hence this bug is expected to be triggered rarely.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Update dependencies and the test images to use pause 3.5. We also
provide a changelog entry for the new container image version.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Commit cc50aa9dfb introduced GetResourceStats, a method which collected
all the statistics from various cgroup controllers, only to discard all
of the info collected except a single value (memory usage).
While one may argue that this method can potentially be used from other
places, this did not happen since it was added 4+ years ago.
Let's streamline this code and only collect what we need, i.e. memory
usage. Rename the method accordingly.
While at it, fix pkg/kubelet/cm/cgroup_manager_unsupported.go to not
instantiate a new error every time a method is called.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This was added by commit a9772b2290.
In the current codebase, the cgroup being updated was created using
runc/opencontainers' manager.Apply(), which already does controllers
propagation, so there is no need to repeat that on every update.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
FeatureGate acts as a secondary switch to disable cloud-controller loops
in KCM, Kubelet and KAPI.
Provide comprehensive logging information to users, so they will be
guided in adopting an out-of-tree cloud provider implementation.
Volumes that are provisioned with `VolumeMode: Block` often have a
MetricsProvider interface declared in their type. However, the
MetricsProvider should implement a GetMetrics() function. In the cases
where the storage drivers do not implement GetMetrics(), a panic can
occur.
Usual type-assertions are not sufficient in this case. All assertions
assume the interface is present, and there is no straightforward way to
verify that a valid GetMetrics() function is provided.
By adding SupportsMetrics(), storage driver implementations require
careful reviewing for metrics support.
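The resulting calling convention, sketched with simplified types (method placement assumed for illustration):

```go
type Metrics struct {
	Used, Capacity int64
}

type MetricsProvider interface {
	GetMetrics() (*Metrics, error)
	// SupportsMetrics reports whether GetMetrics is actually implemented,
	// forcing drivers to state their metrics support explicitly.
	SupportsMetrics() bool
}

func volumeUsage(p MetricsProvider) (*Metrics, error) {
	if !p.SupportsMetrics() {
		return nil, nil // no metrics for this driver; never hit a panic path
	}
	return p.GetMetrics()
}
```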
This sets cgroup config via libcontainer to make sure we apply the
correct values to the systemd slices and scopes.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
runc rc95 contains a fix for CVE-2021-30465.
runc rc94 provides fixes and improvements.
One notable change is cgroup manager's Set now accept Resources rather
than Cgroup (see https://github.com/opencontainers/runc/pull/2906).
Modify the code accordingly.
Also update runc dependencies (as hinted by hack/lint-dependencies.sh):
github.com/cilium/ebpf v0.5.0
github.com/containerd/console v1.0.2
github.com/coreos/go-systemd/v22 v22.3.1
github.com/godbus/dbus/v5 v5.0.4
github.com/moby/sys/mountinfo v0.4.1
golang.org/x/sys v0.0.0-20210426230700-d19ff857e887
github.com/google/go-cmp v0.5.4
github.com/kr/pretty v0.2.1
github.com/opencontainers/runtime-spec v1.0.3-0.20210326190908-1c3f411f0417
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
dbus 5.0.4 adds StoreProperty method which needs to be implemented for
the mock.
Fixes the errors like
> pkg/kubelet/nodeshutdown/systemd/inhibit_linux_test.go:88:9: cannot use f.fakeDBusObject (variable of type *fakeDBusObject) as dbus.BusObject value in return statement: missing method StoreProperty
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Suppose there are two devices, dev1 and dev2, each with NUMA nodes associated as below:
dev1: numa1
dev2: numa1, numa2
and we request a device from numa2. Currently filterByAffinity() will return
[], [dev1, dev2], [] if the loop over available devices produces the sequence
[dev1, dev2]. That is not desirable, as what we truly expect is an allocation
of dev2 from numa2.
The cpuset.Parse function missed a couple of bad input cases, specifically
"1--3" and "10-6". These were silently ignored when they should instead
have been flagged as invalid.
This now catches these cases and expands the unit tests for cpuset to
cover them (and other negative test cases as well).
Signed-off-by: Jim Ramsay <jramsay@redhat.com>
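The newly rejected inputs make for compact negative test cases:

```go
import (
	"testing"

	"k8s.io/kubernetes/pkg/kubelet/cm/cpuset"
)

func TestParseInvalidRanges(t *testing.T) {
	// "1--3" has a malformed separator; "10-6" is a descending range.
	for _, s := range []string{"1--3", "10-6"} {
		if _, err := cpuset.Parse(s); err == nil {
			t.Errorf("Parse(%q) succeeded, want error", s)
		}
	}
}
```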
For simplicity and clarity, I think we can safely delete the
`delete(serviceEnv, envVar.Name)` call and the duplicate comments at
the makeEnvironmentVariables function of kubelet_pods.go:774-779:
1. `delete(serviceEnv, envVar.Name)` and the `if _, present := tmpEnv[k]; !present`
check of line 796 implement the same logic, namely merging the non-present keys
of serviceEnv into tmpEnv (see the sketch below).
2. The keys deleted from serviceEnv are guaranteed to be in tmpEnv, so
this doesn't affect mappingFunc.
3. The delete may miss some keys from container.EnvFrom.
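The surviving merge logic, extracted here as a standalone sketch:

```go
// mergeServiceEnv copies only the keys not already set: entries in tmpEnv
// always win, which is why the earlier delete() call was redundant.
func mergeServiceEnv(tmpEnv, serviceEnv map[string]string) {
	for k, v := range serviceEnv {
		if _, present := tmpEnv[k]; !present {
			tmpEnv[k] = v
		}
	}
}
```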
The method is used only for testing purposes. Given that the Resource data type
exposes all its fields, any invoker of ResourceList that is still
using the method outside of kubernetes/kubernetes can either
copy-paste the original implementation or implement a custom method
that converts resources into the proper Quantity data type.
Given that the hugepage resource is a scalar resource, it's sufficient
for the underlying code in fit_test.go to take any
extended resources into account. For predicate_test.go, the hugepage
resource does not play any role, as the General predicates test cases
do not set any scalar resource at all.
Additionally, by removing the ResourceList method, pkg/scheduler/framework
can get rid of the dependency on k8s.io/kubernetes/pkg/apis/core/v1/helper.
GetNode() is called in a lot of places including a hot loop in
fastStatusUpdateOnce. Having a poll in it is delaying
the kubelet /readyz status=200 report.
If a client is available, attempt to wait for the sync to happen
before starting the list watch for pods at the apiserver.
This implements stream cleanup when using port forwarding. Before
this patch, the streams []httpstream.Stream within
`spdy/connection.go` would fill up for each streaming request, which
could result in heavy memory usage. Now we use the stream identifier to
keep track of them and finally remove them again once they're no longer
needed.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Before the addition of GetAllocatableResources, the
podresources API had just one endpoint `List()`, thus we could just
account for the total of the calls to have a good pulse of the API usage.
Now that we extend the API with more endpoints
(`GetAllocatableResources`), in order to improve observability we add
per-endpoint counters, in addition to the existing counter of the total
API calls.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Add a feature gate to disable the GetAllocatableResources API.
The feature gate is in alpha stage, disabled by default.
Add an e2e test to demonstrate the behaviour with the feature gate disabled.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Add a test to reflect the correct behaviour according to
review comments.
Most notably, we should consider that, as the device plugin API
allows one to express, a device ID can have multiple "NUMA" node IDs
(example: AMD Rome).
More details:
https://github.com/kubernetes/kubernetes/pull/95734#discussion_r539545041
Signed-off-by: Francesco Romani <fromani@redhat.com>
From https://github.com/kubernetes/kubernetes/pull/96553
we are reminded that we need to handle the case in which
a device plugin reports a nil Topology, which is legal.
Add a unit test to ensure this case is handled.
Signed-off-by: Francesco Romani <fromani@redhat.com>
It's legal for device plugins to not expose topology information.
Previously, the code was just skipping these devices.
Review highlighted that it is better to report them anyway and let the
client application decide whether it still wants to track them somehow
or skip them entirely.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Add e2e tests for the new GetAllocatableResources API.
The tests are added in the `podresources_test` suite
created previously in this series.
Signed-off-by: Francesco Romani <fromani@redhat.com>
During the review, we agreed that the manager types
(CPUSet, ResourceDeviceInstances) should not cross the
containermanager API boundary; thus, the ContainerManager layer
is the correct place to do the type conversion.
We push the type conversions back from the podresources server
layer, fixing tests accordingly.
Signed-off-by: Francesco Romani <fromani@redhat.com>
We want to make the return type of the GetDevices() method of the
podresources DevicesProvider interface consistent with
the newly added GetAllocatableDevices type.
This makes the code easier to read and reduces the coupling between
the podresourcesapi server and the devicemanager code.
No intended changes in behaviour, but the different return types
now require some data massaging. Tests are updated accordingly.
Signed-off-by: Francesco Romani <fromani@redhat.com>
An upcoming patch wants to add GetAllocatableCPUs() returning a cpuset.
To make the code consistent and a bit more flexible, we change the
existing interface to also return a cpuset.
Signed-off-by: Francesco Romani <fromani@redhat.com>
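The harmonized provider shape described above, as a sketch (interface and method names assumed):

```go
import "k8s.io/kubernetes/pkg/kubelet/cm/cpuset"

// CPUsProvider returns cpuset.CPUSet from both queries, so the podresources
// layer converts types in exactly one place.
type CPUsProvider interface {
	GetCPUs(podUID, containerName string) cpuset.CPUSet
	GetAllocatableCPUs() cpuset.CPUSet
}
```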
As discussed during the alpha review, the ReadOnly field is not really
needed because volume mounts can also be read-only. It's a historical
oddity that can be avoided for generic ephemeral volumes as part
of the promotion to beta.
- libcontainer renamed
`github.com/opencontainers/runc/libcontainer/configs` to
`github.com/opencontainers/runc/libcontainer/devices` so use the new
references
- Update `dockershim` `ContainerCreate` call after docker update to
v20.10.2
- Change the feature gate from alpha to beta and enable it by default
- Update a few of the unit tests due to feature gate being enabled by
default
- Small refactor in `nodeshutdown_manager` which adds `featureEnabled`
function (which checks that feature gate and that
`kubeletConfig.ShutdownGracePeriod > 0`).
- Use `featureEnabled()` to exit early from shutdown manager in the case
that the feature is disabled
- Update kubelet config defaulting to be explicit that
`ShutdownGracePeriod` and `ShutdownGracePeriodCriticalPods` default to
zero and update the godoc comments.
- Update defaults and add featureGate tag in api config godoc.
With this feature now in beta and the feature gate enabled by default,
to enable graceful shutdown all that will be required is to configure
`ShutdownGracePeriod` and `ShutdownGracePeriodCriticalPods` in the
kubelet config. If not configured, they will be defaulted to zero, and
graceful shutdown will effectively be disabled.