The changes (mostly in pkg/kubelet/cm) are there to adopt changed
runc 1.1 API, and simplify things a bit. In particular:
1. simplify cgroup manager instantiation, using a new, easier way of
libcontainers/cgroups/manager.New;
2. replace libcontainerAdapter with a boolean variable (all it did
was passing on whether systemd manager should be used);
3. trivial change due to removed cgroupfs.HugePageSizes and added
cgroups.HugePageSizes();
4. do not calculate cgroup paths in update / destroy, since libcontainer
cgroup managers now calculate the paths upon creation (previously,
they were doing that only in Apply, so using e.g. Set or Destroy right
after creation was impossible without specifying paths).
We currently still calculate cgroup paths in Exists -- this is to be
addressed separately.
Co-Authored-By: Elana Hashman <ehashman@redhat.com>
Previously, callers of `Exists()` would not know why the cGroup was or
was not existing. In one call-site in particular, the `kubelet` would
entirely fail to start if the cGroup validation did not succeed. In
these cases we MUST explain what went wrong and pass that information
clearly to the caller. Previously, some but not all of the reasons for
invalidation were logged at a low log-level instead. This led to poor
UX.
The original method was retained on the interface so as to make this
diff small.
Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
Instead of doing (almost) the same thing from the three different
methods (Create, Update, Destroy), move the functionality to
libctCgroupConfig, replacing updateSystemdCgroupInfo.
The needResources bool is needed because we do not need resources
during Destroy, so we skip the unneeded resource conversion.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 79be8be10e made hugetlb settings optional if cgroup v2 is used and
hugetlb is not available, fixing issue 92933. Note at that time this was only
needed for v2, because for v1 the resources were set one-by-one, and only for
supported resources.
Commit d312ef7eb6 switched the code to using Set from runc/libcontainer
cgroups manager, and expanded the check to cgroup v1 as well.
Move this check earlier, to inside m.toResources, so instead of
converting all hugetlb resources from ResourceConfig to libcontainers's
Resources.HugetlbLimit, and then setting it to nil, we can skip the
conversion entirely if hugetlb is not supported, thus not doing the work
that is not needed.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit ecd6361f added setting PidsLimit to Create and Update.
Commit bce9d5f2 added setting PidsLimit to m.toResources.
Now, PidsLimit is assigned twice.
Remove the duplicate.
Fixes: bce9d5f2
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
There's no need to call m.Update (which will create another instance of
libcontainer cgroup manager, convert all the resources and then set
them). All this is already done here, except for Set().
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is a knob added by runc 1.0.2 specifically for kubernetes,
which tells runc/libcontainer/cgroups/systemd v1 manager to not
freeze the cgroup in Set().
We set this knob here because this code is only used for pods
(rather than containers) management, and in this place we create or
update the pod cgroup with no device limits set, so we can skip the
freeze.
If this knob is not set, libcontainer's cgroup v1 manager tries to
figure out whether the freeze is needed or not, but it's a somewhat
expensive check to perform, thus the knob is a shortcut.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
* Change uses of whitelist to allowlist in kubelet sysctl
* Rename whitelist files to allowlist in Kubelet sysctl
* Further renames of whitelist to allowlist in Kubelet
* Rename podsecuritypolicy uses of whitelist to allowlist
* Update pkg/kubelet/kubelet.go
Co-authored-by: Danielle <dani@builds.terrible.systems>
Co-authored-by: Danielle <dani@builds.terrible.systems>
Commit cc50aa9dfb introduced GetResourceStats, a method which collected
all the statistics from various cgroup controllers, only to discard all
of the info collected except a single value (memory usage).
While one may argue that this method can potentially be used from other
places, this did not happen since it was added 4+ years ago.
Let's streamline this code and only collect what we need, i.e. memory
usage. Rename the method accordingly.
While at it, fix pkg/kubelet/cm/cgroup_manager_unsupported.go to not
instantiate a new error every time a method is called.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This was added by commit a9772b2290.
In the current codebase, the cgroup being updated was created using
runc/opencontainers' manager.Apply(), which already does controllers
propagation, so there is no need to repeat that on every update.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This sets cgroup config via libcontainer to make sure we apply the
correct values to the systemd slices and scopes.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
runc rc95 contains a fix for CVE-2021-30465.
runc rc94 provides fixes and improvements.
One notable change is cgroup manager's Set now accept Resources rather
than Cgroup (see https://github.com/opencontainers/runc/pull/2906).
Modify the code accordingly.
Also update runc dependencies (as hinted by hack/lint-depdendencies.sh):
github.com/cilium/ebpf v0.5.0
github.com/containerd/console v1.0.2
github.com/coreos/go-systemd/v22 v22.3.1
github.com/godbus/dbus/v5 v5.0.4
github.com/moby/sys/mountinfo v0.4.1
golang.org/x/sys v0.0.0-20210426230700-d19ff857e887
github.com/google/go-cmp v0.5.4
github.com/kr/pretty v0.2.1
github.com/opencontainers/runtime-spec v1.0.3-0.20210326190908-1c3f411f0417
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
One notable change is cgroup manager's Set now accept Resources rather
than Cgroup (see https://github.com/opencontainers/runc/pull/2906).
Modify the code accordingly.
Also update runc dependencies (as hinted by hack/lint-depdendencies.sh):
github.com/cilium/ebpf v0.5.0
github.com/containerd/console v1.0.2
github.com/coreos/go-systemd/v22 v22.3.1
github.com/godbus/dbus/v5 v5.0.4
github.com/moby/sys/mountinfo v0.4.1
golang.org/x/sys v0.0.0-20210426230700-d19ff857e887
github.com/google/go-cmp v0.5.4
github.com/kr/pretty v0.2.1
github.com/opencontainers/runtime-spec v1.0.3-0.20210326190908-1c3f411f0417
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
- libcontainer renamed
`github.com/opencontainers/runc/libcontainer/configs` to
`github.com/opencontainers/runc/libcontainer/devices` so use the new
references
- Update `dockershim` `ContainerCreate` call after docker update to
v20.10.2
This fixes issues where kubelet enforces qos and nodeAllocatable on the
worng hierarchy. Kublet will now create the files
/sys/fs/cgroup/kubepods/{burstable,besteffort,}/pod-xyz
when running with systemd as the driver, making it impossible to enforce
the limits on nodeAllocatable.
use the new libcontainer feature of skipping setting the devices
cgroup. This is necessary on cgroup v2 to avoid leaking a eBPF
program every time the cgroup is re-configured.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
do a conversion from the cgroups v1 limits to cgroups v2.
e.g. cpu.shares on cgroups v1 has a range of [2-262144] while the
equivalent on cgroups v2 is cpu.weight that uses a range [1-10000].
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Unit test for updating container hugepage limit
Add warning message about ignoring case.
Update error handling about hugepage size requirements
Signed-off-by: sewon.oh <sewon.oh@samsung.com>
cause by kubelet startup be interrupted on setting list of cgroups
In the 'cgroupManagerImpl.Exists' not check&recreate the hugetlb cgroup dir. Then setting the limits in non-exist cgroup dir will cause kubelet start failed.
Signed-off-by: bingshen.wbs <bingshen.wbs@alibaba-inc.com>
Use the exported list from runc that uses "KB" and not "kB".
This issue breaks kubelet on AArch64 (arm 64).
var HugePageSizeUnitList = []string{"B", "KB", "MB", "GB", "TB", "PB"}
The hugetlb cgroup control files (introduced here in 2012:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=abb8206cb0773)
use "KB" and not "kB"
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/hugetlb_cgroup.c?h=v5.0#n349).
The behavior in the kernel has not changed since the introduction, and
the current code using "kB" will therefore fail on devices with huge
pages smaller than 1MiB. This is the case for AArch64.
As seen from the code in "mem_fmt" inside hugetlb_cgroup.c, only "KB",
"MB" and "GB" are used, so the others may be removed as well.
Here is a real world example of the files inside the
"/sys/kernel/mm/hugepages/" directory:
- "hugepages-64kB"
- "hugepages-2048kB"
- "hugepages-32768kB"
- "hugepages-1048576kB"
And the corresponding cgroup files:
- "hugetlb.64KB._____"
- "hugetlb.2MB._____"
- "hugetlb.32MB._____"
- "hugetlb.1GB._____"
Signed-off-by: Odin Ugedal <odin@ugedal.com>
- Move from the old github.com/golang/glog to k8s.io/klog
- klog as explicit InitFlags() so we add them as necessary
- we update the other repositories that we vendor that made a similar
change from glog to klog
* github.com/kubernetes/repo-infra
* k8s.io/gengo/
* k8s.io/kube-openapi/
* github.com/google/cadvisor
- Entirely remove all references to glog
- Fix some tests by explicit InitFlags in their init() methods
Change-Id: I92db545ff36fcec83afe98f550c9e630098b3135