where pod sandbox won't have HostProcess bit set if pod does not have a
security context but containers specify HostProcess.
Signed-off-by: Mark Rossetti <marosset@microsoft.com>
This resolves a couple of issues for CSI volume reconstruction.
1. IsLikelyNotMountPoint is known not to work for bind mounts and was
causing problems for subpaths and hostpath volumes.
2. Inline volumes were failing reconstruction due to calling
GetVolumeName, which only works when there is a PV spec.
v1.43.0 marked grpc.WithInsecure() deprecated so this commit moves to use
what is the recommended replacement:
grpc.WithTransportCredentials(insecure.NewCredentials())
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
The means by which we extract and parse the version of an API object is
not specific to etcd3. In order to allow for a generic suite of tests
against any storage.Interface imlpementation, we need this logic to live
outside of the etcd3 package, or import cycles will exist.
Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
This is the first step towards being able to support a new plugin API version
in parallel with the existing one.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
When parsing a resolv.conf file that has "search .", parseResolvConf should
accept the "." entry verbatim. Before this commit, parseResolvConf
unconditionally trimmed the "." suffix, which in the case of "." resulted
in a "" entry (that is, the empty string). This empty entry could lead
parseResolvConf to produce a resolv.conf file with "search ". Resolvers
could fail to parse such a resolv.conf file from parseResolvConf, thus
breaking DNS resolution in pods. After this commit, parseResolvConf
accepts a resolv.conf file with "search ." and passes the "." entry through
verbatim to produce a valid resolv.conf file. The "." suffix is still
trimmed for any entry that does not solely comprise ".".
Follow-up to commit a215a88d91.
* pkg/kubelet/network/dns/dns.go (parseResolvConf): Handle a "." entry in
the search path by copying it verbatim.
* pkg/kubelet/network/dns/dns_test.go (TestParseResolvConf): Add a test
case for "search .".
* Add FeatureGate PodHostIPs
* Add HostIPs field and update PodIPs field
* Types conversion
* Add dropDisabledStatusFields
* Add HostIPs for kubelet
* Add fuzzer for PodStatus
* Add status.hostIPs in ConvertDownwardAPIFieldLabel
* Add status.hostIPs in validEnvDownwardAPIFieldPathExpressions
* Downward API support for status.hostIPs
* Add DownwardAPI validation for status.hostIPs
* Add e2e to check that hostIPs works
* Add e2e to check that Downward API works
* Regenerate
The changes (mostly in pkg/kubelet/cm) are there to adopt changed
runc 1.1 API, and simplify things a bit. In particular:
1. simplify cgroup manager instantiation, using a new, easier way of
libcontainers/cgroups/manager.New;
2. replace libcontainerAdapter with a boolean variable (all it did
was passing on whether systemd manager should be used);
3. trivial change due to removed cgroupfs.HugePageSizes and added
cgroups.HugePageSizes();
4. do not calculate cgroup paths in update / destroy, since libcontainer
cgroup managers now calculate the paths upon creation (previously,
they were doing that only in Apply, so using e.g. Set or Destroy right
after creation was impossible without specifying paths).
We currently still calculate cgroup paths in Exists -- this is to be
addressed separately.
Co-Authored-By: Elana Hashman <ehashman@redhat.com>
Components that run in a container but modify the host network
namespace iptables rules need to know whether the system is using
iptables-legacy or iptables-nft. Given that kubelet will run before
any container-based components, it is well-positioned to help them
figure this out. So create a chain with a well-known name that they
can look for.
Some of these changes are cosmetic (repeatedly calling klog.V instead of
reusing the result), others address real issues:
- Logging a message only above a certain verbosity threshold without
recording that verbosity level (if klog.V().Enabled() { klog.Info... }):
this matters when using a logging backend which records the verbosity
level.
- Passing a format string with parameters to a logging function that
doesn't do string formatting.
All of these locations where found by the enhanced logcheck tool from
https://github.com/kubernetes/klog/pull/297.
In some cases it reports false positives, but those can be suppressed with
source code comments.
When using a legacy cloud provider, if kubelet is passed a node address
in --node-ip it will use this address in preference out the the
addresses by the cloud provider.
When using an external cloud provider, kubelet will annotate the Node
with the first --node-ip for use by the cloud provider. The cloud
provider validates this annotation but does not otherwise use it,
meaning that --node-ip has no effect.
This change moves the node address filtering code from kubelet to
component-helpers and updates both kubelet and cloud-provider to use it.
There is no functional change to kubelet, but cloud-provider now honours
kubelet's --node-ip.
Create an E2E test that creates a job that spawns a pod that should
succeed. The job reserves a fixed amount of CPU and has a large number
of completions and parallelism. Use to repro github.com/kubernetes/kubernetes/issues/106884
Signed-off-by: David Porter <david@porter.me>
Other components must know when the Kubelet has released critical
resources for terminal pods. Do not set the phase in the apiserver
to terminal until all containers are stopped and cannot restart.
As a consequence of this change, the Kubelet must explicitly transition
a terminal pod to the terminating state in the pod worker which is
handled by returning a new isTerminal boolean from syncPod.
Finally, if a pod with init containers hasn't been initialized yet,
don't default container statuses or not yet attempted init containers
to the unknown failure state.
Previously, callers of `Exists()` would not know why the cGroup was or
was not existing. In one call-site in particular, the `kubelet` would
entirely fail to start if the cGroup validation did not succeed. In
these cases we MUST explain what went wrong and pass that information
clearly to the caller. Previously, some but not all of the reasons for
invalidation were logged at a low log-level instead. This led to poor
UX.
The original method was retained on the interface so as to make this
diff small.
Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
Instead of doing (almost) the same thing from the three different
methods (Create, Update, Destroy), move the functionality to
libctCgroupConfig, replacing updateSystemdCgroupInfo.
The needResources bool is needed because we do not need resources
during Destroy, so we skip the unneeded resource conversion.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 79be8be10e made hugetlb settings optional if cgroup v2 is used and
hugetlb is not available, fixing issue 92933. Note at that time this was only
needed for v2, because for v1 the resources were set one-by-one, and only for
supported resources.
Commit d312ef7eb6 switched the code to using Set from runc/libcontainer
cgroups manager, and expanded the check to cgroup v1 as well.
Move this check earlier, to inside m.toResources, so instead of
converting all hugetlb resources from ResourceConfig to libcontainers's
Resources.HugetlbLimit, and then setting it to nil, we can skip the
conversion entirely if hugetlb is not supported, thus not doing the work
that is not needed.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit ecd6361f added setting PidsLimit to Create and Update.
Commit bce9d5f2 added setting PidsLimit to m.toResources.
Now, PidsLimit is assigned twice.
Remove the duplicate.
Fixes: bce9d5f2
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
There's no need to call m.Update (which will create another instance of
libcontainer cgroup manager, convert all the resources and then set
them). All this is already done here, except for Set().
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
For the 'single-numa' and 'restricted' TopologyManager policies, pods are only
admitted if all of their containers have perfect alignment across the set of
resources they are requesting. The best-effort policy, on the other hand, will
prefer allocations that have perfect alignment, but fall back to a non-preferred
alignment if perfect alignment can't be achieved.
The existing algorithm of how to choose the best hint from the set of
"non-preferred" hints is fairly naive and often results in choosing a
sub-optimal hint. It works fine in cases where all resources would end up
coming from a single NUMA node (even if its not the same NUMA nodes), but
breaks down as soon as multiple NUMA nodes are required for the "best"
alignment. We will never be able to achieve perfect alignment with these
non-preferred hints, but we should try and do something more intelligent than
simply choosing the hint with the narrowest mask.
In an ideal world, we would have the TopologyManager return a set of
"resources-relative" hints (as opposed to a common hint for all resources as is
done today). Each resource-relative hint would indicate how many other
resources could be aligned to it on a given NUMA node, and a hint provider
would use this information to allocate its resources in the most aligned way
possible. There are likely some edge cases to consider here, but such an
algorithm would allow us to do partial-perfect-alignment of "some" resources,
even if all resources could not be perfectly aligned.
Unfortunately, supporting something like this would require a major redesign to
how the TopologyManager interacts with its hint providers (as well as how those
hint providers make decisions based on the hints they get back).
That said, we can still do better than the naive algorithm we have today, and
this patch provides a mechanism to do so.
We start by looking at the set of hints passed into the TopologyManager for
each resource and generate a list of the minimum number of NUMA nodes required
to satisfy an allocation for a given resource. Each entry in this list then
contains the 'minNUMAAffinity.Count()' for a given resources. Once we have this
list, we find the *maximum* 'minNUMAAffinity.Count()' from the list and mark
that as the 'bestNonPreferredAffinityCount' that we would like to have
associated with whatever "bestHint" we ultimately generate. The intuition being
that we would like to (at the very least) get alignment for those resources
that *require* multiple NUMA nodes to satisfy their allocation. If we can't
quite get there, then we should try to come as close to it as possible.
Once we have this 'bestNonPreferredAffinityCount', the algorithm proceeds as
follows:
If the mergedHint and bestHint are both non-preferred, then try and find a hint
whose affinity count is as close to (but not higher than) the
bestNonPreferredAffinityCount as possible. To do this we need to consider the
following cases and react accordingly:
1. bestHint.NUMANodeAffinity.Count() > bestNonPreferredAffinityCount
2. bestHint.NUMANodeAffinity.Count() == bestNonPreferredAffinityCount
3. bestHint.NUMANodeAffinity.Count() < bestNonPreferredAffinityCount
For case (1), the current bestHint is larger than the
bestNonPreferredAffinityCount, so updating to any narrower mergeHint is
preferred over staying where we are.
For case (2), the current bestHint is equal to the
bestNonPreferredAffinityCount, so we would like to stick with what we have
*unless* the current mergedHint is also equal to bestNonPreferredAffinityCount
and it is narrower.
For case (3), the current bestHint is less than bestNonPreferredAffinityCount,
so we would like to creep back up to bestNonPreferredAffinityCount as close as
we can. There are three cases to consider here:
3a. mergedHint.NUMANodeAffinity.Count() > bestNonPreferredAffinityCount
3b. mergedHint.NUMANodeAffinity.Count() == bestNonPreferredAffinityCount
3c. mergedHint.NUMANodeAffinity.Count() < bestNonPreferredAffinityCount
For case (3a), we just want to stick with the current bestHint because choosing
a new hint that is greater than bestNonPreferredAffinityCount would be
counter-productive.
For case (3b), we want to immediately update bestHint to the current
mergedHint, making it now equal to bestNonPreferredAffinityCount.
For case (3c), we know that *both* the current bestHint and the current
mergedHint are less than bestNonPreferredAffinityCount, so we want to choose
one that brings us back up as close to bestNonPreferredAffinityCount as
possible. There are three cases to consider here:
3ca. mergedHint.NUMANodeAffinity.Count() > bestHint.NUMANodeAffinity.Count()
3cb. mergedHint.NUMANodeAffinity.Count() < bestHint.NUMANodeAffinity.Count()
3cc. mergedHint.NUMANodeAffinity.Count() == bestHint.NUMANodeAffinity.Count()
For case (3ca), we want to immediately update bestHint to mergedHint because
that will bring us closer to the (higher) value of
bestNonPreferredAffinityCount.
For case (3cb), we want to stick with the current bestHint because choosing the
current mergedHint would strictly move us further away from the
bestNonPreferredAffinityCount.
Finally, for case (3cc), we know that the current bestHint and the current
mergedHint are equal, so we simply choose the narrower of the 2.
This patch implements this algorithm for the case where we must choose from a
set of non-preferred hints and provides a set of unit-tests to verify its
correctness.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
The field in fact says that the container runtime should relabel a volume
when running a container with it, it does not say that the volume supports
SELinux. For example, NFS can support SELinux, but we don't want NFS
volumes relabeled, because they can be shared among several Pods.
this commit updates checkEphemeralStorage to be able to add container log stats, if applicable.
It also updates the old check when container log stats aren't found to be more accurate.
Specifically, this check previously worked because of a fluke programming accident:
according to this block in pkg/kubelet/stats/helper.go:113
```
if result.Rootfs != nil {
rootfsUsage := *cfs.BaseUsageBytes
result.Rootfs.UsedBytes = &rootfsUsage
}
```
BaseUsageBytes should be the value added, not TotalUsageBytes. However, since in this case
one also needs to account for the calculated log size, which is TotalUsageBytes - BaseUsageBytes
using TotalUsageBytes value accidentally worked.
Updating the case to use the correct value AND log offset fixes this accident and makes
the behavior more in line with what happens when calculating ephemeral storage.
Signed-off-by: Peter Hunt <pehunt@redhat.com>
in https://github.com/kubernetes/kubernetes/pull/74441,
the namespace and name were added to the pod log location.
However, cAdvisor stats provider wasn't correspondingly updated.
since CRI-O uses cAdvisor stats provider by default, despite being a CRI implementation,
eviction with ephemeral storage and container logs doesn't work as expected, until now!
Signed-off-by: Peter Hunt <pehunt@redhat.com>
The package says:
> the libcontainer SELinux package is only built for Linux, so it is
> necessary to have a NOP wrapper which is built for non-Linux platforms
This is not true, Kubernetes now imports
github.com/opencontainers/selinux/go-selinux and it has proper
multiplatform support (i.e. NOOP on non-Linux platforms).
Removing the whole package and calling go-selinux directly.
Before this fix, hint permutations such as:
permutation: [{11 true} {0101 true}]
Could result in merged hints of:
mergedHint: {01 true}
This was possible because both hints in the permutation container a "preferred"
allocation (i.e. the full set of NUMA nodes set in the affinity bitmask are
*required* to satisfy the allocation). With this in place, the simplified logic
we had simply kept the merged hint as preferred as well.
However, what we really want is to ensure that the merged hint is only
preferred if *true* alignment of all resources is possible (i.e. if all hints
in the permutation are preferred AND their affinities are exactly equal).
The only exception to this is if *no* topology information is provided by a
given hint provider. In this case, we assume alignment doesn't matter and only
consider the resources that actually have hints provided for them.
This changes the semantics of permutations of the form:
permutation: [{111 true} {011 true}]
To now result in the merged hint of:
mergedHint: {011 false}
Instead of:
mergedHint: {011 true}
This is arguably how it should always have been though (because a hint should
not be preferred if true alignment isn't possible), and two tests have had to
change to accomodate these new semantics.
This commit changes the merge function to implement the updated logic, adds a
test to verify it is functioning correctly, and updates the two tests mentioned
above to adjust to the new semantics.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
The remote runtime implementation now supports the `verbose` fields,
which are required for consumers like cri-tools to enable multi CRI
version support.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
TestStart was previously flaky. In approx 100_000 local runs, it failed
about 70% of the time, and has been mentioned as a flaky unit test in
the past.
This flake was due to a race condition with the logic as written and the
go scheduler. UpdateThreshold calls `notifier.Start(events)` in a new Go
Routine, which is not guarunteed to be called immediately.
This meant that if `m.Start()` was called before `notifier.Start()`, the
test would fail, as the notifier would not have been started before the
4 events were processed and lock released.
Here, we update the test to more closely match the intended application
behaviour, and have events passed to the channel when `Start` is called
on the notifier.
This ensures that -Start gets called and additionally validates
that the correct channel is provided to the notifier.
Stop was never called previously, as it only gets called on a subsequent
call to UpdateThreshold. `AnyTimes()` hid that this did not occur.
cAdvisor has code to expose OOM metrics since 0.40.0, but this was not
included in Kubelet so far. This commit enables it.
Signed-off-by: Jorik Jonker <jorik.jonker@eu.equinix.com>
- Allow a podWorker to start if it is blocked by a pod that has been
terminated before starting
- When a pod can't start AND has already been terminated, exit cleanly
- Add a unit test that exercises race conditions in pod workers
If CRI returns a container that has been created but is not running,
it is not safe to assume it is terminal, as our connection to CRI
may have failed. Instead, created is treated as waiting, as in
"waiting for this container to start". Either syncPod or
syncTerminatingPod is responsible for handling this state.
host-network pods IPs are obtained from the reported kubelet nodeIPs.
Historically, host-network podIPs are immutable once set, but when
we've added dual-stack support, we didn't consider that the secondary
IP address may not be present at the same time that the primary nodeIP.
If a secondary IP address is added to a node after the host-network pods
IPs are set, we can add the secondary host-network pod IP address
maintaining the current behavior of not updating the current podIPs on
host-network pods.
In the following code pattern, the log message will get logged with v=0 in JSON
output although conceptually it has a higher verbosity:
if klog.V(5).Enabled() {
klog.Info("hello world")
}
Having the actual verbosity in the JSON output is relevant, for example for
filtering out only the important info messages. The solution is to use
klog.V(5).Info or something similar.
Whether the outer if is necessary at all depends on how complex the parameters
are. The return value of klog.V can be captured in a variable and be used
multiple times to avoid the overhead for that function call and to avoid
repeating the verbosity level.