* Cleanup FeatureGate skippers
* Perform changes requested by review
* some more review related changes
* Rename skipper functions to make code more readable
* add utilfeature back in
Each e2e test knows it wants to restart a running kubelet or a
non-running kubelet. The vast majority of times, we want to
restart a running kubelet (e.g. to change config or to check
some properties hold across kubelet crashes/restarts), but sometimes
we stop the kubelet, do some actions and only then restart.
To accomodate both use cases, we just expose the `running` boolean
flag to the e2e tests.
Having the `restartKubelet` explicitly restarting a running kubelet
helps us to trobuleshoot e2e failures on which the kubelet
was supposed to be running, while it was not; attempting a restart
in such cases only murkied the waters further, making the
troubleshooting and the eventual fix harder.
In the happy path, no expected change in behaviour.
Signed-off-by: Francesco Romani <fromani@redhat.com>
In the `restartKubelet` helper, we use `exec.Command`, whose
return value is the output as the command, but as `[]byte`.
The way we logged the output of the command was as value, making
the output, meant to be human readable, unnecessarily hard to read.
We fix this annoying behaviour converting the output to string before
to log it out, making pretty obvious to understand the outcome of
the command.
Signed-off-by: Francesco Romani <fromani@redhat.com>
This patch changes cpuCount to cpuRequest in order to cater to cases
where guaranteed pods make non-integral CPU Requests.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
This commit fixes the LocalStorageCapacityIsolationEviction test by
acknowledging that in its default configuration kubelet will no-longer
evict memory-backed volume pods as they cannot use more than their
assigned limit with SizeMemoryBackedVolumes enabled.
To account for the old behaviour, we also add a test that explicitly
disables the feature to test the behaviour of memory backed local
volumes in those scenarios. That test can be removed when/if the feature
gate is removed.
Currently the storage eviction tests fail for a few reasons:
- They re-enter storage exhaustion after pulling the images during
cleanup (increasing test storage reqs, and adding verification for
future diagnosis)
- They were timing out, as in practice it seems that eviction takes just
over 10 minutes on an n1-standard in many cases. I'm raising these to
15 to provide some padding.
This should ideally bring these tests to passing on CI, as they've now
passed locally for me several times with the remote GCE env.
Follow up work involves diagnosing why these take so long, and
restructuring them to be less finicky.
It wasn't documented that InitLogs already uses the log flush frequency, so
some commands have called it before parsing (for example, kubectl in the
original code for logs.go). The flag never had an effect in such commands.
Fixing this turned into a major refactoring of how commands set up flags and
run their Cobra command:
- component-base/logs: implicitely registering flags during package init is an
anti-pattern that makes it impossible to use the package in commands which
want full control over their command line. Logging flags must be added
explicitly now, something that the new cli.Run does automatically.
- component-base/logs: AddFlags would have crashed in kubectl-convert if it
had been called because it relied on the global pflag.CommandLine. This
has been fixed and kubectl-convert now has the same --log-flush-frequency
flag as other commands.
- component-base/logs/testinit: an exception are tests where flag.CommandLine has
to be used. This new package can be imported to add flags to that
once per test program.
- Normalization of the klog command line flags was inconsistent. Some commands
unintentionally didn't normalize to the recommended format with hyphens. This
gets fixed for sample programs, but not for production programs because
it would be a breaking change.
This refactoring has the following user-visible effects:
- The validation error for `go run ./cmd/kube-apiserver --logging-format=json
--add-dir-header` now references `add-dir-header` instead of `add_dir_header`.
- `staging/src/k8s.io/cloud-provider/sample` uses flags with hyphen instead of
underscore.
- `--log-flush-frequency` is not listed anymore in the --logging-format flag's
`non-default formats don't honor these flags` usage text because it will also
work for non-default formats once it is needed.
- `cmd/kubelet`: the description of `--logging-format` uses hyphens instead of
underscores for the flags, which now matches what the command is using.
- `staging/src/k8s.io/component-base/logs/example/cmd`: added logging flags.
- `apiextensions-apiserver` no longer prints a useless stack trace for `main`
when command line parsing raises an error.
We graduate the `CPUManagerPolicyOptions` feature to beta
in the 1.23 cycle, and we add new experimental feature gates
to guard new options which are planned in the 1.23 and in the
following cycles.
We introduce additional feature gate called `CPUManagerPolicyAlphaOptions` and
`CPUManagerPolicyBetaOptions`. The basic idea is to avoid the
cumbersome process of adding a feature gate for each option, and to have
feature gates which track the maturity level of _groups_ of options.
Besides this change, the graduation process, and the process in general,
for adding new policy options is still unchanged.
The `full-pcpus-only` option added in the 1.22 cycle is intentionally
moved into the beta policy options
For more details:
- KEP: https://github.com/kubernetes/enhancements/pull/2933
- sig-arch discussion:
https://groups.google.com/u/1/g/kubernetes-sig-architecture/c/Nxsc7pfe5rw
Signed-off-by: Francesco Romani <fromani@redhat.com>
The Topology Manager e2e tests wants to run on real multi-NUMA system
and want to consume real devices supported by device plugins; SRIOV
devices happen to be the most commonly available of such devices.
CI machines aren't multi NUMA nor expose SRIOV devices, so the biggest portion
of the tests will just skip, and we need to keep it like this until we
figure out how to enable these features.
However, some organizations can and want to run the testsuite on bare metal;
in this case, the current test will skip (not fail) with misconfigured
boxes, and this reports a misleading result. It will be much better to
fail if the test preconditions aren't met.
To satisfy both needs, we add an option, controlled by an environment
variable, to fail (not skip) if the machine on which the test run
doesn't meet the expectations (multi-NUMA, 4+ cores per NUMA cell,
expose SRIOV VFs).
We keep the old behaviour as default to keep being CI friendly.
Signed-off-by: Francesco Romani <fromani@redhat.com>
In older versions of Kubernetes (at least pre-0.19, it's the earliest
this test will run unmodified on), Pods that depended on devices could be
restarted after the device plugin had been removed. Currently however,
this isn't possible, as during ContainerManager.GetResources(), we
attempt to DeviceManager.GetDeviceRunContainerOptions() which fails as
there's no cached endpoint information for the plugin type.
This commit therefore breaks apart the existing test into two:
- One active test that validates that assignments are maintained across
restarts
- One skipped test that validates the behaviour after GPUs have been
removed, in case we decide that this is a bug that should be fixed in
the future.