Commit Graph

2067 Commits

Author SHA1 Message Date
Kubernetes Prow Robot
fe62fcc9b4 Merge pull request #105516 from fromanirh/e2e-kubelet-restart-improvements
e2e: node: kubelet restart improvements
2021-10-14 17:58:54 -07:00
Kubernetes Prow Robot
548e37278c Merge pull request #105313 from giuseppe/fix-flaky-pagefaults-summary-test
test, cgroupv2: adjust pagefaults test
2021-10-14 00:13:30 -07:00
Kubernetes Prow Robot
894ceb63d0 Merge pull request #105003 from swatisehgal/getallocatable-to-beta
podresource-api: getAllocatableResources to Beta
2021-10-13 17:43:27 -07:00
Kubernetes Prow Robot
63f66e6c99 Merge pull request #105012 from fromanirh/cpumanager-policy-options-beta
node: graduate CPUManagerPolicyOptions to beta
2021-10-08 07:32:59 -07:00
Kubernetes Prow Robot
2face135c7 Merge pull request #97415 from AlexeyPerevalov/ExcludeSharedPoolFromPodResources
Return only isolated cpus in podresources interface
2021-10-08 05:58:58 -07:00
Kubernetes Prow Robot
dd650bd41f Merge pull request #105527 from rphillips/fixes/filter_terminated_pods
kubelet: set terminated podWorker status for terminated pods
2021-10-07 22:19:51 -07:00
Ryan Phillips
3982fcae64 go fmt 2021-10-07 20:13:43 -05:00
Elana Hashman
c771698de3 Add e2e test to verify kubelet restart behaviour
Succeeded pods should not be counted as running on restart.
2021-10-07 18:30:17 -05:00
Francesco Romani
d15bff2839 e2e: node: expose the running flag
Each e2e test knows it wants to restart a running kubelet or a
non-running kubelet. The vast majority of times, we want to
restart a running kubelet (e.g. to change config or to check
some properties hold across kubelet crashes/restarts), but sometimes
we stop the kubelet, do some actions and only then restart.

To accommodate both use cases, we simply expose the `running` boolean
flag to the e2e tests.

Having `restartKubelet` explicitly restart a running kubelet
helps us troubleshoot e2e failures in which the kubelet
was supposed to be running but was not; attempting a restart
in such cases only muddied the waters further, making the
troubleshooting and the eventual fix harder.

In the happy path, there is no expected change in behaviour.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-10-07 22:15:28 +02:00
Francesco Romani
e878c20ac7 e2e: node: improve error logging
In the `restartKubelet` helper, we use `exec.Command`, whose
return value is the output of the command, but as `[]byte`.
We logged the output of the command as a raw value, making
the output, meant to be human readable, unnecessarily hard to read.

We fix this annoying behaviour by converting the output to string
before logging it, making the outcome of the command obvious to
understand.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-10-07 22:13:49 +02:00
Swati Sehgal
5043b431b4 excludesharedpool: e2e tests: Test cases for pods with non-integral CPUs
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2021-10-07 15:39:41 +01:00
Swati Sehgal
42dd01aa3f excludesharedpool: e2e tests: code refactor to handle non-integral CPUs
This patch changes cpuCount to cpuRequest in order to cater to cases
where guaranteed pods make non-integral CPU Requests.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2021-10-07 15:39:40 +01:00
Kubernetes Prow Robot
c4d802b0b5 Merge pull request #103289 from AlexeyPerevalov/DoNotExportEmptyTopology
podresources: do not export empty NUMA topology
2021-10-07 07:11:46 -07:00
Swati Sehgal
9337902648 podresource: move the checkForTopology logic inline
As per the recommendation here: https://github.com/kubernetes/kubernetes/pull/103289#pullrequestreview-766949859
we move the check inline.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2021-10-06 11:31:48 +01:00
Giuseppe Scrivano
f23e2a8c7f test, cgroupv2: adjust pagefaults test
On cgroup v2 the reported metric is recursive and includes all the
sub-cgroups.

Closes: https://github.com/kubernetes/kubernetes/issues/105301

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2021-10-05 18:00:57 +02:00
Kubernetes Prow Robot
c5ad58d8a1 Merge pull request #103372 from verb/1.22-e2e-node
Create node_e2e test for ephemeral containers
2021-10-05 05:41:09 -07:00
Kubernetes Prow Robot
9eaabb6b2e Merge pull request #104304 from endocrimes/dani/eviction
[Failing Test] Fix Kubelet Storage Eviction Tests
2021-10-04 15:16:40 -07:00
Lee Verberne
2a82228e33 Apply suggestions from code review
Co-authored-by: Sergey Kanzhelev <S.Kanzhelev@live.com>
2021-10-04 15:07:37 +02:00
Danielle Lancashire
7b91337068 e2e_node: eviction: Include names of pending-eviction pods in error 2021-10-04 13:07:40 +02:00
Danielle Lancashire
b5c2d3b389 e2e_node: eviction: Memory-backed Volumes seperation
This commit fixes the LocalStorageCapacityIsolationEviction test by
acknowledging that in its default configuration kubelet will no longer
evict memory-backed volume pods, as they cannot use more than their
assigned limit with SizeMemoryBackedVolumes enabled.

To account for the old behaviour, we also add a test that explicitly
disables the feature to test the behaviour of memory backed local
volumes in those scenarios. That test can be removed when/if the feature
gate is removed.
2021-10-04 13:07:40 +02:00
Danielle Lancashire
a8168ed543 e2e_node: Fix LocalStorage and PriorityLocalStorage eviction tests
Currently the storage eviction tests fail for a few reasons:
- They re-enter storage exhaustion after pulling the images during
  cleanup (we increase the test's storage requirements and add
  verification for future diagnosis).
- They were timing out: in practice, eviction seems to take just over
  10 minutes on an n1-standard machine in many cases. I'm raising the
  timeouts to 15 minutes to provide some padding.

This should ideally bring these tests to passing on CI, as they've now
passed locally for me several times with the remote GCE env.

Follow-up work involves diagnosing why these take so long, and
restructuring them to be less finicky.
2021-10-04 13:07:40 +02:00
Swati Sehgal
01dacd0463 podresource-api: getAllocatableResources to Beta
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2021-10-01 16:48:29 +01:00
Patrick Ohly
21d1bcd6b8 initialize logging after flag parsing
It wasn't documented that InitLogs already uses the log flush frequency, so
some commands have called it before parsing (for example, kubectl in the
original code for logs.go). The flag never had an effect in such commands.

Fixing this turned into a major refactoring of how commands set up flags and
run their Cobra command:

- component-base/logs: implicitly registering flags during package init is an
  anti-pattern that makes it impossible to use the package in commands which
  want full control over their command line. Logging flags must be added
  explicitly now, something that the new cli.Run does automatically.

- component-base/logs: AddFlags would have crashed in kubectl-convert if it
  had been called because it relied on the global pflag.CommandLine. This
  has been fixed and kubectl-convert now has the same --log-flush-frequency
  flag as other commands.

- component-base/logs/testinit: tests are an exception because flag.CommandLine
  has to be used there. This new package can be imported to add the flags to it
  once per test program.

- Normalization of the klog command line flags was inconsistent. Some commands
  unintentionally didn't normalize to the recommended format with hyphens. This
  gets fixed for sample programs, but not for production programs because
  it would be a breaking change.

This refactoring has the following user-visible effects:

- The validation error for `go run ./cmd/kube-apiserver --logging-format=json
  --add-dir-header` now references `add-dir-header` instead of `add_dir_header`.

- `staging/src/k8s.io/cloud-provider/sample` uses flags with hyphen instead of
  underscore.

- `--log-flush-frequency` is not listed anymore in the --logging-format flag's
  `non-default formats don't honor these flags` usage text because it will also
  work for non-default formats once it is needed.

- `cmd/kubelet`: the description of `--logging-format` uses hyphens instead of
  underscores for the flags, which now matches what the command is using.

- `staging/src/k8s.io/component-base/logs/example/cmd`: added logging flags.

- `apiextensions-apiserver` no longer prints a useless stack trace for `main`
  when command line parsing raises an error.
2021-09-30 13:46:49 +02:00
Francesco Romani
077c0aa1be node: graduate CPUManagerPolicyOptions to beta
We graduate the `CPUManagerPolicyOptions` feature to beta
in the 1.23 cycle, and we add new experimental feature gates
to guard new options which are planned in the 1.23 and in the
following cycles.

We introduce two additional feature gates, `CPUManagerPolicyAlphaOptions`
and `CPUManagerPolicyBetaOptions`. The basic idea is to avoid the
cumbersome process of adding a feature gate for each option, and to have
feature gates which track the maturity level of _groups_ of options.
Besides this change, the graduation process, and the process in general
for adding new policy options, is still unchanged.

The `full-pcpus-only` option added in the 1.22 cycle is intentionally
moved into the beta policy options.

For more details:
- KEP: https://github.com/kubernetes/enhancements/pull/2933
- sig-arch discussion:
  https://groups.google.com/u/1/g/kubernetes-sig-architecture/c/Nxsc7pfe5rw

Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-09-29 11:40:03 +02:00
Lee Verberne
da8ddb7485 Create node_e2e test for ephemeral containers 2021-09-27 13:13:33 +02:00
Kubernetes Prow Robot
e5c4defa8e Merge pull request #103370 from verb/1.22-cleanup-shareprocesses-e2e
Remove ShareProcessNamespace tags from e2e_node tests
2021-09-23 10:11:14 -07:00
Elana Hashman
47086a6623 Add test for recreating a static pod 2021-09-15 14:01:48 -04:00
Francesco Romani
54c7d8fbb1 e2e: TM: add option to fail instead of skip
The Topology Manager e2e tests want to run on real multi-NUMA systems
and want to consume real devices supported by device plugins; SRIOV
devices happen to be the most commonly available such devices.

CI machines are neither multi-NUMA nor expose SRIOV devices, so the
biggest portion of the tests will just skip, and we need to keep it
like this until we figure out how to enable these features.

However, some organizations can and want to run the test suite on bare
metal; in this case, the current tests will skip (not fail) on
misconfigured boxes, which reports a misleading result. It is much
better to fail if the test preconditions aren't met.

To satisfy both needs, we add an option, controlled by an environment
variable, to fail (not skip) if the machine on which the tests run
doesn't meet the expectations (multi-NUMA, 4+ cores per NUMA cell,
exposed SRIOV VFs).
We keep the old behaviour as default to stay CI friendly.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-09-13 13:23:36 +02:00
Kubernetes Prow Robot
5261433627 Merge pull request #104606 from endocrimes/dani/device-driver-deflake
[Failing Test] Fix GPU Device Driver test in kubelet-serial
2021-09-10 04:20:00 -07:00
Danielle Lancashire
b970bb5fe0 e2e_node: Update GPU tests to reflect reality
In older versions of Kubernetes (at least pre-0.19, the earliest
version this test will run on unmodified), Pods that depended on
devices could be restarted after the device plugin had been removed.
Currently, however, this isn't possible: during
ContainerManager.GetResources(), we attempt
DeviceManager.GetDeviceRunContainerOptions(), which fails because
there's no cached endpoint information for the plugin type.

This commit therefore breaks apart the existing test into two:
- One active test that validates that assignments are maintained across
  restarts
- One skipped test that validates the behaviour after GPUs have been
  removed, in case we decide that this is a bug that should be fixed in
  the future.
2021-09-06 19:03:15 +02:00
Danielle Lancashire
3884dcb909 e2e_node: run gpu pod long enough to become ready 2021-08-26 14:24:23 +02:00
Danielle Lancashire
7d7884c0e6 e2e_node: install gpu pod with PodClient
Prior to this change, the pod was not getting scheduled on the node as
we don't have a running scheduler in e2e_node. PodClient solves this
problem by manually assigning the pod to the node.
2021-08-26 14:22:22 +02:00
Danielle Lancashire
0cc8af82a1 e2e_node: use upstream gpu installer
The current GPU installer was built in 2017, from source that no longer
exists in Kubernetes ([adding commit][1]). The image was built on 2017-06-13.

Unfortunately, this installer no longer appears to work. When debugging
on the same node type as used by test-infra, it failed to build the
driver as the kernel sha was no longer available.

This led to needing to find a new way to install GPUs. The smallest
logical change was switching to [cos-gpu-installer][2]. There is a
newer version of this available on [googlesource][3] that I have not
yet tested, as it's not clear what the state of the project is; I
couldn't find docs outside of the source itself.

We install things to the same location as previously to avoid needing
extra downstream changes. There are a couple of weird issues here
however, like needing to run the container twice to correctly update the
LD Cache.

[1]: 1e77594958/cluster/gce/gci/nvidia-gpus/Dockerfile
[2]: https://github.com/GoogleCloudPlatform/cos-gpu-installer
[3]: https://cos.googlesource.com/cos/tools/+/refs/heads/master/src/cmd/cos_gpu_installer/
2021-08-26 14:09:45 +02:00
Stephen Augustus
481cf6fbe7 generated: Run hack/update-gofmt.sh
Signed-off-by: Stephen Augustus <foo@auggie.dev>
2021-08-24 15:47:49 -04:00
Alexey Perevalov
bb81101570 podresource: do not export NUMA topology if it's empty
If a device plugin returns a device without topology, keep it internally
as NUMA node -1. This lets the podresources level avoid exporting the
NUMA topology; otherwise the topology is exported with NUMA node id 0,
which is not accurate.

It's impossible to unveil this bug just by tracing json.Marshal(resp)
in the podresources client, because the NUMANode ID field has the json
property omitempty, so when ID=0 it is shown as an empty NUMANode.
To reproduce it, it is better to iterate over the devices and trace
dev.Topology.Nodes[0].ID directly.

Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>
2021-08-24 15:38:21 +00:00
Kubernetes Prow Robot
499a1f99a9 Merge pull request #104489 from liggitt/signal-buffer
Fix buffered signal channel go vet error
2021-08-20 14:53:58 -07:00
Jordan Liggitt
322bc82777 Fix buffered signal channel go vet error 2021-08-20 16:47:56 -04:00
Antonio Ojea
0cd75e8fec run hack/update-netparse-cve.sh 2021-08-20 10:42:09 +02:00
Kubernetes Prow Robot
40a9914801 Merge pull request #102916 from odinuge/serial-tests
Ensure images are pulled after eviction tests
2021-08-17 11:41:13 -07:00
Elana Hashman
c69f55519e Revert "E2E test for kubelet exit-on-lock-contention" 2021-08-11 10:45:46 -07:00
Imran Pochi
2c2661a411 e2e test: lock-file and exit-on-lock-contention
This commit adds an e2e test for the kubelet flags `--lock-file` and
`--exit-on-lock-contention`. Eventually we would like to move them to
the kubelet configuration file rather than flags.

This test is based on the premise that whenever there is contention on
the lock file (e.g. /var/run/kubelet.lock), the running kubelet must
terminate, wait for the lock on the lock file to be released, and only
then start again.

In this test we simulate that contention: the test tries to acquire the
lock on the lock file itself.

Success is determined via the kubelet health check, which must fail
while the lock is held by the test and pass again once the lock on the
lock file is released.

Signed-off-by: Imran Pochi <imran@kinvolk.io>
2021-08-09 15:27:54 +05:30
Elana Hashman
d2ed3b28b7 Revert "revert Bump DynamicKubeConfig metric deprecation to 1.23 by delta update" 2021-08-06 08:38:56 -07:00
Kubernetes Prow Robot
d4179be611 Merge pull request #104183 from SergeyKanzhelev/SergeyKanzhelev-node-e2e-approver
Add SergeyKanzhelev to node e2e test approvers
2021-08-05 20:55:28 -07:00
Kubernetes Prow Robot
4d87be3ec4 Merge pull request #104121 from dims/skip-node-e2e-test-for-recovering-from-ip-leak-with-docker
Skip node e2e test for recovering from ip leak with docker/ubuntu
2021-08-05 16:36:46 -07:00
Sergey Kanzhelev
023f6a90db Add SergeyKanzhelev to node e2e test approvers 2021-08-05 21:32:55 +00:00
Kubernetes Prow Robot
7f231f899b Merge pull request #103883 from ehashman/slow-e2es
Mark "update Node.Spec.ConfigSource" node e2es as slow
2021-08-05 14:10:37 -07:00
Kubernetes Prow Robot
01cd315f3e Merge pull request #104106 from ehashman/ehashman-node-e2e-owners
Add ehashman to node e2e test approvers
2021-08-05 08:18:49 -07:00
Kubernetes Prow Robot
3b84cc9e6b Merge pull request #104075 from kerthcet/cleanup/revert-dynamickubeconfig-metric
revert Bump DynamicKubeConfig metric deprecation to 1.23 by delta update
2021-08-05 08:18:40 -07:00
Davanum Srinivas
9351b57def Skip node e2e test for recovering from ip leak with docker
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2021-08-05 07:11:07 -04:00
kerthcet
8cf10d9a20 set showHiddenMetricsForVersion=1.22 in dynamicKubeletConfiguration test
Signed-off-by: kerthcet <kerthcet@gmail.com>
2021-08-05 01:04:54 +08:00