kubernetes

Author	SHA1	Message	Date
Clayton Coleman	6b9a381185	kubelet: Force deleted pods can fail to move out of terminating If a CRI error occurs during the terminating phase after a pod is force deleted (API or static) then the housekeeping loop will not deliver updates to the pod worker which prevents the pod's state machine from progressing. The pod will remain in the terminating phase but no further attempts to terminate or cleanup will occur until the kubelet is restarted. The pod worker now maintains a store of the pod's state that it is attempting to reconcile and uses that to resync unknown pods when SyncKnownPods() is invoked, so that failures in sync methods for unknown pods no longer hang forever. The pod worker's store tracks desired updates and the last update applied on podSyncStatuses. Each goroutine now synchronizes to acquire the next work item, context, and whether the pod can start. This synchronization moves the pending update to the stored last update, which will ensure third parties accessing pod worker state don't see updates before the pod worker begins synchronizing them. As a consequence, the update channel becomes a simple notifier (struct{}) so that SyncKnownPods can coordinate with the pod worker to create a synthetic pending update for unknown pods (i.e. no one besides the pod worker has data about those pods). Otherwise the pending update info would be hidden inside the channel. In order to properly track pending updates, we have to be very careful not to mix RunningPods (which are calculated from the container runtime and are missing all spec info) and config-sourced pods. Update the pod worker to avoid using ToAPIPod() and instead require the pod worker to directly use update.Options.Pod or update.Options.RunningPod for the correct methods. Add a new SyncTerminatingRuntimePod to prevent accidental invocations of runtime only pod data. 
Finally, fix SyncKnownPods to replay the last valid update for undesired pods which drives the pod state machine towards termination, and alter HandlePodCleanups to: (1) terminate runtime pods that aren't known to the pod worker, and (2) launch admitted pods that aren't known to the pod worker. Any started pods receive a replay until they reach the finished state, and then are removed from the pod worker. When a desired pod is detected as not being in the worker, the usual cause is that the pod was deleted and recreated with the same UID (almost always a static pod since API UID reuse is statistically unlikely). This simplifies the previous restartable pod support. We are careful to filter for active pods (those not already terminal or those which have been previously rejected by admission). We also force a refresh of the runtime cache to ensure we don't see an older version of the state. Future changes will allow other components that need to view the pod worker's actual state (not the desired state the podManager represents) to retrieve that info from the pod worker. Several bugs in pod lifecycle have been undetectable at runtime because the kubelet does not clearly describe the number of pods in use. To better report, add the following metrics: kubelet_desired_pods: Pods the pod manager sees; kubelet_active_pods: "Admitted" pods that gate new pods; kubelet_mirror_pods: Mirror pods the kubelet is tracking; kubelet_working_pods: Breakdown of pods from the last sync in each phase, orphaned state, and static or not; kubelet_restarted_pods_total: A counter for pods that saw a CREATE before the previous pod with the same UID was finished; kubelet_orphaned_runtime_pods_total: A counter for pods detected at runtime that were not known to the kubelet. Will be populated at Kubelet startup and should never be incremented after. Add a metric check to our e2e tests that verifies the values are captured correctly during a serial test, and then verify them in detail in unit tests. 
Adds 23 series to the kubelet /metrics endpoint.	2023-03-08 22:03:51 -06:00
David Porter	c5a1f0188b	test: Add node e2e test to verify static pod termination Add node e2e test to verify that static pods can be started after a previous static pod with the same config temporarily failed termination. The scenario is: 1. Static pod is started 2. Static pod is deleted 3. Static pod termination fails (internally `syncTerminatedPod` fails) 4. At later time, pod termination should succeed 5. New static pod with the same config is (re)-added 6. New static pod is expected to start successfully To repro this scenario, setup a pod using a NFS mount. The NFS server is stopped which will result in volumes failing to unmount and `syncTerminatedPod` to fail. The NFS server is later started, allowing the volume to unmount successfully. xref: 1. https://github.com/kubernetes/kubernetes/pull/113145#issuecomment-1289587988 2. https://github.com/kubernetes/kubernetes/pull/113065 3. https://github.com/kubernetes/kubernetes/pull/113093 Signed-off-by: David Porter <david@porter.me>	2023-03-03 10:00:48 -06:00
David Porter	1c75c2cda8	test: Add e2e to verify static pod termination Add a node e2e to verify that if a static pod is terminated while the container runtime or CRI returns an error, the pod is eventually terminated successfully. This test serves as a regression test for k8s.io/issue/113145 which fixes an issue where force deleted pods may not be terminated if the container runtime fails during a `syncTerminatingPod`. To test this behavior, start a static pod, stop the container runtime, and later start the container runtime. The static pod is expected to eventually terminate successfully. To start and stop the container runtime, we need to find the container runtime systemd unit name. Introduce a util function `findContainerRuntimeServiceName` which finds the unit name by getting the pid of the container runtime from the existing `ContainerRuntimeProcessName` flag passed into node e2e and using systemd dbus `GetUnitNameByPID` function to convert the pid of the container runtime to a unit name. Using the unit name, introduce helper functions to start and stop the container runtime. Signed-off-by: David Porter <david@porter.me>	2023-03-03 10:00:48 -06:00
Kubernetes Prow Robot	74f0819069	Merge pull request #116152 from torredil/fix-windows-e2e-test Add windows nodeSelector to provisioning functions	2023-03-02 11:36:56 -08:00
Kubernetes Prow Robot	ab002db788	Merge pull request #116223 from logicalhan/metric-docs include beta metrics in documentation and update docs for metrics	2023-03-02 10:31:04 -08:00
Kubernetes Prow Robot	b6d102d634	Merge pull request #116071 from yuanchen8911/symlink Add symlink data verification to statefulset e2e	2023-03-02 05:43:07 -08:00
Kubernetes Prow Robot	78e5db0931	Merge pull request #115107 from swatisehgal/handle-device-mgr-recovery-sample-dp-changes node: device-mgr: sample device plugin: Add support to control registration process	2023-03-02 05:42:55 -08:00
Kubernetes Prow Robot	949bee0118	Merge pull request #116189 from marosset/windows-hyperv-basic-e2e-test Adding e2e test to verify hyperv container is running inside a VM on Windows	2023-03-01 22:27:07 -08:00
Kubernetes Prow Robot	d788d436c9	Merge pull request #115893 from mgoltzsche/go-jose-update-2.6 bump go-jose to v2.6.0	2023-03-01 20:23:06 -08:00
Kubernetes Prow Robot	59a7e34052	Merge pull request #115442 from bobbypage/unknown_pods_test test: Add e2e node test to check for unknown pods	2023-03-01 19:08:55 -08:00
Max Goltzsche	df8fa2eab5	bump go-jose to v2.6.0 Update go-jose from v2.2.2 to v2.6.0. This is to make the kubernetes code compatible with newer go-jose versions that have a small breaking change (`jwt.NewNumericDate()` returns a pointer). Signed-off-by: Max Goltzsche <max.goltzsche@gmail.com>	2023-03-02 02:53:17 +01:00
Kubernetes Prow Robot	1646ed8222	Merge pull request #116057 from bobbypage/nodee2elog test: Add log artifact for ginkgo node e2e and tune default ginkgo flags	2023-03-01 16:55:16 -08:00
Kubernetes Prow Robot	dfa03231da	Merge pull request #116110 from knabben/knabben/polling-hpc-stats Poll for stats until Windows kubelet present it in the stats endpoint	2023-03-01 15:11:27 -08:00
Kubernetes Prow Robot	51dedff4f3	Merge pull request #115277 from pohly/klog-update klog update	2023-03-01 15:11:16 -08:00
Mark Rossetti	ab020ee628	Adding e2e test to verify hyperv container is running inside a VM on Windows Signed-off-by: Mark Rossetti <marosset@microsoft.com>	2023-03-01 14:08:46 -08:00
Kubernetes Prow Robot	b0c949d9dd	Merge pull request #116148 from aramase/aramase/f/ci-metrics [KMSv2] update ci script to create cluster and gather metrics	2023-03-01 12:39:30 -08:00
Amim Knabben	3fd3a76eb9	Poll for stats until Windows kubelet present it in the stats endpoint	2023-03-01 17:17:23 -03:00
Han Kang	0199276f85	include beta metrics in documentation and update docs for metrics	2023-03-01 11:32:19 -08:00
Kubernetes Prow Robot	60eefa8066	Merge pull request #115425 from pohly/scheduler-perf-benchstat scheduler perf: benchstat support	2023-03-01 11:19:29 -08:00
Kubernetes Prow Robot	fe671737ec	Merge pull request #116181 from pohly/dra-test-driver-update e2e: dra test driver update	2023-03-01 10:10:39 -08:00
Patrick Ohly	961819a4d0	dependencies: update klog v2.90.1 This improves performance of the text formatting and ktesting. Because ktesting no longer buffers messages by default, one unit test needs to ask for that explicitly.	2023-03-01 19:03:50 +01:00
Anish Ramasekar	c52ac0d59d	[KMSv2] update ci script to create cluster and gather metrics Signed-off-by: Anish Ramasekar <anish.ramasekar@gmail.com>	2023-03-01 18:03:37 +00:00
Patrick Ohly	74785074c6	e2e dra: update logging When running as part of the scheduler_perf benchmark testing, we want to print less information by default, so we should use V to limit verbosity. Pretty-printing doesn't belong in "application" code. I am moving that into the ktesting formatting (https://github.com/kubernetes/kubernetes/pull/116180).	2023-03-01 15:02:03 +01:00
Patrick Ohly	106fce6fae	e2e dra: improve goroutine handling There is an API now to wait for informer factory goroutine termination. While at it, an incorrect comment for mutex locking gets removed.	2023-03-01 15:00:30 +01:00
Justin SB	50a025acdb	e2e: Remove dead code in tests We were building a local pod variable that we were no longer using. Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>	2023-03-01 08:08:33 -05:00
Kubernetes Prow Robot	9ef145d3a7	Merge pull request #116127 from pacoxu/negative-grace-period retry for negative TerminationGracePeriodSeconds update	2023-03-01 04:29:16 -08:00
Swati Sehgal	7ea35d0cd8	node: device-mgr: sample device plugin: manifest to avoid registration Signed-off-by: Swati Sehgal <swsehgal@redhat.com>	2023-03-01 10:01:34 +00:00
Swati Sehgal	2c8fc26b89	node: device-mgr: sample device plugin: control registration process Update the sample device plugin to enable the e2e node tests (or any other entity with full access to the node filesystem) to control the registration process. We add a new environment variable `REGISTER_CONTROL_FILE`. The value of this variable must be a file which prevents the plugin from registering itself while it's present. Once removed, the plugin will go on and complete the registration. The plugin will automatically detect the parent directory in which the file resides and detect deletions, unblocking the registration process. If the file is specified but inaccessible, the plugin will fail. If the file is not specified, the registration process will progress as usual and never pause. The plugin will need read access to the parent directory. This feature is useful because it is not possible to control the order in which the pods are recovered after node reboot/kubelet restart. In this approach, the testing environment will create a directory and then an empty file to pause the registration process of the plugin. Once pointed to that file, the plugin will start and wait for it to be deleted. Only after the file has been deleted will the plugin proceed to registration. This feature is used in #114640 where e2e test is implemented to simulate scenarios where application pods requesting devices come up before the device plugin pod on node reboot/kubelet restart. Co-authored-by: Francesco Romani <fromani@redhat.com> Signed-off-by: Swati Sehgal <swsehgal@redhat.com>	2023-03-01 10:00:52 +00:00
Paco Xu	7d8437933e	retry on conflict for negative TerminationGracePeriodSeconds update	2023-03-01 12:55:58 +08:00
Kubernetes Prow Robot	93a5181871	Merge pull request #116022 from nilekhc/reference-implementation-provider [kmsv2] feat: add kms mock plugin for e2e tests	2023-02-28 17:57:17 -08:00
Nilekh Chaudhari	43acba8084	feat: kms base64 plugin for e2e tests Signed-off-by: Nilekh Chaudhari <1626598+nilekhc@users.noreply.github.com>	2023-03-01 00:11:17 +00:00
Kubernetes Prow Robot	9b213330f5	Merge pull request #116153 from alexzielenski/podsecurity-featuregate-re-enable skip special features in TestPodSecurityGAOnly	2023-02-28 16:07:23 -08:00
Kubernetes Prow Robot	6e202d6fdb	Merge pull request #116116 from ahg-g/ahg-mutable-job-ga Graduate JobMutableNodeSchedulingDirectives feature to GA	2023-02-28 14:53:52 -08:00
Kubernetes Prow Robot	0469455ff7	Merge pull request #116082 from mimowo/fix-oomkiller-test Fix the flaky OOMKiller test by sleep at start	2023-02-28 14:53:37 -08:00
Patrick Ohly	cc4bcd1d8e	scheduler_perf: report data items as benchmark results This replaces the pretty useless us/op metric (useless because it includes setup and teardown times) with the same values that also get stored in the JSON file. The main advantage is that benchstat can be used to analyze and compare results.	2023-02-28 23:08:23 +01:00
Patrick Ohly	961129c5f1	scheduler_perf: add logging flags This enables testing of different real production configurations (JSON vs. text, different log levels, contextual logging).	2023-02-28 23:08:17 +01:00
Patrick Ohly	00d1459530	test/utils: extend ktesting The upstream ktesting has to be very flexible to accommodate different ways of using it. In Kubernetes, we can be opinionated and make certain choices, like using klog flags, and only those.	2023-02-28 23:06:00 +01:00
Patrick Ohly	c008732948	test/integration: add StartEtcd In contrast to EtcdMain, it can be called by individual tests or benchmarks and each caller will get a fresh etcd instance. However, it uses the same underlying code and the same port for all instances, so tests cannot run in parallel.	2023-02-28 23:05:17 +01:00
Alexander Zielenski	9ef1fc543f	skip special features in TestPodSecurityGAOnly was causing some alpha/beta features to be disabled after running sometimes	2023-02-28 13:21:35 -08:00
torredil	42909af615	Add windows nodeSelector to provisioning functions Signed-off-by: torredil <torredil@amazon.com>	2023-02-28 21:21:29 +00:00
ahg-g	2ecd24011a	Graduate JobMutableNodeSchedulingDirectives feature to GA	2023-02-28 15:47:13 +00:00
Kubernetes Prow Robot	6f68a13696	Merge pull request #115961 from pohly/e2e-framework-deprecate-gomega-wrappers e2e framework: deprecate gomega wrappers	2023-02-28 06:27:29 -08:00
Kubernetes Prow Robot	04e7021d06	Merge pull request #114625 from Divya063/feature-private-image-registry [E2E] Add support for pulling images from private registry	2023-02-28 06:27:17 -08:00
Kubernetes Prow Robot	806b215cce	Merge pull request #115987 from yuanchen8911/cleanup Replace closures in test packages	2023-02-28 01:47:29 -08:00
Divya Rani	a8b1e57246	add support for pulling images from private registry	2023-02-28 00:40:51 -08:00
David Porter	e001884594	test: Add some default ginkgo flags for node e2e Add the following ginkgo flags for each node e2e similar to the existing hack/ginkgo-e2e.sh script. * --no-color, colors aren't rendered properly in prow and make examining the log in text editors more difficult, so let's disable them. `hack/ginkgo-e2e.sh` (used for kind e2e tests) also disables them already. * -v, enable verbose logs. This is needed so we get more detailed info even when the tests pass. This is useful so we can compare successful runs to failed runs. Signed-off-by: David Porter <david@porter.me>	2023-02-28 00:24:40 -08:00
David Porter	e9ecdf3534	test: Emit ginkgo log for each node e2e When running multiple node e2e with multiple machine images, the tests are run separately for each node. The final build log has all of the results for each of the hosts combined together, which makes debugging the log difficult. To make it easier, emit a log for each host that was run. This log will be written to the results directory and uploaded as an artifact in prow jobs. Signed-off-by: David Porter <david@porter.me>	2023-02-28 00:21:34 -08:00
David Porter	0980f026c9	test: Remove tests argument from node e2e image config This was never being used, the only config that used it was deleted in https://github.com/kubernetes/test-infra/pull/26017 so we don't need this anymore, so let's delete it. Signed-off-by: David Porter <david@porter.me>	2023-02-28 00:21:03 -08:00
Kubernetes Prow Robot	b9fd1802ba	Merge pull request #102884 from vinaykul/restart-free-pod-vertical-scaling In-place Pod Vertical Scaling feature	2023-02-27 22:53:15 -08:00
Kevin Delgado	9828f62237	turn field validation e2e tests into conformance tests	2023-02-27 14:39:21 -08:00

22913 Commits