This commit makes the job controller honor exponential backoff for
failed pods again. Before this change, the controller created
replacement pods without any backoff. This is a regression: the
controller used to create pods with an exponentially increasing delay
(10s, 20s, 40s, ...).
The issue occurs only when the JobTrackingWithFinalizers feature is
enabled (currently the default). With this feature, we get an extra pod
update event when the finalizer of a failed pod is removed.
Note that pod failure detection and new pod creation happen in the same
reconcile loop, so the 2nd pod is created immediately after the 1st pod
fails. The backoff is only applied from the 2nd pod failure onwards,
which means the 3rd pod is created 10s after the 2nd pod fails, the 4th
pod 20s after the 3rd, and so on.
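For illustration, a minimal sketch of this delay sequence; `backoffDelay`,
the 10s base, and the cap are assumptions of the sketch, not the
controller's actual code:

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns the wait before creating the next replacement pod.
// The 2nd pod is created immediately (failure detection and pod creation
// share a reconcile loop); from then on the delay doubles: 10s, 20s, 40s...
func backoffDelay(failures int, base, cap time.Duration) time.Duration {
	if failures < 2 {
		return 0
	}
	delay := base << uint(failures-2)
	if delay <= 0 || delay > cap {
		return cap // cap the delay (and guard against shift overflow)
	}
	return delay
}

func main() {
	for f := 1; f <= 4; f++ {
		fmt.Printf("pod %d fails -> next pod after %v\n",
			f, backoffDelay(f, 10*time.Second, 6*time.Minute))
	}
}
```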
This commit fixes a few bugs:
1. Right now, each time `uncounted != nil` and the job does not see a
_new_ failure, `forget` is set to true and the job is removed from the
queue. This means the condition is also triggered each time the
finalizer of a failed pod is removed, so `NumRequeues` is reset and the
next requeue gets a backoff of 0s.
2. Updates `updatePod` to only apply backoff when it sees a particular
pod fail for the first time. This ensures that the controller does not
apply backoff when it sees a pod update event for the finalizer removal
of an already failed pod (see the sketch after this list).
3. If the `JobReadyPods` feature is enabled and the backoff is 0s, the
job is now enqueued after `podUpdateBatchPeriod`, instead of immediately
(also covered in the sketch after this list). The unit test for this
check also had a few bugs:
- `DefaultJobBackOff` is overridden to 0 in certain unit tests, so the
backoff under test was effectively always 0 and no meaningful checks
ran.
- `JobReadyPods` was not enabled for test cases that exercised code
requiring the feature gate.
- The comparison of expected and actual backoff used incorrect
calculations.
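A rough sketch of the combined fix from items 2 and 3; `enqueueDelay` and
its parameters are illustrative names, not the controller's exact API:

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
)

// enqueueDelay picks how long updatePod should wait before re-queuing the
// Job. The failure backoff applies only when this update is the first one
// showing the pod as Failed; a later update for the same pod (such as the
// finalizer removal) must not apply or reset it.
func enqueueDelay(old, cur *corev1.Pod, backoff, podUpdateBatchPeriod time.Duration) time.Duration {
	firstFailure := old.Status.Phase != corev1.PodFailed &&
		cur.Status.Phase == corev1.PodFailed
	if firstFailure {
		return backoff
	}
	// Item 3: with JobReadyPods enabled, a non-failure update is batched
	// for podUpdateBatchPeriod instead of triggering an immediate sync.
	if backoff < podUpdateBatchPeriod {
		return podUpdateBatchPeriod
	}
	return backoff
}

func main() {
	running := &corev1.Pod{Status: corev1.PodStatus{Phase: corev1.PodRunning}}
	failed := &corev1.Pod{Status: corev1.PodStatus{Phase: corev1.PodFailed}}
	fmt.Println(enqueueDelay(running, failed, 10*time.Second, time.Second)) // first failure: 10s
	fmt.Println(enqueueDelay(failed, failed, 0, time.Second))               // finalizer removal: batched 1s
}
```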
The fake client doesn't guarantee that the informer cache is updated.
If the cache is stale, the controller keeps trying to set the job's
StartTime, which breaks the test.
Change-Id: I71f26d46ea44beff88f0d03517985348654aec95
* Wait for Pods to finish before considering Failed
Limit the behavior to clusters with the PodDisruptionConditions and
JobPodFailurePolicy feature gates enabled, and to Jobs that specify a
podFailurePolicy.
Change-Id: I926391cc2521b389c8e52962afb0d4a6a845ab8f
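A minimal sketch of that guard, with the feature-gate values passed in as
plain booleans to avoid depending on the in-tree gate plumbing;
`shouldWaitForPodTermination` is a hypothetical name:

```go
package jobutil

import batchv1 "k8s.io/api/batch/v1"

// shouldWaitForPodTermination sketches the guard: the "wait for the pod to
// finish before counting it as Failed" path runs only when both feature
// gates are enabled and the Job actually defines a podFailurePolicy.
func shouldWaitForPodTermination(job *batchv1.Job, podDisruptionConditions, jobPodFailurePolicy bool) bool {
	return podDisruptionConditions &&
		jobPodFailurePolicy &&
		job.Spec.PodFailurePolicy != nil
}
```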
* Remove check for unscheduled terminating pod
Change-Id: I3dc05bb4ea3738604f01bf8cb5fc8cc0f6ea54ec
* Replace a pre-existing Failed job condition in case its status is
False or Unknown. If the status of the pre-existing condition is True,
we ignore the new condition. If there is no pre-existing Failed
condition, append the new Failed condition as before.
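A minimal sketch of that rule, assuming the batch/v1 JobCondition types;
`ensureFailedCondition` is a hypothetical helper, not the controller's
actual one:

```go
package jobutil

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// ensureFailedCondition applies the rule above: leave an existing Failed
// condition alone when its status is already True, replace it in place
// when its status is False or Unknown, and append when there is none.
func ensureFailedCondition(conds []batchv1.JobCondition, newCond batchv1.JobCondition) []batchv1.JobCondition {
	for i := range conds {
		if conds[i].Type != batchv1.JobFailed {
			continue
		}
		if conds[i].Status == corev1.ConditionTrue {
			return conds // already failed; ignore the new condition
		}
		conds[i] = newCond // status False or Unknown: replace in place
		return conds
	}
	return append(conds, newCond) // no pre-existing Failed condition
}
```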
Also, make the condition comparisons less hacky by ignoring timestamp fields
in tests.
In some rare race conditions, the job controller might create new pods after the job is declared finished.
Change-Id: I8a00429c8845463259cd7f82bb3c241d0011583c
- Lock the feature gate to true and schedule its removal for 1.26
- Remove checks on the feature gate
- Graduate the E2E test to Conformance
Change-Id: I6814819d318edaed5c86dae4055f4b050a4d39fd
Add the UIDs of Pods for which we are removing finalizers to an in-memory cache.
The controller removes UIDs from the cache as Pod updates or deletes come in.
This avoids double counting finished Pods when Pod updates arrive after Job status updates.
https://github.com/kubernetes/kubernetes/issues/105200
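A minimal sketch of such a cache, assuming a mutex-guarded set keyed by
types.UID; the controller's real structure may differ:

```go
package jobutil

import (
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// uidCache sketches the in-memory set described above: UIDs of Pods whose
// finalizers were just removed, kept until the matching Pod update or
// delete event arrives from the informer.
type uidCache struct {
	mu   sync.Mutex
	uids map[types.UID]struct{}
}

func newUIDCache() *uidCache {
	return &uidCache{uids: make(map[types.UID]struct{})}
}

// Add records a Pod UID when the controller removes its finalizer.
func (c *uidCache) Add(uid types.UID) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.uids[uid] = struct{}{}
}

// Forget drops the UID once the Pod update or delete comes in; a Pod
// still present here has already been counted in the Job status, so it
// must not be counted a second time.
func (c *uidCache) Forget(uid types.UID) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.uids, uid)
}
```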
Doing a GET right before retrying has two problems:
- It can mask conflicts
- It adds an additional delay
For retries, we are better off going through the sync backoff.
In the case of a conflict, we know that there was a Job update that will
trigger another sync, so there is no need for a rate-limited requeue.
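A minimal sketch of that error-handling policy, using apimachinery's
IsConflict helper; `handleStatusUpdateErr` is a hypothetical wrapper:

```go
package jobutil

import apierrors "k8s.io/apimachinery/pkg/api/errors"

// handleStatusUpdateErr sketches the policy above: a conflict means a Job
// update already happened and will trigger another sync on its own, so the
// error is swallowed instead of doing a masking GET or a rate-limited
// requeue; any other error goes through the regular sync backoff.
func handleStatusUpdateErr(updateErr error) error {
	if apierrors.IsConflict(updateErr) {
		return nil
	}
	return updateErr // nil on success; other errors retried via backoff
}
```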
When doing partial updates of uncountedTerminatedPods, the controller
might have removed UIDs for Pods which still had finalizers. Also, make
more room by removing, at the beginning of the sync, UIDs whose Pods no
longer have finalizers (sketched below).
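A minimal sketch of that cleanup, assuming the batchv1.JobTrackingFinalizer
constant for the finalizer name; `filterUncounted` is a hypothetical helper:

```go
package jobutil

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// filterUncounted keeps only the UIDs whose Pods still carry the tracking
// finalizer, so uncountedTerminatedPods does not hold stale entries.
func filterUncounted(uncounted []types.UID, pods map[types.UID]*corev1.Pod) []types.UID {
	kept := uncounted[:0]
	for _, uid := range uncounted {
		if p, ok := pods[uid]; ok && hasTrackingFinalizer(p) {
			kept = append(kept, uid)
		}
	}
	return kept
}

func hasTrackingFinalizer(pod *corev1.Pod) bool {
	for _, f := range pod.Finalizers {
		if f == batchv1.JobTrackingFinalizer {
			return true
		}
	}
	return false
}
```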
* Track Pod completion through Job.status.uncountedPodUIDs and a Pod
finalizer. An annotation marks whether a job should be tracked with the
new behavior. A separate work queue is used to remove finalizers from
orphan pods.
Change-Id: I1862e930257a9d1f7f1b2b0a526ed15bc8c248ad
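A minimal sketch of that opt-in flow; reusing batchv1.JobTrackingFinalizer
as the annotation key and both helper names are assumptions of the sketch:

```go
package jobutil

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// trackedWithFinalizers reports whether the Job opted into the new
// tracking behavior via the annotation.
func trackedWithFinalizers(job *batchv1.Job) bool {
	_, ok := job.Annotations[batchv1.JobTrackingFinalizer]
	return ok
}

// podTemplate returns the template for new Pods; tracked Jobs get the
// finalizer so a Pod cannot vanish before it is counted through
// Job.status.uncountedPodUIDs.
func podTemplate(job *batchv1.Job) *corev1.PodTemplateSpec {
	tmpl := job.Spec.Template.DeepCopy()
	if trackedWithFinalizers(job) {
		tmpl.Finalizers = append(tmpl.Finalizers, batchv1.JobTrackingFinalizer)
	}
	return tmpl
}
```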