Commit Graph

135 Commits

Author SHA1 Message Date
Sharpz7
7e4b5d0d49 Final Fix 2023-09-08 14:44:22 +00:00
Sharpz7
43fc6b5bdb Added suggested changes 2023-09-06 03:05:14 +00:00
Sharpz7
e9be1d7438 Test now has coverage! 2023-08-27 05:06:53 +00:00
Adam McArthur
0bc0256093 Update job_controller_test.go 2023-08-25 08:15:53 -06:00
Sharpz7
22f4b1c56a Static check fix 2023-08-25 11:35:05 +00:00
Sharpz7
70e2deb32f Fixing lint problem 2023-08-25 11:08:59 +00:00
Sharpz7
6ded53ce4d Added back test changes 2023-08-25 10:35:58 +00:00
Sharpz7
5fb049ff47 Added create job & cleanup 2023-08-25 10:35:58 +00:00
Sharpz7
ff1659cb79 Added syncjob 2023-08-25 10:35:58 +00:00
Sharpz7
f87cc43cdb Review Changes 2023-08-25 10:35:58 +00:00
Sharpz7
d08fc3a4d0 Another one creeped in 2023-08-25 10:35:58 +00:00
Sharpz7
ef6a0eb6d8 Final Lint Fix 2023-08-25 10:35:58 +00:00
Sharpz7
aa9f38c36d More Lint Fixes 2023-08-25 10:35:58 +00:00
Sharpz7
601679446a Lint fixes 2023-08-25 10:35:58 +00:00
Sharpz7
cf32ae9453 Initial Commit 2023-08-25 10:35:58 +00:00
kannon92
f73c253acc fix typos for pod replacement policy 2023-08-09 20:34:48 +00:00
Kubernetes Prow Robot
a30f6b7922 Merge pull request #119506 from mimowo/fix-job-controller-flaky-test
Fix the flaky TestJobApiBackoffReset test
2023-07-21 09:30:07 -07:00
Michal Wozniak
dbea279112 Fix the flaky TestJobApiBackoffReset test 2023-07-21 14:45:04 +02:00
kannon92
74fcf3e766 implementation of PodReplacementPolicy kep in the job controller 2023-07-21 00:44:53 +00:00
Michal Wozniak
35d0af9243 Include ignored pods when computing backoff delay for Job pod failures 2023-07-19 17:39:58 +02:00
Michał Woźniak
a15c27661e Job controller implementation of backoff limit per index (#118009) 2023-07-18 13:44:11 -07:00
Kubernetes Prow Robot
6f3856f953 Merge pull request #118883 from danielvegamyhre/kep-4017-job
Add completion index as pod label for indexed jobs
2023-07-14 12:23:50 -07:00
Daniel Vega-Myhre
037091284e fix unit test bug 2023-07-13 22:38:21 +00:00
Daniel Vega-Myhre
1ae60c0ed1 use job completion index annotation as label 2023-07-13 21:04:37 +00:00
Patrick Ohly
7d064812bb kube-controller-manager: finish conversion to contextual logging
This removes all exceptions and fixes the remaining unconverted log calls.
2023-07-12 14:57:29 +02:00
Michal Wozniak
bf48165232 Remarks to syncJobCtx 2023-07-11 09:44:08 +02:00
Michal Wozniak
990339d4c3 Introduce syncJobContext to limit the number of function parameters 2023-07-11 09:27:21 +02:00
Daniel Vega-Myhre
98c6e25c37 update name of pod index label 2023-07-10 20:11:52 +00:00
Daniel Vega-Myhre
3a02ecb341 check test case param instead of feature flag in unit test code 2023-07-06 17:30:40 +00:00
Daniel Vega-Myhre
a647f9febb default enabled pod index for test cases, add test case disabling it 2023-07-05 18:47:45 +00:00
Daniel Vega-Myhre
e0af0a5a45 add test case param for feature flag 2023-06-29 21:51:15 +00:00
Daniel Vega-Myhre
a9afaa1eee add feature gate 2023-06-27 18:07:17 +00:00
Daniel Vega-Myhre
2176053415 add completion index as pod label 2023-06-26 19:53:14 +00:00
Michal Wozniak
8ed23558b4 Do not set jm.syncJobBatchPeriod=0 if not needed 2023-06-22 11:10:53 +02:00
Michal Wozniak
784a309b91 Do not error in Job controller sync when there are pod failures 2023-06-20 11:31:24 +02:00
Michal Wozniak
74c5ff97f1 Lower the constants for the rate limiter in Job controller 2023-06-16 17:00:04 +02:00
Michal Wozniak
c51a422d78 Cleanup job controller handling of backoff 2023-06-16 14:53:27 +02:00
Ziqi Zhao
7bc449d7e0 add contextual logging to job-controller
Signed-off-by: Ziqi Zhao <zhaoziqi9146@gmail.com>
2023-06-14 13:40:02 +08:00
Michal Wozniak
2f6b1d3c0f Ensure Job sync invocations are batched by 1s periods 2023-06-07 17:32:46 +02:00
Yuki Iwai
e4340f0d9b Job: Use generic Set in controller
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-05-08 15:02:23 +09:00
Sathyanarayanan Saravanamuthu
c84c8add70 Decouple batch/job back-off logic from workqueues (#114768)
* batch/job: decouple backoff from workqueue

Signed-off-by: Sathyanarayanan Saravanamuthu <sathyanarays@vmware.com>

* Resolving review comments

* Resolving more review comments

* Resolving review comments

Signed-off-by: Sathyanarayanan Saravanamuthu <sathyanarays@vmware.com>

* Computing finish time to now when FinishedAt is unix epoch

* Addressing review comments

Signed-off-by: Sathyanarayanan Saravanamuthu <sathyanarays@vmware.com>

---------

Signed-off-by: Sathyanarayanan Saravanamuthu <sathyanarays@vmware.com>
2023-03-16 10:15:21 -07:00
Daniel Vega-Myhre
d41302312e update validation logic so completions is mutable iff completions is modified in tandem with parallelism so completions == parallelism 2023-02-23 03:25:16 +00:00
Daniel Vega-Myhre
2a81337e7c update prev succeeded indexes for indexed jobs unconditionally 2023-01-31 19:15:53 +00:00
Kubernetes Prow Robot
5550064bc2 Merge pull request #115063 from kannon92/tracking-remove-comments
tracking with finalizers is the default way for the job controller so comments are not needed that say we are tracking with finalizers
2023-01-17 07:56:44 -08:00
Kubernetes Prow Robot
9af5ae0365 Merge pull request #115030 from kannon92/remove-pod-error-job-tracking
Update SyncJob with PodControllerError updates in job unit tests
2023-01-13 12:08:14 -08:00
Kubernetes Prow Robot
70217a4083 Merge pull request #114944 from mimowo/fix-active-deadline-test
Fix the job controller unit test for enforcing ActiveDeadlineSeconds
2023-01-13 10:46:26 -08:00
kannon92
4890928b78 tracking with finalizers is the default way for the job controller 2023-01-13 16:48:35 +00:00
kannon92
3a838033f8 Update SyncJob with PodControllerError updates in job unit tests 2023-01-13 16:39:18 +00:00
Michal Wozniak
7065b42bb2 Fix the job controller unit test for enforcing ActiveDeadlineSeconds 2023-01-13 16:48:15 +01:00
Nikhita Raghunath
fd8d92a29d pkg/controller/job: re-honor exponential backoff
This commit makes the job controller re-honor exponential backoff for
failed pods. Before this commit, the controller created pods without any
backoff. This is a regression because the controller used to
create pods with an exponential backoff delay before (10s, 20s, 40s ...).

The issue occurs only when the JobTrackingWithFinalizers feature is
enabled (which is enabled by default right now). With this feature, we
get an extra pod update event when the finalizer of a failed pod is
removed.

Note that the pod failure detection and new pod creation happen in the
same reconcile loop so the 2nd pod is created immediately after the 1st
pod fails. The backoff is only applied on 2nd pod failure, which means
that the 3rd pod is created 10s after the 2nd pod, the 4th pod is created
20s after the 3rd pod and so on.

This commit fixes a few bugs:

1. Right now, each time `uncounted != nil` and the job does not see a
_new_ failure, `forget` is set to true and the job is removed from the
queue. Which means that this condition is also triggered each time the
finalizer for a failed pod is removed and `NumRequeues` is reset, which
results in a backoff of 0s.

2. Updates `updatePod` to only apply backoff when we see a particular
pod failed for the first time. This is necessary to ensure that the
controller does not apply backoff when it sees a pod update event
for finalizer removal of a failed pod.

3. If `JobsReadyPods` feature is enabled and backoff is 0s, the job is
now enqueued after `podUpdateBatchPeriod` seconds, instead of 0s. The
unit test for this check also had a few bugs:
    - `DefaultJobBackOff` is overwritten to 0 in certain unit tests,
    which meant that `DefaultJobBackOff` was considered to be 0,
    effectively not running any meaningful checks.
    - `JobsReadyPods` was not enabled for test cases that ran tests
    which required the feature gate to be enabled.
    - The check for expected and actual backoff had incorrect
    calculations.
2023-01-12 20:34:10 +05:30