Commit Graph

166 Commits

Author SHA1 Message Date
Yuki Iwai
a85f587984 Job: Use built-in min function instead of integer package
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-11-17 14:10:00 +09:00
Dejan Pejchev
88c0a8be1b
feat: add job_pods_creation_total metric 2023-10-24 17:49:04 +02:00
Dejan Zele Pejchev
f8a4e343a1
Fix tracking of terminating Pods when nothing else changes (#121342)
* cleanup: refactor pod replacement policy integration test into staged assertion

* cleanup: remove typo in job_test.go

* refactor PodReplacementPolicy test and remove test for defaulting the policy

* fix issue with missing update in job controller for terminating status and refactor pod replacement policy integration test

* use t.Cleanup instead of defer in PodReplacementPolicy integration tests

* revert t.Cleanup to defer for resetting feature flag in PodReplacementPolicy integration tests
2023-10-24 15:04:46 +02:00
Kubernetes Prow Robot
8149ab3f3f
Merge pull request #121356 from mimowo/backoff-limit-per-index-beta
Graduate BackoffLimitPerIndex to Beta
2023-10-23 18:39:58 +02:00
Michal Wozniak
b0d04d933b Introduce the job_finished_indexes_total metric 2023-10-20 15:19:04 +02:00
Michal Wozniak
6dd0ad5c0f Graduate BackoffLimitPerIndex to Beta 2023-10-19 12:18:36 +02:00
Kubernetes Prow Robot
6d70013af5
Merge pull request #121147 from kannon92/rm-at-least-no-terminating-count
Remove terminating count from rmAtLeast
2023-10-18 00:44:51 +02:00
Kubernetes Prow Robot
27ff547a14
Merge pull request #121011 from kannon92/job-pod-replacement-policy-feature-on-but-api-specified
Fix panic when enablement of pod replacement policy is skewed
2023-10-17 21:28:48 +02:00
Yuki Iwai
201c30fba8
Job: Handle error returned from AddEventHandler function (#119917)
* Job: Handle error returned from AddEventHandler function

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Use an error message similar to the one in CronJob

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Clean up error messages

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Put the testing.T on the second place in the args for the newControllerFromClient function

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Put the testing.T on the second place in the args for the newControllerFromClientWithClock function

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Call t.Helper()

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Put the testing.TB on the second place in the args for the createJobControllerWithSharedInformers function and call tb.Helper() there

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Put the testing.TB on the second place in the args for the startJobControllerAndWaitForCaches function and call tb.Helper() there

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Adapt TestFinializerCleanup to the eventhandler error

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

---------

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-10-17 21:28:34 +02:00
Kevin Hannon
7a1ac18bc8 Fix panic if there are more terminating pods than active pods
Co-authored-by: Aldo Culquicondor <1299064+alculquicondor@users.noreply.github.com>
2023-10-17 14:50:38 -04:00
Kevin Hannon
d7ee6b9d1b fix possible panic if pod replacement policy is turned on and jobs do not set pod replacement policy 2023-10-11 08:37:50 -04:00
Kevin Hannon
b96a074bcd convert pointer to ptr for job controller 2023-10-05 09:30:01 -04:00
Kevin Hannon
a62eb45ae2 Rename job reasons to JobReasons as part of api review 2023-09-19 13:10:22 -04:00
Kevin Hannon
c6e9fba79b move reasons to api package for job controller 2023-09-14 13:24:29 -04:00
Sharpz7
43fc6b5bdb Added suggested changes 2023-09-06 03:05:14 +00:00
Sharpz7
e9be1d7438 Test now has coverage! 2023-08-27 05:06:53 +00:00
Sharpz7
cf32ae9453 Initial Commit 2023-08-25 10:35:58 +00:00
Sharpz7
297f04b74a Added function to remove finalizers as backup 2023-08-25 10:35:57 +00:00
Kubernetes Prow Robot
df493712e4
Merge pull request #119874 from kannon92/pod-replacement-policy-typos
fix typos for pod replacement policy
2023-08-17 11:21:34 -07:00
Kubernetes Prow Robot
d5f2420309
Merge pull request #119914 from luohaha3123/job-feature
Job: Change job controller methods receiver to pointer
2023-08-15 23:14:05 -07:00
lhaha
947c9376f6 change struct methods receiver to pointer 2023-08-12 10:21:14 +08:00
Yuki Iwai
6f27733af8 Job: Replace deprecated workqueue library with supported one
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-08-11 20:35:36 +09:00
kannon92
f73c253acc fix typos for pod replacement policy 2023-08-09 20:34:48 +00:00
kannon92
74fcf3e766 implementation of PodReplacementPolicy kep in the job controller 2023-07-21 00:44:53 +00:00
Michal Wozniak
35d0af9243 Include ignored pods when computing backoff delay for Job pod failures 2023-07-19 17:39:58 +02:00
Michał Woźniak
a15c27661e
Job controller implementation of backoff limit per index (#118009) 2023-07-18 13:44:11 -07:00
Kubernetes Prow Robot
84a999923f
Merge pull request #119335 from mimowo/use-final-diff-for-job-pod-creation
Ensure final diff is used for setting expectations for Job pod creation
2023-07-14 15:20:54 -07:00
Kubernetes Prow Robot
6f3856f953
Merge pull request #118883 from danielvegamyhre/kep-4017-job
Add completion index as pod label for indexed jobs
2023-07-14 12:23:50 -07:00
Michal Wozniak
9564bdc39d Ensure final diff is used for setting expectations for Job pod creation 2023-07-14 19:09:39 +02:00
Michal Wozniak
7e3b53042b Pass Job context down to firstPendingIndexes 2023-07-13 16:11:06 +02:00
Patrick Ohly
7d064812bb kube-controller-manager: finish conversion to contextual logging
This removes all exceptions and fixes the remaining unconverted log calls.
2023-07-12 14:57:29 +02:00
Michal Wozniak
bf48165232 Remarks to syncJobCtx 2023-07-11 09:44:08 +02:00
Michal Wozniak
990339d4c3 Introduce syncJobContext to limit the number of function parameters 2023-07-11 09:27:21 +02:00
Aldo Culquicondor
f7a1fb76f4
Only declare job as finished after removing all finalizers
Change-Id: Id4b01b0e6fabe24134e57e687356e0fc613cead4
2023-07-07 14:08:19 -04:00
kannon92
921b7e6e8f remove equalReady and replace with k8s util function 2023-07-05 20:11:48 +00:00
Daniel Vega-Myhre
a9afaa1eee add feature gate 2023-06-27 18:07:17 +00:00
Daniel Vega-Myhre
2176053415 add completion index as pod label 2023-06-26 19:53:14 +00:00
Michal Wozniak
8ed23558b4 Do not set jm.syncJobBatchPeriod=0 if not needed 2023-06-22 11:10:53 +02:00
Michal Wozniak
784a309b91 Do not error in Job controller sync when there are pod failures 2023-06-20 11:31:24 +02:00
Michal Wozniak
74c5ff97f1 Lower the constants for the rate limiter in Job controller 2023-06-16 17:00:04 +02:00
Michal Wozniak
c51a422d78 Cleanup job controller handling of backoff 2023-06-16 14:53:27 +02:00
Ziqi Zhao
7bc449d7e0 add contextual logging to job-controller
Signed-off-by: Ziqi Zhao <zhaoziqi9146@gmail.com>
2023-06-14 13:40:02 +08:00
Michal Wozniak
2f6b1d3c0f Ensure Job sync invocations are batched by 1s periods 2023-06-07 17:32:46 +02:00
Michal Wozniak
70d3bb43e5 Adjust the algorithm for computing the pod finish time
Change-Id: Ic282a57169cab8dc498574f08b081914218a1039
2023-06-05 10:06:56 +02:00
Michal Wozniak
0fe27a06f9 Cleanup the Job controller handling of terminating pods 2023-05-19 09:52:08 +02:00
Yuki Iwai
e4340f0d9b Job: Use generic Set in controller
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-05-08 15:02:23 +09:00
Sathyanarayanan Saravanamuthu
c84c8add70
Decouple batch/job back-off logic from workqueues (#114768)
* batch/job: decouple backoff from workqueue

Signed-off-by: Sathyanarayanan Saravanamuthu <sathyanarays@vmware.com>

* Resolving review comments

* Resolving more review comments

* Resolving review comments

Signed-off-by: Sathyanarayanan Saravanamuthu <sathyanarays@vmware.com>

* Computing finish time to now when FinishedAt is unix epoch

* Addressing review comments

Signed-off-by: Sathyanarayanan Saravanamuthu <sathyanarays@vmware.com>

---------

Signed-off-by: Sathyanarayanan Saravanamuthu <sathyanarays@vmware.com>
2023-03-16 10:15:21 -07:00
kannon92
32ac4a9581 left over uncounted from tracking cleanup 2023-02-22 16:45:53 +00:00
Daniel Vega-Myhre
2a81337e7c update prev succeeded indexes for indexed jobs unconditionally 2023-01-31 19:15:53 +00:00
Nikhita Raghunath
fd8d92a29d pkg/controller/job: re-honor exponential backoff
This commit makes the job controller re-honor exponential backoff for
failed pods. Before this commit, the controller created pods without any
backoff. This is a regression because the controller used to
create pods with an exponential backoff delay before (10s, 20s, 40s ...).

The issue occurs only when the JobTrackingWithFinalizers feature is
enabled (which is enabled by default right now). With this feature, we
get an extra pod update event when the finalizer of a failed pod is
removed.

Note that the pod failure detection and new pod creation happen in the
same reconcile loop so the 2nd pod is created immediately after the 1st
pod fails. The backoff is only applied on 2nd pod failure, which means
that the 3rd pod is created 10s after the 2nd pod, the 4th pod is created
20s after the 3rd pod, and so on.

This commit fixes a few bugs:

1. Right now, each time `uncounted != nil` and the job does not see a
_new_ failure, `forget` is set to true and the job is removed from the
queue. This condition is therefore also triggered each time the
finalizer for a failed pod is removed, so `NumRequeues` is reset, which
results in a backoff of 0s.

2. Updates `updatePod` to only apply backoff when we see a particular
pod fail for the first time. This is necessary to ensure that the
controller does not apply backoff when it sees a pod update event
for finalizer removal of a failed pod.

3. If `JobsReadyPods` feature is enabled and backoff is 0s, the job is
now enqueued after `podUpdateBatchPeriod` seconds, instead of 0s. The
unit test for this check also had a few bugs:
    - `DefaultJobBackOff` is overwritten to 0 in certain unit tests,
    which meant that `DefaultJobBackOff` was considered to be 0,
    effectively not running any meaningful checks.
    - `JobsReadyPods` was not enabled for test cases that ran tests
    which required the feature gate to be enabled.
    - The check for expected and actual backoff had incorrect
    calculations.
2023-01-12 20:34:10 +05:30