Commit Graph

160 Commits

Author SHA1 Message Date
Michal Wozniak
bf48165232 Remarks to syncJobCtx 2023-07-11 09:44:08 +02:00
Michal Wozniak
990339d4c3 Introduce syncJobContext to limit the number of function parameters 2023-07-11 09:27:21 +02:00
Daniel Vega-Myhre
98c6e25c37 update name of pod index label 2023-07-10 20:11:52 +00:00
Daniel Vega-Myhre
3a02ecb341 check test case param instead of feature flag in unit test code 2023-07-06 17:30:40 +00:00
Daniel Vega-Myhre
a647f9febb default enabled pod index for test cases, add test case disabling it 2023-07-05 18:47:45 +00:00
Daniel Vega-Myhre
e0af0a5a45 add test case param for feature flag 2023-06-29 21:51:15 +00:00
Daniel Vega-Myhre
a9afaa1eee add feature gate 2023-06-27 18:07:17 +00:00
Daniel Vega-Myhre
2176053415 add completion index as pod label 2023-06-26 19:53:14 +00:00
Michal Wozniak
8ed23558b4 Do not set jm.syncJobBatchPeriod=0 if not needed 2023-06-22 11:10:53 +02:00
Michal Wozniak
784a309b91 Do not error in Job controller sync when there are pod failures 2023-06-20 11:31:24 +02:00
Michal Wozniak
74c5ff97f1 Lower the constants for the rate limiter in Job controller 2023-06-16 17:00:04 +02:00
Michal Wozniak
c51a422d78 Cleanup job controller handling of backoff 2023-06-16 14:53:27 +02:00
Ziqi Zhao
7bc449d7e0 add contextual logging to job-controller
Signed-off-by: Ziqi Zhao <zhaoziqi9146@gmail.com>
2023-06-14 13:40:02 +08:00
Michal Wozniak
2f6b1d3c0f Ensure Job sync invocations are batched by 1s periods 2023-06-07 17:32:46 +02:00
Yuki Iwai
e4340f0d9b Job: Use generic Set in controller
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-05-08 15:02:23 +09:00
Sathyanarayanan Saravanamuthu
c84c8add70
Decouple batch/job back-off logic from workqueues (#114768)
* batch/job: decouple backoff from workqueue

Signed-off-by: Sathyanarayanan Saravanamuthu <sathyanarays@vmware.com>

* Resolving review comments

* Resolving more review comments

* Resolving review comments

Signed-off-by: Sathyanarayanan Saravanamuthu <sathyanarays@vmware.com>

* Computing finish time to now when FinishedAt is unix epoch

* Addressing review comments

Signed-off-by: Sathyanarayanan Saravanamuthu <sathyanarays@vmware.com>

---------

Signed-off-by: Sathyanarayanan Saravanamuthu <sathyanarays@vmware.com>
2023-03-16 10:15:21 -07:00
Daniel Vega-Myhre
d41302312e update validation logic so completions is mutable iff completions is modified in tandem with parallelsim so completions == parallelism 2023-02-23 03:25:16 +00:00
Daniel Vega-Myhre
2a81337e7c update prev succeeded indexes for indexed jobs unconditionally 2023-01-31 19:15:53 +00:00
Kubernetes Prow Robot
5550064bc2
Merge pull request #115063 from kannon92/tracking-remove-comments
tracking with finalizers is the default way for the job controller so comments are not needed that say we are tracking with finalizers
2023-01-17 07:56:44 -08:00
Kubernetes Prow Robot
9af5ae0365
Merge pull request #115030 from kannon92/remove-pod-error-job-tracking
Update SyncJob with PodControllerError updates in job unit tests
2023-01-13 12:08:14 -08:00
Kubernetes Prow Robot
70217a4083
Merge pull request #114944 from mimowo/fix-active-deadline-test
Fix the job controller unit test for enforcing ActiveDeadlineSeconds
2023-01-13 10:46:26 -08:00
kannon92
4890928b78 tracking with finalizers is the default way for the job controller 2023-01-13 16:48:35 +00:00
kannon92
3a838033f8 Update SyncJob with PodControllerError updates in job unit tests 2023-01-13 16:39:18 +00:00
Michal Wozniak
7065b42bb2 Fix the job controller unit test for enforcing ActiveDeadlineSeconds 2023-01-13 16:48:15 +01:00
Nikhita Raghunath
fd8d92a29d pkg/controller/job: re-honor exponential backoff
This commit makes the job controller re-honor exponential backoff for
failed pods. Before this commit, the controller created pods without any
backoff. This is a regression because the controller used to
create pods with an exponential backoff delay before (10s, 20s, 40s ...).

The issue occurs only when the JobTrackingWithFinalizers feature is
enabled (which is enabled by default right now). With this feature, we
get an extra pod update event when the finalizer of a failed pod is
removed.

Note that the pod failure detection and new pod creation happen in the
same reconcile loop so the 2nd pod is created immediately after the 1st
pod fails. The backoff is only applied on 2nd pod failure, which means
that the 3rd pod created 10s after the 2nd pod, 4th pod is created 20s
after the 3rd pod and so on.

This commit fixes a few bugs:

1. Right now, each time `uncounted != nil` and the job does not see a
_new_ failure, `forget` is set to true and the job is removed from the
queue. Which means that this condition is also triggered each time the
finalizer for a failed pod is removed and `NumRequeues` is reset, which
results in a backoff of 0s.

2. Updates `updatePod` to only apply backoff when we see a particular
pod failed for the first time. This is necessary to ensure that the
controller does not apply backoff when it sees a pod update event
for finalizer removal of a failed pod.

3. If `JobsReadyPods` feature is enabled and backoff is 0s, the job is
now enqueued after `podUpdateBatchPeriod` seconds, instead of 0s. The
unit test for this check also had a few bugs:
    - `DefaultJobBackOff` is overwritten to 0 in certain unit tests,
    which meant that `DefaultJobBackOff` was considered to be 0,
    effectively not running any meaningful checks.
    - `JobsReadyPods` was not enabled for test cases that ran tests
    which required the feature gate to be enabled.
    - The check for expected and actual backoff had incorrect
    calculations.
2023-01-12 20:34:10 +05:30
kannon92
6dfaeff33c Remove Legacy Job Tracking 2023-01-10 14:52:54 +00:00
Kubernetes Prow Robot
e7549eae87
Merge pull request #114905 from kannon92/sync-job-test-fix
Fix SyncPastDeadlineJobFinished for enabling finalizer path
2023-01-09 12:47:28 -08:00
kannon92
0362c67859 Fix SyncPastDeadlineJobFinished for enabling finalizer path 2023-01-09 17:12:52 +00:00
Aldo Culquicondor
4c1b95ddfa
Ensure job is up to date in informer cache in test
The fake client doesn't guarantee that the informer cache is updated.
If it's not up-to-date, the controller always tries to set the
StartTime, leading to a broken test.

Change-Id: I71f26d46ea44beff88f0d03517985348654aec95
2023-01-09 10:53:19 -05:00
Harsha Narayana
208c3868cf
job controller: refactored job controller to be able to inject FakeClock for Unit Test 2022-12-20 21:29:24 +05:30
ialidzhikov
aede3fbf40 pkg/controller: Replace deprecated func usage from the k8s.io/utils/pointer pkg 2022-11-23 17:40:23 +02:00
Aldo Culquicondor
7dc36bdf82
Wait for Pods to finish before considering Failed in Job (#113860)
* Wait for Pods to finish before considering Failed

Limit behavior to feature gates PodDisruptionConditions and
JobPodFailurePolicy and jobs with a podFailurePolicy.

Change-Id: I926391cc2521b389c8e52962afb0d4a6a845ab8f

* Remove check for unsheduled terminating pod

Change-Id: I3dc05bb4ea3738604f01bf8cb5fc8cc0f6ea54ec
2022-11-15 09:44:53 -08:00
Michal Wozniak
c803892bd8 Enable the feature into beta 2022-11-09 09:02:40 +01:00
Aldo Culquicondor
4948918155
Graduate JobTrackingWithFinalizers to stable
Change-Id: Ifc749a85b1270c0155ac511b91d4681d53236820
2022-11-04 17:05:53 -04:00
Michal Wozniak
bf9ce70de3 Support handling of pod failures with respect to the specified rules 2022-08-04 18:39:08 +02:00
Aldo Culquicondor
ca8cebe5ba Fix JobTrackingWithFinalizers when a pod succeeds after the job fails
Change-Id: I3be351fb3b53216948a37b1d58224f8fbbf22b47
2022-08-02 19:33:06 -04:00
Aldo Culquicondor
b492f49c9f Do not skip job requeue in conflict error
Change-Id: Ie97977887a1cc3de58922d73dce92ae1965965bf
2022-07-08 16:14:32 +00:00
Aldo Culquicondor
62a25920e6 Wait for cache sync in TestSyncPastDeadlineJobFinished
Change-Id: I6f023ca6999108f4f86a0f57831d47704cdbb42b
2022-06-24 09:22:59 -04:00
Aldo Culquicondor
817c8bbf59 Increase timeout for TestSyncPastDeadlineJobFinished
To mitigate flakiness

Change-Id: I1d0286d16d2b7dd3a605690e9a2d4d2f954701ff
2022-06-21 14:49:10 -04:00
Harsha Narayana
eea7dca085
GIT-110239: fix activeDeadlineSeconds enforcement bug
GIT-110239: add additional tests with preset Status.StartTime

GIT-110239: add additional tests with preset Status.StartTime
2022-06-13 20:06:44 +05:30
Kubernetes Prow Robot
6cd258f9f5
Merge pull request #110292 from mimowo/109904-avoid-duplicate-conditions
Avoid duplicate Failed conditions in job status
2022-06-09 14:01:45 -07:00
Michal Wozniak
e298649b6c Avoid duplicate conditions by updating the pre-existing failed condition
in case its status is False or Unknown.

In case the status of the pre-existing condition is true we ignore the new
condition. If there is no pre-existing failed condition, then append
the new failed condition as before.

Also, make the condition comparisons less hacky by ignoring timestamp fields
in tests.
2022-06-01 19:32:53 +02:00
Aldo Culquicondor
a5f5eab5fd Wait for cache to sync in job's TestWatchOrphanPods
Otherwise the event handler might not be called.

Change-Id: I23c93c2251b411430a0f2469686db6355d84af2f
2022-05-10 14:18:21 -04:00
Aldo Culquicondor
09caa36718 Fix removing finalizer from finished jobs
In some rare race conditions, the job controller might create new pods after the job is declared finished.

Change-Id: I8a00429c8845463259cd7f82bb3c241d0011583c
2022-04-20 16:39:10 -04:00
Aldo Culquicondor
53aa05df3a Don't mark job as failed until expectations are satisfied
Change-Id: I99206f35f6f145054c005ab362c792e71b9b15f4
2022-04-20 16:39:10 -04:00
Aldo Culquicondor
8c00f510ef Graduate JobReadyPods to beta
Set podUpdateBatchPeriod to 1s

Change-Id: I8a10fd8f8559adad9df179b664b8c82851607855
2022-03-29 10:07:41 -04:00
Aldo Culquicondor
2c5d0a273c Graduate IndexedJob to stable
- Lock feature gate to true and schedule for deletion in 1.26
- Remove checks on feature gate
- Graduate E2E test to Conformance

Change-Id: I6814819d318edaed5c86dae4055f4b050a4d39fd
2022-03-15 13:41:06 -04:00
Abdullah Gharaibeh
b2d2ec9e76 Graduate SuspendJob to GA 2022-02-15 10:46:13 -05:00
Mike Dame
80c01707e0
Wire contexts to Batch controllers (#105491)
* Wire contexts to Batch controllers

* (hold) feedback + updates that overlap with Apps controllers

* fixup errors
2021-11-10 14:56:46 -08:00
Aldo Culquicondor
60fc90967b Count ready pods in job controller
When the feature gate JobReadyPods is enabled.

Change-Id: I86f93914568de6a7029f9ae92ee7b749686fbf97
2021-10-19 15:18:37 -04:00
Aldo Culquicondor
4ef9d18abe Fix name for Pods of NonIndexed Jobs
Change-Id: I0ea4685a82f4cdec0caab362d52144476652f95a
2021-10-14 10:55:46 -04:00
Aldo Culquicondor
5929ccd391 Track expected removals of Pod finalizers
Add the UIDs of Pods for which we are removing finalizers to an in-memory cache.

The controller removes UIDs from the cache as Pod updates or deletes come in.

This avoids double counting finished Pods when Pod updates arrive after Job status updates.

https://github.com/kubernetes/kubernetes/issues/105200
2021-10-04 16:09:58 -04:00
Aldo Culquicondor
a438f16741 Revert "Revert "Add metric job_pod_finished""
This reverts commit 7868fbbe64.
2021-09-23 12:56:29 -04:00
Aldo Culquicondor
47a957d163 Revert "Revert "Limit number of Pods counted in a single Job sync""
This reverts commit 8bcb780808.
2021-09-23 12:56:29 -04:00
Aldo Culquicondor
eebd678cda Remove GET job and retries for status updates.
Doing a GET right before retrying has 2 problems:
- It can masquerade conflicts
- It adds an additional delay

As for retries, we are better of going through the sync backoff.

In the case of conflict, we know that there was a Job update that would trigger another sync, so there is no need to do a rate limited requeue.
2021-09-23 11:48:34 -04:00
Kubernetes Prow Robot
76c0573ff4
Merge pull request #105181 from alculquicondor/revert
Revert #104739
2021-09-21 16:54:00 -07:00
Aldo Culquicondor
7868fbbe64 Revert "Add metric job_pod_finished"
This reverts commit a0e7a567c5.
2021-09-21 15:16:54 -04:00
Aldo Culquicondor
8bcb780808 Revert "Limit number of Pods counted in a single Job sync"
This reverts commit 7d9cb88fed.
2021-09-21 15:16:50 -04:00
Kubernetes Prow Robot
f55101913f
Merge pull request #105098 from Karthik-K-N/fix-error-format
Fix incorrect format specifier in test files
2021-09-20 08:56:09 -07:00
Karthik K N
c651d50202 Fix incorrect format specifier in test files 2021-09-17 16:27:53 +05:30
Aldo Culquicondor
a0e7a567c5 Add metric job_pod_finished
To count the number of pods that the job controller successfully tracked with the JobTrackingWithFinalizers feature gate.
2021-09-15 11:19:47 -04:00
Aldo Culquicondor
7d9cb88fed Limit number of Pods counted in a single Job sync
This prevents big Jobs from starving smaller ones.
2021-09-10 10:32:04 -04:00
Aldo Culquicondor
23ea5d80d6 Fix Job tracking with finalizers for more than 500 pods
When doing partial updates for uncountedTerminatedPods, the controller might have removed UIDs for Pods which still had finalizers.

Also make more space by removing UIDs that don't have finalizers at the beginning of the sync.
2021-09-01 16:19:04 -04:00
Aldo Culquicondor
5e1b5ec398 Revert counting deleted pods as failures for Job
When JobTrackingWithFinalizers is disabled. To preserve existing behavior.

Change-Id: Id1752f96feed322911712fe9e918e91e42eca809
2021-07-14 10:03:20 -04:00
Aldo Culquicondor
2dd2622188 Track Job Pods completion in status
Through Job.status.uncountedPodUIDs and a Pod finalizer

An annotation marks if a job should be tracked with new behavior

A separate work queue is used to remove finalizers from orphan pods.

Change-Id: I1862e930257a9d1f7f1b2b0a526ed15bc8c248ad
2021-07-08 17:48:05 +00:00
Adhityaa Chandrasekar
ba708e5fc9 graduate SuspendJob to beta
Also adds a label to two existing Job metrics.

Signed-off-by: Adhityaa Chandrasekar <adtac@google.com>
2021-06-03 18:48:32 +00:00
Mengxue Zhang
e64e34e029 specify pod name and hostname in indexed job 2021-05-19 15:30:13 +00:00
Kubernetes Prow Robot
548fb43643
Merge pull request #101292 from AliceZhang2016/job_controller_metrics
Graduate indexed job to beta
2021-05-07 13:31:44 -07:00
Mengxue Zhang
2d2ee6bc3a change default feature gate value of IndexedJob 2021-04-30 14:36:15 +00:00
Mengxue Zhang
4cf7e75841 indexed job: remove pods with invalid index 2021-04-19 14:07:07 +00:00
Kubernetes Prow Robot
0172cbf56c
Merge pull request #99963 from alculquicondor/job_complete_active
Remove active pods past completions
2021-04-08 17:10:10 -07:00
Aldo Culquicondor
e6c3d7b34d Only default Job fields when feature gates are enabled
Also use pointer for completionMode enum
2021-03-12 20:46:52 +00:00
Aldo Culquicondor
4af432bab3 Remove active pods past completions 2021-03-10 14:55:40 +00:00
Aldo Culquicondor
8ae0ad2b2f Fix completed indexed job with repeated indexes 2021-03-09 19:22:45 +00:00
Adhityaa Chandrasekar
a0844da8f7 batch: add suspended job
Signed-off-by: Adhityaa Chandrasekar <adtac@google.com>
2021-03-08 20:08:21 +00:00
Kubernetes Prow Robot
170c6a9833
Merge pull request #99806 from alculquicondor/job-adoption-unit
Merge tests for getPodsForJob
2021-03-06 12:50:29 -08:00
Aldo Culquicondor
f0f9f1d540 Merge tests for getPodsForJob 2021-03-04 21:09:33 +00:00
Aldo Culquicondor
2dd0c73056 Test for removal of invalid and repeated indexes
in Indexed Job
2021-03-04 16:39:34 +00:00
Aldo Culquicondor
8812531b8c Add completion index to Job Pods
When .spec.completionMode="Indexed"
2021-03-03 22:45:53 +00:00
Aldo Culquicondor
609116b147 Test failed pod recreation
Change-Id: I31a2e667e9d96c385a921e25347ebeb5a8424e62
2021-02-01 13:20:03 -05:00
Aldo Culquicondor
dbf9e3b2d3 Make sync Job test tables more readable
And use t.Run to improve debugging experience

Change-Id: Ia91adbfe9c419cc640abe0efe287f5b9ab715e87
2021-01-27 16:56:41 -05:00
yodarshafrir1
24010022ef Number of failed jobs should exceed the backoff limit and not big equal.
Remove patch in e2e test of backoff limit due to usage of NumRequeues
2020-08-11 11:06:09 +03:00
yodarshafrir1
ca420ddada Fix job's backoff limit for restart policy Never, rely on number of failures instead of number of NumRequeues 2020-08-07 14:22:40 +03:00
Kubernetes Prow Robot
b17ddac4df
Merge pull request #78944 from avorima/golint_fix_job
Fix golint errors in pkg/controller/job
2020-04-12 21:57:47 -07:00
taesun_lee
79680b5d9b Fix pkg/controller typos in some error messages, comments etc
- applied review results by LuisSanchez
- Co-Authored-By: Luis Sanchez <sanchezl@redhat.com>

genernal -> general
iniital -> initial
initalObjects -> initialObjects
intentionaly -> intentionally
inforer -> informer
anotother -> another
triger -> trigger
mutli -> multi
Verifyies -> Verifies
valume -> volume
unexpect -> unexpected
unfulfiled -> unfulfilled
implenets -> implements
assignement -> assignment
expectataions -> expectations
nexpected -> unexpected
boundSatsified -> boundSatisfied
externel -> external
calcuates -> calculates
workes -> workers
unitialized -> uninitialized
afater -> after
Espected -> Expected
nodeMontiorGracePeriod -> NodeMonitorGracePeriod
estimateGrracefulTermination -> estimateGracefulTermination
secondrary -> secondary
ShouldRunDaemonPodOnUnscheduableNode -> ShouldRunDaemonPodOnUnschedulableNode
rrror -> error
expectatitons -> expectations
foud -> found
epackage -> package
succesfulJobs -> successfulJobs
namesapce -> namespace
ConfigMapResynce -> ConfigMapResync
2020-02-27 00:15:33 +09:00
David Xia
fabfd950b1
cleanup: fix some log and error capitalizations
Part of https://github.com/kubernetes/kubernetes/issues/15863
2019-07-20 18:26:16 -04:00
Mario Valderrama
dbbe68601f Fix golint errors in pkg/controller/job 2019-06-12 20:09:57 +02:00
Fei Xu
9feb0df370 Add pending status for pastBackoffLimitOnFailure 2019-05-21 09:45:29 +08:00
goodluckbot
53c3e103d1 Fix pastBackoffLimitOnFailure when backoffLimit is zero 2018-10-11 17:29:11 +08:00
Kubernetes Submit Queue
65819a8f92
Merge pull request #63744 from krmayankk/changelog
Automatic merge from submit-queue (batch tested with PRs 63580, 63744, 64541, 64502, 64100). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

remove redundant getKey functions from controller tests

```release-note
None
```
2018-06-20 01:27:32 -07:00
Maciej Szulik
d80ed537e5
Rate limit only when an actual error happens, not on update conflicts 2018-06-05 22:53:09 +02:00
Mayank Kumar
a1cd3a4bcc remove redundant getKey functions from tests 2018-05-30 22:15:06 -07:00
David Eads
94e3d94d67 update tests to be specific about the versions they are testing instead of floating 2018-05-01 13:18:41 -04:00
David Eads
a89291a5de stop duplicating preferred version order 2018-04-26 10:03:36 -04:00
Kubernetes Submit Queue
139309f798
Merge pull request #58972 from soltysh/issue54870
Automatic merge from submit-queue (batch tested with PRs 61962, 58972, 62509, 62606). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Fix job's backoff limit for restart policy OnFailure

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #54870

**Release note**:
```release-note
NONE
```

/assign janetkuo
2018-04-19 16:47:18 -07:00
yue9944882
c9962b9644 fixes failing job back off test 2018-04-12 15:58:09 +08:00
Maciej Szulik
5ff7e977bc
Fix job's backoff limit for restart policy OnFailure 2018-03-19 17:40:29 +01:00
Maciej Szulik
1266252dc2
Backoff only when failed pod shows up 2018-03-14 11:49:13 +01:00
cedric lamoriniere
c6e8bd62ad Improves backoff policy in JobController
issues: https://github.com/kubernetes/kubernetes/issues/56853

Add check if the number of pods succeeded increased since the last
check. If yes the backoff delay is cleared. This logic improves the Job
backoff policy when parallelism > 1 and few pods's Job failed but others
succeed.
2018-02-22 10:24:23 +01:00
Maciej Szulik
f760e00af7
Add job controller test verifying if backoff is reseted on success 2017-12-01 15:14:58 +01:00