Commit Graph

166 Commits

Author SHA1 Message Date
kannon92
6dfaeff33c Remove Legacy Job Tracking 2023-01-10 14:52:54 +00:00
Harsha Narayana
208c3868cf
job controller: refactored job controller to be able to inject FakeClock for Unit Test 2022-12-20 21:29:24 +05:30
Aldo Culquicondor
7dc36bdf82
Wait for Pods to finish before considering Failed in Job (#113860)
* Wait for Pods to finish before considering Failed

Limit behavior to feature gates PodDisruptionConditions and
JobPodFailurePolicy and jobs with a podFailurePolicy.

Change-Id: I926391cc2521b389c8e52962afb0d4a6a845ab8f

* Remove check for unscheduled terminating pod

Change-Id: I3dc05bb4ea3738604f01bf8cb5fc8cc0f6ea54ec
2022-11-15 09:44:53 -08:00
Michal Wozniak
c803892bd8 Enable the feature into beta 2022-11-09 09:02:40 +01:00
Aldo Culquicondor
4948918155
Graduate JobTrackingWithFinalizers to stable
Change-Id: Ifc749a85b1270c0155ac511b91d4681d53236820
2022-11-04 17:05:53 -04:00
Aldo Culquicondor
5e03865f65
Add benchmark for large indexed job
Change-Id: I556f0cce5842699c98654cfb5a66e7c8d63b2e2e
2022-11-02 11:56:26 -04:00
Michał Woźniak
3628532311
Extend metrics with the new labels (#113324)
* Extend job metrics

* Refactor TestMetrics to extract its checks into dedicated tests per feature
2022-10-31 08:50:45 -07:00
Aldo Culquicondor
12d308f5c4 Add metric for terminated pods with tracking finalizer
Change-Id: I26f3169588c30ed82250cb7baff8e277f8d13bb7
2022-10-20 11:35:20 -04:00
Kubernetes Prow Robot
28ced69b76
Merge pull request #113054 from logicalhan/proxy-metric
remove rate limiter metric as it is not in use
2022-10-17 11:09:18 -07:00
Han Kang
2bbd445f50 remove rate limiter metric as it is not in use
Change-Id: I91157653e3860eeecc3f572aee88da6ffc65faed
2022-10-13 13:07:11 -07:00
Michal Wozniak
b64e5b2d15 Fix the occasional double-counting job_finished_total metric
The reason for the issue is that the metrics were bumped before the
final job status update. In case the update failed the path was
repeated by the next syncJob leading to double-counting of the metrics.

The solution is to delay recording metrics and broadcasting events
after the job status update succeeds.
2022-10-13 17:23:03 +02:00
Michal Wozniak
bf9ce70de3 Support handling of pod failures with respect to the specified rules 2022-08-04 18:39:08 +02:00
Aldo Culquicondor
ca8cebe5ba Fix JobTrackingWithFinalizers when a pod succeeds after the job fails
Change-Id: I3be351fb3b53216948a37b1d58224f8fbbf22b47
2022-08-02 19:33:06 -04:00
Davanum Srinivas
a9593d634c
Generate and format files
- Run hack/update-codegen.sh
- Run hack/update-generated-device-plugin.sh
- Run hack/update-generated-protobuf.sh
- Run hack/update-generated-runtime.sh
- Run hack/update-generated-swagger-docs.sh
- Run hack/update-openapi-spec.sh
- Run hack/update-gofmt.sh

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2022-07-26 13:14:05 -04:00
Aldo Culquicondor
b492f49c9f Do not skip job requeue in conflict error
Change-Id: Ie97977887a1cc3de58922d73dce92ae1965965bf
2022-07-08 16:14:32 +00:00
Harsha Narayana
eea7dca085
GIT-110239: fix activeDeadlineSeconds enforcement bug
GIT-110239: add additional tests with preset Status.StartTime
2022-06-13 20:06:44 +05:30
Kubernetes Prow Robot
6cd258f9f5
Merge pull request #110292 from mimowo/109904-avoid-duplicate-conditions
Avoid duplicate Failed conditions in job status
2022-06-09 14:01:45 -07:00
Michal Wozniak
e298649b6c Avoid duplicate conditions by updating the pre-existing failed condition
in case its status is False or Unknown.

In case the status of the pre-existing condition is true we ignore the new
condition. If there is no pre-existing failed condition, then append
the new failed condition as before.

Also, make the condition comparisons less hacky by ignoring timestamp fields
in tests.
2022-06-01 19:32:53 +02:00
Wojciech Tyczyński
11b679c66a Fix event broadcaster shutdown in multiple controllers 2022-05-17 22:14:19 +02:00
Aldo Culquicondor
09caa36718 Fix removing finalizer from finished jobs
In some rare race conditions, the job controller might create new pods after the job is declared finished.

Change-Id: I8a00429c8845463259cd7f82bb3c241d0011583c
2022-04-20 16:39:10 -04:00
Aldo Culquicondor
53aa05df3a Don't mark job as failed until expectations are satisfied
Change-Id: I99206f35f6f145054c005ab362c792e71b9b15f4
2022-04-20 16:39:10 -04:00
Aldo Culquicondor
8c00f510ef Graduate JobReadyPods to beta
Set podUpdateBatchPeriod to 1s

Change-Id: I8a10fd8f8559adad9df179b664b8c82851607855
2022-03-29 10:07:41 -04:00
Aldo Culquicondor
8776931abb Remove finalizer when orphaned
Change-Id: Id88a28755660812a274dffab2693cb8a0ef4235c
2022-03-24 11:57:51 -04:00
Aldo Culquicondor
211e33d93f Fix: Clean job tracking finalizer from orphan pods
Change-Id: I04cd70725fd1830be8daf2dca53f67bc10a379b7
2022-03-24 11:57:51 -04:00
Aldo Culquicondor
2c5d0a273c Graduate IndexedJob to stable
- Lock feature gate to true and schedule for deletion in 1.26
- Remove checks on feature gate
- Graduate E2E test to Conformance

Change-Id: I6814819d318edaed5c86dae4055f4b050a4d39fd
2022-03-15 13:41:06 -04:00
Abdullah Gharaibeh
b2d2ec9e76 Graduate SuspendJob to GA 2022-02-15 10:46:13 -05:00
Mike Dame
80c01707e0
Wire contexts to Batch controllers (#105491)
* Wire contexts to Batch controllers

* (hold) feedback + updates that overlap with Apps controllers

* fixup errors
2021-11-10 14:56:46 -08:00
Kubernetes Prow Robot
8e37a3b324
Merge pull request #103868 from qingsenLi/210723-forget
Merge conditional assignment into variable declaration
2021-10-28 16:32:50 -07:00
Aldo Culquicondor
60fc90967b Count ready pods in job controller
When the feature gate JobReadyPods is enabled.

Change-Id: I86f93914568de6a7029f9ae92ee7b749686fbf97
2021-10-19 15:18:37 -04:00
Kubernetes Prow Robot
0bfa37dfcc
Merge pull request #105676 from alculquicondor/job-name
Fix name for Pods of NonIndexed Jobs
2021-10-14 10:50:12 -07:00
Aldo Culquicondor
4ef9d18abe Fix name for Pods of NonIndexed Jobs
Change-Id: I0ea4685a82f4cdec0caab362d52144476652f95a
2021-10-14 10:55:46 -04:00
Kubernetes Prow Robot
f27e4714ba
Merge pull request #105377 from damemi/wire-contexts-apps
Wire contexts to Apps controllers
2021-10-14 06:59:19 -07:00
Mike Dame
41fcb95f2f Wire contexts to Apps controllers 2021-10-13 16:32:13 -04:00
Aldo Culquicondor
5929ccd391 Track expected removals of Pod finalizers
Add the UIDs of Pods for which we are removing finalizers to an in-memory cache.

The controller removes UIDs from the cache as Pod updates or deletes come in.

This avoids double counting finished Pods when Pod updates arrive after Job status updates.

https://github.com/kubernetes/kubernetes/issues/105200
2021-10-04 16:09:58 -04:00
Aldo Culquicondor
95c2a8024c Parallelize pod updates in job test
To potentially reduce the number of job controller syncs.

Also reduce the maximum number of pods to sync in tests.
2021-10-01 09:55:53 -04:00
Aldo Culquicondor
a438f16741 Revert "Revert "Add metric job_pod_finished""
This reverts commit 7868fbbe64.
2021-09-23 12:56:29 -04:00
Aldo Culquicondor
47a957d163 Revert "Revert "Limit number of Pods counted in a single Job sync""
This reverts commit 8bcb780808.
2021-09-23 12:56:29 -04:00
Aldo Culquicondor
01f27cd93e Fix log line for target number of running pods 2021-09-23 12:56:29 -04:00
Aldo Culquicondor
eebd678cda Remove GET job and retries for status updates.
Doing a GET right before retrying has 2 problems:
- It can mask conflicts
- It adds an additional delay

As for retries, we are better off going through the sync backoff.

In the case of conflict, we know that there was a Job update that would trigger another sync, so there is no need to do a rate limited requeue.
2021-09-23 11:48:34 -04:00
Aldo Culquicondor
7868fbbe64 Revert "Add metric job_pod_finished"
This reverts commit a0e7a567c5.
2021-09-21 15:16:54 -04:00
Aldo Culquicondor
8bcb780808 Revert "Limit number of Pods counted in a single Job sync"
This reverts commit 7d9cb88fed.
2021-09-21 15:16:50 -04:00
Aldo Culquicondor
a0e7a567c5 Add metric job_pod_finished
To count the number of pods that the job controller successfully tracked with the JobTrackingWithFinalizers feature gate.
2021-09-15 11:19:47 -04:00
Aldo Culquicondor
7d9cb88fed Limit number of Pods counted in a single Job sync
This prevents big Jobs from starving smaller ones.
2021-09-10 10:32:04 -04:00
Aldo Culquicondor
23ea5d80d6 Fix Job tracking with finalizers for more than 500 pods
When doing partial updates for uncountedTerminatedPods, the controller might have removed UIDs for Pods which still had finalizers.

Also make more space by removing UIDs that don't have finalizers at the beginning of the sync.
2021-09-01 16:19:04 -04:00
10177505
2740965dc9 Merge conditional assignment into variable declaration 2021-07-23 17:02:19 +08:00
Aldo Culquicondor
5e1b5ec398 Revert counting deleted pods as failures for Job
When JobTrackingWithFinalizers is disabled. To preserve existing behavior.

Change-Id: Id1752f96feed322911712fe9e918e91e42eca809
2021-07-14 10:03:20 -04:00
Aldo Culquicondor
2dd2622188 Track Job Pods completion in status
Through Job.status.uncountedPodUIDs and a Pod finalizer

An annotation marks if a job should be tracked with new behavior

A separate work queue is used to remove finalizers from orphan pods.

Change-Id: I1862e930257a9d1f7f1b2b0a526ed15bc8c248ad
2021-07-08 17:48:05 +00:00
Adhityaa Chandrasekar
ba708e5fc9 graduate SuspendJob to beta
Also adds a label to two existing Job metrics.

Signed-off-by: Adhityaa Chandrasekar <adtac@google.com>
2021-06-03 18:48:32 +00:00
Aldo Culquicondor
d8aad7944c Remove unused util CreatePods
And rename CreatePodsWithControllerRef to simply CreatePods
2021-05-20 20:27:21 +00:00
Mengxue Zhang
e64e34e029 specify pod name and hostname in indexed job 2021-05-19 15:30:13 +00:00
Kubernetes Prow Robot
548fb43643
Merge pull request #101292 from AliceZhang2016/job_controller_metrics
Graduate indexed job to beta
2021-05-07 13:31:44 -07:00
Mengxue Zhang
5fd4ab3dc3 add pod create/delete operation limitations per job sync 2021-04-27 18:51:38 +00:00
Mengxue Zhang
cda503fcc9 indexed job: add three metrics to job controller 2021-04-27 18:32:53 +00:00
Mengxue Zhang
4cf7e75841 indexed job: remove pods with invalid index 2021-04-19 14:07:07 +00:00
Kubernetes Prow Robot
0172cbf56c
Merge pull request #99963 from alculquicondor/job_complete_active
Remove active pods past completions
2021-04-08 17:10:10 -07:00
Adhityaa Chandrasekar
0a21157c96 job controller: don't mutate shared cache object
Signed-off-by: Adhityaa Chandrasekar <adtac@google.com>
2021-03-25 06:36:15 +00:00
Aldo Culquicondor
e6c3d7b34d Only default Job fields when feature gates are enabled
Also use pointer for completionMode enum
2021-03-12 20:46:52 +00:00
Aldo Culquicondor
4af432bab3 Remove active pods past completions 2021-03-10 14:55:40 +00:00
Aldo Culquicondor
8ae0ad2b2f Fix completed indexed job with repeated indexes 2021-03-09 19:22:45 +00:00
Adhityaa Chandrasekar
a0844da8f7 batch: add suspended job
Signed-off-by: Adhityaa Chandrasekar <adtac@google.com>
2021-03-08 20:08:21 +00:00
Aldo Culquicondor
8812531b8c Add completion index to Job Pods
When .spec.completionMode="Indexed"
2021-03-03 22:45:53 +00:00
KeZhang
67b40a50c6 Optimize log output 2020-12-08 11:20:24 +08:00
yodarshafrir1
24010022ef Number of failed jobs should exceed the backoff limit, not be greater than or equal to it.
Remove patch in e2e test of backoff limit due to usage of NumRequeues
2020-08-11 11:06:09 +03:00
yodarshafrir1
ca420ddada Fix job's backoff limit for restart policy Never: rely on the number of failures instead of NumRequeues 2020-08-07 14:22:40 +03:00
Kubernetes Prow Robot
00d6255f44
Merge pull request #91712 from KobayashiD27/structured-logging-in-event
Migrate log to klog.InfoS for staging/src/k8s.io/client-go
2020-06-22 23:53:40 -07:00
Kubernetes Prow Robot
be31023a95
Merge pull request #87155 from kolorful/patch-3
Fix a comment in job_controller
2020-06-19 08:51:58 -07:00
Kobayashi Daisuke
4ae11dac2e Replace StartLogging(klog.Infof) with StartStructuredLogging(0) 2020-06-15 17:48:35 +09:00
KeZhang
884f94ad92 Do not swallow NotFound error for DeletePod in dsc.manage 2020-06-04 16:41:38 +08:00
Zhou Peng
bc9bff0d9e [pkg/controller/job]: fix comment typo
Signed-off-by: Zhou Peng <p@ctriple.cn>
2020-05-30 23:09:10 +08:00
Davanum Srinivas
442a69c3bd
switch over k/k to use klog v2
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2020-05-16 07:54:27 -04:00
Kubernetes Prow Robot
b17ddac4df
Merge pull request #78944 from avorima/golint_fix_job
Fix golint errors in pkg/controller/job
2020-04-12 21:57:47 -07:00
Mike Danese
25651408ae generated: run refactor 2020-02-08 12:30:21 -05:00
Mike Danese
3aa59f7f30 generated: run refactor 2020-02-07 18:16:47 -08:00
Kubernetes Prow Robot
e4926e2d70
Merge pull request #85421 from terrytangyuan/patch-1
Fix grammar: have -> has
2020-01-22 08:40:58 -08:00
Kewei Ma
34fce9faee
Fix a comment in job_controller 2020-01-13 10:09:06 -06:00
Kubernetes Prow Robot
42fe74cd2c
Merge pull request #86142 from raz-bn/add-complete-event
Adding new job completed event
2019-12-16 23:43:58 -08:00
raz-bn
0224c48120 Job completed event added 2019-12-16 21:41:15 +00:00
Ted Yu
9cff345770 Do not swallow timeout in manageReplicas 2019-12-12 11:27:36 -08:00
Yuan Tang
dd308ca576
Fix grammar: have -> has 2019-11-18 11:17:58 -05:00
Clayton Coleman
c6e34e58c5
job: Ignore namespace termination errors when creating pods or jobs
Instead of reporting an event or displaying an error, simply exit
when the namespace is being terminated. This reduces the amount of
controller churn on namespace shutdown. While we could technically
exit the entire processing loop early for very large jobs,
we should wait for more evidence that this is an issue before changing
that logic substantially.
2019-10-20 18:39:01 -04:00
Yassine TIJANI
c1487840bc move util/metrics to component-base
Signed-off-by: Yassine TIJANI <ytijani@vmware.com>
2019-10-08 14:42:31 +02:00
Yassine TIJANI
7e4c3096fe move WaitForCacheSync to the sharedInformer package
Signed-off-by: Yassine TIJANI <ytijani@vmware.com>
2019-08-22 16:13:41 +01:00
Ted Yu
898f099346 Skip unnecessary operations if diff is less than 0 2019-07-17 14:03:08 -07:00
Mario Valderrama
6ac7421535 Update comments 2019-06-14 14:23:13 +02:00
Mario Valderrama
dbbe68601f Fix golint errors in pkg/controller/job 2019-06-12 20:09:57 +02:00
Fei Xu
9feb0df370 Add pending status for pastBackoffLimitOnFailure 2019-05-21 09:45:29 +08:00
Andrew Kim
0bc5508aca replace client-go/util/integer with k8s.io/utils/integer 2019-01-24 15:34:21 -05:00
Davanum Srinivas
954996e231
Move from glog to klog
- Move from the old github.com/golang/glog to k8s.io/klog
- klog has explicit InitFlags() so we add them as necessary
- we update the other repositories that we vendor that made a similar
change from glog to klog
  * github.com/kubernetes/repo-infra
  * k8s.io/gengo/
  * k8s.io/kube-openapi/
  * github.com/google/cadvisor
- Entirely remove all references to glog
- Fix some tests by explicit InitFlags in their init() methods

Change-Id: I92db545ff36fcec83afe98f550c9e630098b3135
2018-11-10 07:50:31 -05:00
k8s-ci-robot
e6c5fb4666
Merge pull request #67859 from goodluckbot/job-controller-backoffLimit
Fix pastBackoffLimitOnFailure in job controller
2018-10-11 05:49:30 -07:00
goodluckbot
53c3e103d1 Fix pastBackoffLimitOnFailure when backoffLimit is zero 2018-10-11 17:29:11 +08:00
Kubernetes Submit Queue
d744c6ea61
Merge pull request #66085 from liggitt/updatejob
fix updateJob scheduling of resync

fixes #66071 

2018-08-27 17:40:54 -07:00
Da K. Ma
a56121c191 Removed unused functions.
Signed-off-by: Da K. Ma <klaus1982.cn@gmail.com>
2018-07-22 20:56:53 +08:00
Jordan Liggitt
6d6842da0b
fix updateJob scheduling of resync 2018-07-11 17:10:10 -04:00
Maciej Szulik
d80ed537e5
Rate limit only when an actual error happens, not on update conflicts 2018-06-05 22:53:09 +02:00
Maciej Szulik
5df2755399
Never clean backoff in job controller 2018-06-04 19:28:58 +02:00
Kubernetes Submit Queue
7eb88f11d2
Merge pull request #59727 from wgliang/master.time
should use time.Since instead of time.Now().Sub

2018-05-10 20:29:40 -07:00
Kubernetes Submit Queue
139309f798
Merge pull request #58972 from soltysh/issue54870
Fix job's backoff limit for restart policy OnFailure

Fixes #54870

2018-04-19 16:47:18 -07:00
Wang Guoliang
89669283fe should use time.Since instead of time.Now().Sub 2018-04-10 12:05:51 +08:00
Mikhail Mazurskiy
468655b76a
Use typed events client directly 2018-04-01 18:57:29 +10:00
Maciej Szulik
5ff7e977bc
Fix job's backoff limit for restart policy OnFailure 2018-03-19 17:40:29 +01:00