kannon92
74fcf3e766
implementation of PodReplacementPolicy KEP in the job controller
2023-07-21 00:44:53 +00:00
Michal Wozniak
35d0af9243
Include ignored pods when computing backoff delay for Job pod failures
2023-07-19 17:39:58 +02:00
Michał Woźniak
a15c27661e
Job controller implementation of backoff limit per index (#118009)
2023-07-18 13:44:11 -07:00
Kubernetes Prow Robot
84a999923f
Merge pull request #119335 from mimowo/use-final-diff-for-job-pod-creation
...
Ensure final diff is used for setting expectations for Job pod creation
2023-07-14 15:20:54 -07:00
Kubernetes Prow Robot
6f3856f953
Merge pull request #118883 from danielvegamyhre/kep-4017-job
...
Add completion index as pod label for indexed jobs
2023-07-14 12:23:50 -07:00
Michal Wozniak
9564bdc39d
Ensure final diff is used for setting expectations for Job pod creation
2023-07-14 19:09:39 +02:00
Daniel Vega-Myhre
037091284e
fix unit test bug
2023-07-13 22:38:21 +00:00
Daniel Vega-Myhre
a1a5f49bb9
remove statefulset label added to wrong branch
2023-07-13 21:07:17 +00:00
Daniel Vega-Myhre
1ae60c0ed1
use job completion index annotation as label
2023-07-13 21:04:37 +00:00
Michal Wozniak
7e3b53042b
Pass Job context down to firstPendingIndexes
2023-07-13 16:11:06 +02:00
Patrick Ohly
7d064812bb
kube-controller-manager: finish conversion to contextual logging
...
This removes all exceptions and fixes the remaining unconverted log calls.
2023-07-12 14:57:29 +02:00
Michal Wozniak
bf48165232
Remarks to syncJobCtx
2023-07-11 09:44:08 +02:00
Michal Wozniak
990339d4c3
Introduce syncJobContext to limit the number of function parameters
2023-07-11 09:27:21 +02:00
Daniel Vega-Myhre
98c6e25c37
update name of pod index label
2023-07-10 20:11:52 +00:00
Aldo Culquicondor
f7a1fb76f4
Only declare job as finished after removing all finalizers
...
Change-Id: Id4b01b0e6fabe24134e57e687356e0fc613cead4
2023-07-07 14:08:19 -04:00
Daniel Vega-Myhre
3a02ecb341
check test case param instead of feature flag in unit test code
2023-07-06 17:30:40 +00:00
kannon92
921b7e6e8f
remove equalReady and replace with k8s util function
2023-07-05 20:11:48 +00:00
Daniel Vega-Myhre
a647f9febb
default enabled pod index for test cases, add test case disabling it
2023-07-05 18:47:45 +00:00
Daniel Vega-Myhre
e0af0a5a45
add test case param for feature flag
2023-06-29 21:51:15 +00:00
Daniel Vega-Myhre
a9afaa1eee
add feature gate
2023-06-27 18:07:17 +00:00
Daniel Vega-Myhre
2176053415
add completion index as pod label
2023-06-26 19:53:14 +00:00
Michal Wozniak
8ed23558b4
Do not set jm.syncJobBatchPeriod=0 if not needed
2023-06-22 11:10:53 +02:00
Michal Wozniak
784a309b91
Do not error in Job controller sync when there are pod failures
2023-06-20 11:31:24 +02:00
Michal Wozniak
74c5ff97f1
Lower the constants for the rate limiter in Job controller
2023-06-16 17:00:04 +02:00
Michal Wozniak
c51a422d78
Cleanup job controller handling of backoff
2023-06-16 14:53:27 +02:00
Ziqi Zhao
7bc449d7e0
add contextual logging to job-controller
...
Signed-off-by: Ziqi Zhao <zhaoziqi9146@gmail.com>
2023-06-14 13:40:02 +08:00
Michal Wozniak
2f6b1d3c0f
Ensure Job sync invocations are batched by 1s periods
2023-06-07 17:32:46 +02:00
Michal Wozniak
71ab7dc791
Remarks
2023-06-05 10:48:32 +02:00
Michal Wozniak
70d3bb43e5
Adjust the algorithm for computing the pod finish time
...
Change-Id: Ic282a57169cab8dc498574f08b081914218a1039
2023-06-05 10:06:56 +02:00
Michal Wozniak
0fe27a06f9
Cleanup the Job controller handling of terminating pods
2023-05-19 09:52:08 +02:00
Yuki Iwai
e4340f0d9b
Job: Use generic Set in controller
...
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-05-08 15:02:23 +09:00
Sathyanarayanan Saravanamuthu
c84c8add70
Decouple batch/job back-off logic from workqueues (#114768)
...
* batch/job: decouple backoff from workqueue
Signed-off-by: Sathyanarayanan Saravanamuthu <sathyanarays@vmware.com>
* Resolving review comments
* Resolving more review comments
* Resolving review comments
Signed-off-by: Sathyanarayanan Saravanamuthu <sathyanarays@vmware.com>
* Computing finish time to now when FinishedAt is unix epoch
* Addressing review comments
Signed-off-by: Sathyanarayanan Saravanamuthu <sathyanarays@vmware.com>
---------
Signed-off-by: Sathyanarayanan Saravanamuthu <sathyanarays@vmware.com>
2023-03-16 10:15:21 -07:00
Kubernetes Prow Robot
ccba890df9
Merge pull request #114420 from bzsuni/bz/optimization
...
Cleanup: fix variable names in comments
2023-03-09 21:33:37 -08:00
Daniel Vega-Myhre
d41302312e
update validation logic so completions is mutable iff it is modified in tandem with parallelism so that completions == parallelism
2023-02-23 03:25:16 +00:00
kannon92
32ac4a9581
left over uncounted from tracking cleanup
2023-02-22 16:45:53 +00:00
Daniel Vega-Myhre
2a81337e7c
update prev succeeded indexes for indexed jobs unconditionally
2023-01-31 19:15:53 +00:00
Kubernetes Prow Robot
5550064bc2
Merge pull request #115063 from kannon92/tracking-remove-comments
...
tracking with finalizers is the default way for the job controller so comments are not needed that say we are tracking with finalizers
2023-01-17 07:56:44 -08:00
Kubernetes Prow Robot
9af5ae0365
Merge pull request #115030 from kannon92/remove-pod-error-job-tracking
...
Update SyncJob with PodControllerError updates in job unit tests
2023-01-13 12:08:14 -08:00
Kubernetes Prow Robot
70217a4083
Merge pull request #114944 from mimowo/fix-active-deadline-test
...
Fix the job controller unit test for enforcing ActiveDeadlineSeconds
2023-01-13 10:46:26 -08:00
kannon92
4890928b78
tracking with finalizers is the default way for the job controller
2023-01-13 16:48:35 +00:00
kannon92
3a838033f8
Update SyncJob with PodControllerError updates in job unit tests
2023-01-13 16:39:18 +00:00
Michal Wozniak
7065b42bb2
Fix the job controller unit test for enforcing ActiveDeadlineSeconds
2023-01-13 16:48:15 +01:00
Nikhita Raghunath
fd8d92a29d
pkg/controller/job: re-honor exponential backoff
...
This commit makes the job controller re-honor exponential backoff for
failed pods. Before this commit, the controller created pods without any
backoff. This is a regression because the controller used to
create pods with an exponential backoff delay before (10s, 20s, 40s ...).
The issue occurs only when the JobTrackingWithFinalizers feature is
enabled (which is enabled by default right now). With this feature, we
get an extra pod update event when the finalizer of a failed pod is
removed.
Note that the pod failure detection and new pod creation happen in the
same reconcile loop so the 2nd pod is created immediately after the 1st
pod fails. The backoff is only applied on 2nd pod failure, which means
that the 3rd pod is created 10s after the 2nd pod, the 4th pod is created 20s
after the 3rd pod and so on.
This commit fixes a few bugs:
1. Right now, each time `uncounted != nil` and the job does not see a
_new_ failure, `forget` is set to true and the job is removed from the
queue. Which means that this condition is also triggered each time the
finalizer for a failed pod is removed and `NumRequeues` is reset, which
results in a backoff of 0s.
2. Updates `updatePod` to only apply backoff when we see a particular
pod failed for the first time. This is necessary to ensure that the
controller does not apply backoff when it sees a pod update event
for finalizer removal of a failed pod.
3. If `JobsReadyPods` feature is enabled and backoff is 0s, the job is
now enqueued after `podUpdateBatchPeriod` seconds, instead of 0s. The
unit test for this check also had a few bugs:
- `DefaultJobBackOff` is overwritten to 0 in certain unit tests,
which meant that `DefaultJobBackOff` was considered to be 0,
effectively not running any meaningful checks.
- `JobsReadyPods` was not enabled for test cases that ran tests
which required the feature gate to be enabled.
- The check for expected and actual backoff had incorrect
calculations.
2023-01-12 20:34:10 +05:30
kannon92
6dfaeff33c
Remove Legacy Job Tracking
2023-01-10 14:52:54 +00:00
Kubernetes Prow Robot
e7549eae87
Merge pull request #114905 from kannon92/sync-job-test-fix
...
Fix SyncPastDeadlineJobFinished for enabling finalizer path
2023-01-09 12:47:28 -08:00
kannon92
0362c67859
Fix SyncPastDeadlineJobFinished for enabling finalizer path
2023-01-09 17:12:52 +00:00
Aldo Culquicondor
4c1b95ddfa
Ensure job is up to date in informer cache in test
...
The fake client doesn't guarantee that the informer cache is updated.
If it's not up-to-date, the controller always tries to set the
StartTime, leading to a broken test.
Change-Id: I71f26d46ea44beff88f0d03517985348654aec95
2023-01-09 10:53:19 -05:00
Harsha Narayana
208c3868cf
job controller: refactored job controller to be able to inject FakeClock for Unit Test
2022-12-20 21:29:24 +05:30
bzsuni
16fcb1c708
optimise some code
2022-12-13 09:56:36 +08:00
Kubernetes Prow Robot
0cd13e573c
Merge pull request #113196 from mimowo/job-controller-reviewer
...
Self-nominate mimowo as a reviewer for pkg/controller/job & test/integration/job packages
2022-12-10 02:01:39 -08:00
ialidzhikov
aede3fbf40
pkg/controller: Replace deprecated func usage from the k8s.io/utils/pointer pkg
2022-11-23 17:40:23 +02:00
Aldo Culquicondor
7dc36bdf82
Wait for Pods to finish before considering Failed in Job (#113860)
...
* Wait for Pods to finish before considering Failed
Limit behavior to feature gates PodDisruptionConditions and
JobPodFailurePolicy and jobs with a podFailurePolicy.
Change-Id: I926391cc2521b389c8e52962afb0d4a6a845ab8f
* Remove check for unscheduled terminating pod
Change-Id: I3dc05bb4ea3738604f01bf8cb5fc8cc0f6ea54ec
2022-11-15 09:44:53 -08:00
Aldo Culquicondor
bc5afaf580
Fix match onExitCodes when Pod is not terminated
...
Change-Id: Id1f9c46f8b6a12115577a1fadb12adc580c9ba6a
2022-11-11 10:05:11 -05:00
Michal Wozniak
c803892bd8
Enable the feature into beta
2022-11-09 09:02:40 +01:00
Maciej Szulik
39d9981dc2
Promote job-related metrics to stable
2022-11-07 19:28:40 +01:00
Aldo Culquicondor
4948918155
Graduate JobTrackingWithFinalizers to stable
...
Change-Id: Ifc749a85b1270c0155ac511b91d4681d53236820
2022-11-04 17:05:53 -04:00
Aldo Culquicondor
5e03865f65
Add benchmark for large indexed job
...
Change-Id: I556f0cce5842699c98654cfb5a66e7c8d63b2e2e
2022-11-02 11:56:26 -04:00
Michał Woźniak
3628532311
Extend metrics with the new labels (#113324)
...
* Extend job metrics
* Refactor TestMetrics to extract its checks into dedicated tests per feature
2022-10-31 08:50:45 -07:00
Aldo Culquicondor
12d308f5c4
Add metric for terminated pods with tracking finalizer
...
Change-Id: I26f3169588c30ed82250cb7baff8e277f8d13bb7
2022-10-20 11:35:20 -04:00
Michal Wozniak
b1e575aaf7
Self-nominate mimowo as a reviewer for pkg/controller/job & test/integration/job
...
I think I'm ready to start review and LGTM code changes within this
package, but not necessarily for the entire sig-apps.
My PRs to the packages:
https://github.com/kubernetes/kubernetes/pull/110292
https://github.com/kubernetes/kubernetes/pull/111113
https://github.com/kubernetes/kubernetes/pull/112948
PRs to the packages I contributed reviews to:
https://github.com/kubernetes/kubernetes/pull/113166
https://github.com/kubernetes/kubernetes/pull/110294
2022-10-20 09:22:35 +02:00
Kubernetes Prow Robot
28ced69b76
Merge pull request #113054 from logicalhan/proxy-metric
...
remove rate limiter metric as it is not in use
2022-10-17 11:09:18 -07:00
Han Kang
2bbd445f50
remove rate limiter metric as it is not in use
...
Change-Id: I91157653e3860eeecc3f572aee88da6ffc65faed
2022-10-13 13:07:11 -07:00
Michal Wozniak
b64e5b2d15
Fix the occasional double-counting job_finished_total metric
...
The reason for the issue is that the metrics were bumped before the
final job status update. In case the update failed the path was
repeated by the next syncJob leading to double-counting of the metrics.
The solution is to delay recording metrics and broadcasting events
until after the job status update succeeds.
2022-10-13 17:23:03 +02:00
Kubernetes Prow Robot
afebf498d7
Merge pull request #111314 from BinacsLee/binacs/cleanup-use-clone-to-avoid-interim-slice
...
cleanup: use sets.Clone() to avoid interim slice
2022-10-04 07:34:22 -07:00
Aldo Culquicondor
848eece7b7
Add alculquicondor to job OWNERS
...
Change-Id: If974f0890ef4accbd7d2111fb1a1aa38718dc74b
2022-08-26 11:29:37 -04:00
Aldo Culquicondor
c1e0dac461
Fix deleting UIDs tracking expectations
...
Change-Id: I5dad644cf5cb232ebed0950a14b35a781a38eeb0
2022-08-05 12:37:31 -04:00
Michal Wozniak
bf9ce70de3
Support handling of pod failures with respect to the specified rules
2022-08-04 18:39:08 +02:00
Aldo Culquicondor
ca8cebe5ba
Fix JobTrackingWithFinalizers when a pod succeeds after the job fails
...
Change-Id: I3be351fb3b53216948a37b1d58224f8fbbf22b47
2022-08-02 19:33:06 -04:00
Davanum Srinivas
a9593d634c
Generate and format files
...
- Run hack/update-codegen.sh
- Run hack/update-generated-device-plugin.sh
- Run hack/update-generated-protobuf.sh
- Run hack/update-generated-runtime.sh
- Run hack/update-generated-swagger-docs.sh
- Run hack/update-openapi-spec.sh
- Run hack/update-gofmt.sh
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2022-07-26 13:14:05 -04:00
BinacsLee
ae0c7b1ffb
cleanup: use sets.Clone() to avoid interim slice
2022-07-21 20:21:01 +08:00
Aldo Culquicondor
b492f49c9f
Do not skip job requeue in conflict error
...
Change-Id: Ie97977887a1cc3de58922d73dce92ae1965965bf
2022-07-08 16:14:32 +00:00
Aldo Culquicondor
62a25920e6
Wait for cache sync in TestSyncPastDeadlineJobFinished
...
Change-Id: I6f023ca6999108f4f86a0f57831d47704cdbb42b
2022-06-24 09:22:59 -04:00
Aldo Culquicondor
817c8bbf59
Increase timeout for TestSyncPastDeadlineJobFinished
...
To mitigate flakiness
Change-Id: I1d0286d16d2b7dd3a605690e9a2d4d2f954701ff
2022-06-21 14:49:10 -04:00
Harsha Narayana
eea7dca085
GIT-110239: fix activeDeadlineSeconds enforcement bug
...
GIT-110239: add additional tests with preset Status.StartTime
2022-06-13 20:06:44 +05:30
Kubernetes Prow Robot
6cd258f9f5
Merge pull request #110292 from mimowo/109904-avoid-duplicate-conditions
...
Avoid duplicate Failed conditions in job status
2022-06-09 14:01:45 -07:00
Michal Wozniak
e298649b6c
Avoid duplicate conditions by updating the pre-existing failed condition
...
in case its status is False or Unknown.
In case the status of the pre-existing condition is true we ignore the new
condition. If there is no pre-existing failed condition, then append
the new failed condition as before.
Also, make the condition comparisons less hacky by ignoring timestamp fields
in tests.
2022-06-01 19:32:53 +02:00
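The rule described in commit e298649b6c (update an existing condition whose status is False or Unknown, ignore the new one if the existing status is True, otherwise append) can be sketched in Go as follows. The `Condition` type and the `ensureCondition` name are simplified, hypothetical stand-ins used only for illustration. They are not the Kubernetes API types or the controller's actual helper.

```go
package main

import "fmt"

// ConditionStatus and Condition are simplified stand-ins for the real
// job condition types; they exist only to keep this sketch self-contained.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

type Condition struct {
	Type   string
	Status ConditionStatus
	Reason string
}

// ensureCondition (hypothetical name) applies the deduplication rule:
//   - an existing condition of the same type with status True wins,
//     and the new condition is ignored;
//   - an existing condition with status False or Unknown is overwritten;
//   - if no condition of that type exists, the new one is appended.
//
// It returns the resulting list and whether the list changed.
func ensureCondition(list []Condition, newCond Condition) ([]Condition, bool) {
	for i := range list {
		if list[i].Type != newCond.Type {
			continue
		}
		if list[i].Status == ConditionTrue {
			return list, false
		}
		list[i] = newCond
		return list, true
	}
	return append(list, newCond), true
}

func main() {
	conds := []Condition{{Type: "Failed", Status: ConditionFalse, Reason: "Old"}}
	conds, changed := ensureCondition(conds, Condition{Type: "Failed", Status: ConditionTrue, Reason: "BackoffLimitExceeded"})
	fmt.Println(len(conds), changed, conds[0].Reason)
}
```

Because the match is by condition type, the list never ends up with two entries of the same type.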
Wojciech Tyczyński
11b679c66a
Fix event broadcaster shutdown in multiple controllers
2022-05-17 22:14:19 +02:00
Aldo Culquicondor
a5f5eab5fd
Wait for cache to sync in job's TestWatchOrphanPods
...
Otherwise the event handler might not be called.
Change-Id: I23c93c2251b411430a0f2469686db6355d84af2f
2022-05-10 14:18:21 -04:00
Aldo Culquicondor
09caa36718
Fix removing finalizer from finished jobs
...
In some rare race conditions, the job controller might create new pods after the job is declared finished.
Change-Id: I8a00429c8845463259cd7f82bb3c241d0011583c
2022-04-20 16:39:10 -04:00
Aldo Culquicondor
53aa05df3a
Don't mark job as failed until expectations are satisfied
...
Change-Id: I99206f35f6f145054c005ab362c792e71b9b15f4
2022-04-20 16:39:10 -04:00
Aldo Culquicondor
8c00f510ef
Graduate JobReadyPods to beta
...
Set podUpdateBatchPeriod to 1s
Change-Id: I8a10fd8f8559adad9df179b664b8c82851607855
2022-03-29 10:07:41 -04:00
Aldo Culquicondor
8776931abb
Remove finalizer when orphaned
...
Change-Id: Id88a28755660812a274dffab2693cb8a0ef4235c
2022-03-24 11:57:51 -04:00
Aldo Culquicondor
211e33d93f
Fix: Clean job tracking finalizer from orphan pods
...
Change-Id: I04cd70725fd1830be8daf2dca53f67bc10a379b7
2022-03-24 11:57:51 -04:00
Aldo Culquicondor
2c5d0a273c
Graduate IndexedJob to stable
...
- Lock feature gate to true and schedule for deletion in 1.26
- Remove checks on feature gate
- Graduate E2E test to Conformance
Change-Id: I6814819d318edaed5c86dae4055f4b050a4d39fd
2022-03-15 13:41:06 -04:00
Abdullah Gharaibeh
b2d2ec9e76
Graduate SuspendJob to GA
2022-02-15 10:46:13 -05:00
Davanum Srinivas
9682b7248f
OWNERS cleanup - Jan 2021 Week 1
...
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2022-01-10 08:14:29 -05:00
Davanum Srinivas
9405e9b55e
Check in OWNERS modified by update-yamlfmt.sh
...
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2021-12-09 21:31:26 -05:00
Mike Dame
80c01707e0
Wire contexts to Batch controllers (#105491)
...
* Wire contexts to Batch controllers
* (hold) feedback + updates that overlap with Apps controllers
* fixup errors
2021-11-10 14:56:46 -08:00
Kubernetes Prow Robot
8e37a3b324
Merge pull request #103868 from qingsenLi/210723-forget
...
Merge conditional assignment into variable declaration
2021-10-28 16:32:50 -07:00
Aldo Culquicondor
60fc90967b
Count ready pods in job controller
...
When the feature gate JobReadyPods is enabled.
Change-Id: I86f93914568de6a7029f9ae92ee7b749686fbf97
2021-10-19 15:18:37 -04:00
Kubernetes Prow Robot
0bfa37dfcc
Merge pull request #105676 from alculquicondor/job-name
...
Fix name for Pods of NonIndexed Jobs
2021-10-14 10:50:12 -07:00
Aldo Culquicondor
4ef9d18abe
Fix name for Pods of NonIndexed Jobs
...
Change-Id: I0ea4685a82f4cdec0caab362d52144476652f95a
2021-10-14 10:55:46 -04:00
Kubernetes Prow Robot
f27e4714ba
Merge pull request #105377 from damemi/wire-contexts-apps
...
Wire contexts to Apps controllers
2021-10-14 06:59:19 -07:00
Mike Dame
41fcb95f2f
Wire contexts to Apps controllers
2021-10-13 16:32:13 -04:00
Aldo Culquicondor
5929ccd391
Track expected removals of Pod finalizers
...
Add the UIDs of Pods for which we are removing finalizers to an in-memory cache.
The controller removes UIDs from the cache as Pod updates or deletes come in.
This avoids double counting finished Pods when Pod updates arrive after Job status updates.
https://github.com/kubernetes/kubernetes/issues/105200
2021-10-04 16:09:58 -04:00
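The in-memory cache described in commit 5929ccd391 can be sketched in Go roughly as follows. This is a simplified illustration under stated assumptions: the type `uidTracker` and its method names are hypothetical, and the real controller's data structure and API may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// uidTracker is a hypothetical, simplified cache of pod UIDs whose
// finalizer removal has been requested but not yet observed.
type uidTracker struct {
	mu      sync.Mutex
	pending map[string]map[string]struct{} // job key -> set of pod UIDs
}

func newUIDTracker() *uidTracker {
	return &uidTracker{pending: make(map[string]map[string]struct{})}
}

// expectRemoval records UIDs right before the controller removes their
// finalizers, so that a later sync does not count those pods again.
func (t *uidTracker) expectRemoval(jobKey string, uids ...string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	set, ok := t.pending[jobKey]
	if !ok {
		set = make(map[string]struct{})
		t.pending[jobKey] = set
	}
	for _, uid := range uids {
		set[uid] = struct{}{}
	}
}

// observeRemoval is called when a pod update or delete event arrives;
// it drops the UID and frees the job's entry once it is empty.
func (t *uidTracker) observeRemoval(jobKey, uid string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	set := t.pending[jobKey]
	delete(set, uid)
	if len(set) == 0 {
		delete(t.pending, jobKey)
	}
}

// isPending reports whether the pod was already accounted for and is
// only waiting for its finalizer removal to be observed.
func (t *uidTracker) isPending(jobKey, uid string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	_, ok := t.pending[jobKey][uid]
	return ok
}

func main() {
	t := newUIDTracker()
	t.expectRemoval("default/my-job", "uid-1", "uid-2")
	fmt.Println(t.isPending("default/my-job", "uid-1"))
	t.observeRemoval("default/my-job", "uid-1")
	fmt.Println(t.isPending("default/my-job", "uid-1"))
}
```

A sync that finds a finished pod with `isPending` returning true would skip it when counting, which is how the double counting mentioned in the commit message is avoided in this sketch.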
Aldo Culquicondor
95c2a8024c
Parallelize pod updates in job test
...
To potentially reduce the number of job controller syncs.
Also reduce the maximum number of pods to sync in tests.
2021-10-01 09:55:53 -04:00
Aldo Culquicondor
a438f16741
Revert "Revert "Add metric job_pod_finished""
...
This reverts commit 7868fbbe64.
2021-09-23 12:56:29 -04:00
Aldo Culquicondor
47a957d163
Revert "Revert "Limit number of Pods counted in a single Job sync""
...
This reverts commit 8bcb780808.
2021-09-23 12:56:29 -04:00
Aldo Culquicondor
01f27cd93e
Fix log line for target number of running pods
2021-09-23 12:56:29 -04:00
Aldo Culquicondor
eebd678cda
Remove GET job and retries for status updates.
...
Doing a GET right before retrying has 2 problems:
- It can mask conflicts
- It adds an additional delay
As for retries, we are better off going through the sync backoff.
In the case of conflict, we know that there was a Job update that would trigger another sync, so there is no need to do a rate limited requeue.
2021-09-23 11:48:34 -04:00