This commit makes the job controller re-honor exponential backoff for
failed pods. Before this commit, the controller created replacement pods
without any backoff; this is a regression, since the controller previously
created them with an exponential backoff delay (10s, 20s, 40s, ...).
The issue occurs only when the JobTrackingWithFinalizers feature is
enabled (which is enabled by default right now). With this feature, we
get an extra pod update event when the finalizer of a failed pod is
removed.
Note that the pod failure detection and new pod creation happen in the
same reconcile loop, so the 2nd pod is created immediately after the 1st
pod fails. The backoff is only applied from the 2nd pod failure onwards,
which means that the 3rd pod is created 10s after the 2nd pod, the 4th
pod 20s after the 3rd pod, and so on.
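For illustration, a minimal sketch of the backoff sequence being restored;
the constants and helper below are assumptions for the sketch, not the
controller's actual code, which derives the delay from its work queue's
rate limiter:
```
package main

import (
	"fmt"
	"time"
)

// Assumed constants; the job controller's real base and cap may differ.
const (
	defaultJobBackOff = 10 * time.Second
	maxJobBackOff     = 360 * time.Second
)

// backoffFor returns the delay before creating the next replacement pod,
// matching the sequence described above: the 2nd pod is created immediately,
// the 3rd after 10s, the 4th after 20s, and so on.
func backoffFor(consecutiveFailures int) time.Duration {
	if consecutiveFailures <= 1 {
		return 0
	}
	d := defaultJobBackOff * time.Duration(1<<uint(consecutiveFailures-2))
	if d > maxJobBackOff {
		return maxJobBackOff
	}
	return d
}

func main() {
	for failures := 1; failures <= 7; failures++ {
		fmt.Printf("after failure #%d: requeue in %v\n", failures, backoffFor(failures))
	}
}
```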
This commit fixes a few bugs:
1. Right now, each time `uncounted != nil` and the job does not see a
_new_ failure, `forget` is set to true and the job is removed from the
queue. This means the condition is also triggered every time the
finalizer of a failed pod is removed, so `NumRequeues` is reset and the
resulting backoff is 0s.
2. Updates `updatePod` to only apply backoff when it sees a particular
pod fail for the first time. This is necessary to ensure that the
controller does not apply backoff when it sees a pod update event
for the finalizer removal of a failed pod.
3. If `JobsReadyPods` feature is enabled and backoff is 0s, the job is
now enqueued after `podUpdateBatchPeriod` seconds, instead of 0s. The
unit test for this check also had a few bugs:
- `DefaultJobBackOff` is overwritten to 0 in certain unit tests, so the
backoff under test was effectively 0 and the check did not verify
anything meaningful.
- `JobsReadyPods` was not enabled for the test cases that required the
feature gate to be enabled.
- The check for expected and actual backoff had incorrect
calculations.
Marking the pods on a node as not ready requires looping over them and
updating each pod's status one at a time. This is performed serially,
and can take a while if we're processing each node serially as well.
Since the time is spent waiting on I/O, there's an opportunity to go
faster by processing multiple nodes concurrently. This change modifies
the loop to process nodes in parallel, using the same number of workers
as doNodeProcessingPassWorker.
This change also introduces histogram metrics to better observe
monitorNodeHealth.
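As a rough sketch of the approach (not the actual node lifecycle
controller code; `markPodsNotReady`, the node list and the worker count
are stand-ins), the fan-out can be expressed with client-go's
`workqueue.ParallelizeUntil`:
```
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// Assumed worker count, matching the idea of reusing the
// doNodeProcessingPassWorker pool size.
const nodeUpdateWorkerSize = 8

// markPodsNotReady stands in for the serial, I/O-bound pod status updates
// performed for a single node.
func markPodsNotReady(node string) {
	time.Sleep(50 * time.Millisecond) // simulated API round-trips
	fmt.Println("marked pods not ready on", node)
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c", "node-d"}

	// Process nodes concurrently instead of one at a time; each node's pods
	// are still updated serially, but the I/O waits now overlap.
	workqueue.ParallelizeUntil(context.TODO(), nodeUpdateWorkerSize, len(nodes), func(i int) {
		markPodsNotReady(nodes[i])
	})
}
```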
The fake client doesn't guarantee that the informer cache is updated.
If it's not up-to-date, the controller always tries to set the
StartTime, leading to a broken test.
Change-Id: I71f26d46ea44beff88f0d03517985348654aec95
* Add tracker types and tests
* Modify ResourceEventHandler interface's OnAdd member
* Add additional ResourceEventHandlerDetailedFuncs struct
* Fix SharedInformer to let users track HasSynced for their handlers (sketched below)
* Fix in-tree controllers which weren't computing HasSynced correctly
* Deprecate the cache.Pop function
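A minimal sketch of the per-handler HasSynced tracking mentioned above,
assuming a client-go release in which `AddEventHandler` returns a
registration handle:
```
package main

import (
	"fmt"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/tools/cache"
)

func main() {
	factory := informers.NewSharedInformerFactory(fake.NewSimpleClientset(), 0)
	podInformer := factory.Core().V1().Pods().Informer()

	// AddEventHandler now returns a registration handle whose HasSynced
	// reports whether *this handler* has processed the initial list.
	reg, err := podInformer.AddEventHandler(cache.ResourceEventHandlerDetailedFuncs{
		AddFunc: func(obj interface{}, isInInitialList bool) {
			fmt.Println("add event, part of initial list:", isInInitialList)
		},
	})
	if err != nil {
		panic(err)
	}

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)

	// Wait on the handler's own HasSynced instead of the informer's store.
	if !cache.WaitForCacheSync(stopCh, reg.HasSynced) {
		panic("handler never synced")
	}
	fmt.Println("handler has seen the initial list")
}
```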
Since https://github.com/kubernetes/kubernetes/pull/112648, we can
efficiently handle selectors from pre-existing `map[string]string`,
making the cache obsolete.
Benchmark:
```
name                         old time/op    new time/op    delta
GetPodServiceMemberships-48    189µs ± 1%     193µs ± 1%  +2.10%  (p=0.000 n=10+10)

name                         old alloc/op   new alloc/op   delta
GetPodServiceMemberships-48   59.0kB ± 0%    58.9kB ± 0%  -0.09%  (p=0.000 n=9+9)

name                         old allocs/op  new allocs/op  delta
GetPodServiceMemberships-48    1.02k ± 0%     1.02k ± 0%    ~     (all equal)
```
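For illustration only, a small sketch of building a selector directly from
an existing `map[string]string`; the exact helper introduced by the linked
PR may differ from the one used here:
```
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/labels"
)

func main() {
	serviceSelector := map[string]string{"app": "web", "tier": "frontend"}
	podLabels := labels.Set{"app": "web", "tier": "frontend", "pod-template-hash": "abc"}

	// A selector built from an already-validated map skips re-parsing,
	// which is what makes caching the selectors unnecessary.
	sel := labels.Set(serviceSelector).AsSelectorPreValidated()
	fmt.Println("pod matches service:", sel.Matches(podLabels))
}
```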
Endpoints generated by the endpoints controller are in canonical form;
custom Endpoints, however, may not be. (There was a time when they were
canonicalized in the apiserver, but this caused performance issues
because the endpoints controller kept updating them: the Endpoints it
created differed from the stored ones due to the canonicalization.)
There are cases where a custom Endpoints object leads the controller to
generate multiple slices, for example when the same address is present
in different subsets.
The endpointslice mirroring controller should canonicalize the Endpoints
subsets before it starts processing them, so that the generated slices
are consistent. There is no risk of hot-looping because the Endpoints
object is only used as input.
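A rough sketch of what canonicalizing the subsets before processing could
look like; `canonicalizeSubsets` is an invented helper, and the real change
presumably relies on Kubernetes' existing endpoints sorting/repacking
utilities:
```
package main

import (
	"fmt"
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// canonicalizeSubsets sorts addresses and ports inside each subset so that
// the same custom Endpoints object always yields the same mirrored slices.
func canonicalizeSubsets(subsets []corev1.EndpointSubset) []corev1.EndpointSubset {
	for i := range subsets {
		sort.Slice(subsets[i].Addresses, func(a, b int) bool {
			return subsets[i].Addresses[a].IP < subsets[i].Addresses[b].IP
		})
		sort.Slice(subsets[i].Ports, func(a, b int) bool {
			return subsets[i].Ports[a].Port < subsets[i].Ports[b].Port
		})
	}
	return subsets
}

func main() {
	subsets := []corev1.EndpointSubset{{
		Addresses: []corev1.EndpointAddress{{IP: "10.0.0.2"}, {IP: "10.0.0.1"}},
		Ports:     []corev1.EndpointPort{{Name: "https", Port: 443}, {Name: "http", Port: 80}},
	}}
	fmt.Println(canonicalizeSubsets(subsets))
}
```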
Change-Id: I2a8cd53c658a640aea559a88ce33e857fa98cc5c
This ensures that the daemonset controller updates daemonset statuses in
a best-effort manner even if syncDaemonSet fails.
In order to add an integration test, this also replaces
`cmd/kube-apiserver/app/testing.StartTestServer` with
`test/integration/framework.StartTestServer` and adds
`setupWithServerSetup` to configure the admission control of the
apiserver.
* Wait for Pods to finish before considering Failed
Limit the behavior to the PodDisruptionConditions and JobPodFailurePolicy
feature gates and to Jobs with a podFailurePolicy.
Change-Id: I926391cc2521b389c8e52962afb0d4a6a845ab8f
* Remove check for unscheduled terminating pod
Change-Id: I3dc05bb4ea3738604f01bf8cb5fc8cc0f6ea54ec
The controller uses the exact same logic as the generic ephemeral inline volume
controller, just for inline ResourceClaimTemplate -> ResourceClaim.
In addition, it supports removal of pods from the ReservedFor field when those
pods are known to no longer need the claim. At the moment, only this special
case is supported. Removal of arbitrary objects would imply granting full read
access to all types to determine whether a) an object is gone and b) the
current incarnation is the one listed in ReservedFor. This may get added
later.
In the long term, the complex defer function makes the code harder to
maintain, because any code added after it has to take it into account.
This removes the complex defer function that updates the status of a
StatefulSet.
Patch requests do not include the resourceVersion by default; we need to add it explicitly. Patching a list also overwrites the whole field, which means there is a race condition in which we can overwrite changes to taints that happened between the GET and the PATCH request.
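A condensed sketch of the guard this implies: pin the resourceVersion from
the GET inside the patch body so that a concurrent taint update turns the
PATCH into a conflict instead of a silent overwrite (the helper and values
below are illustrative, not the controller's actual code):
```
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// taintPatch builds a merge-patch body that replaces the taints list but
// also pins the resourceVersion observed in the preceding GET, so the
// apiserver rejects the PATCH with a conflict if the node changed in
// between.
func taintPatch(observedResourceVersion string, taints []corev1.Taint) ([]byte, error) {
	return json.Marshal(map[string]interface{}{
		"metadata": map[string]interface{}{
			"resourceVersion": observedResourceVersion,
		},
		"spec": map[string]interface{}{
			"taints": taints,
		},
	})
}

func main() {
	p, err := taintPatch("42", []corev1.Taint{{Key: "node.example/raining", Effect: corev1.TaintEffectNoSchedule}})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(p))
}
```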
We should not rely on syncUnboundClaim() doing nothing after it updates a
PVC with a default storage class until the next re-sync; instead, restart
the sync explicitly to make sure we hit isDelayBindingMode() and
findBestMatchForClaim() immediately after the PVC update.
The reason for the issue is that the metrics were bumped before the
final job status update. If the update failed, the path was repeated by
the next syncJob, leading to double-counting of the metrics.
The solution is to delay recording metrics and broadcasting events until
after the job status update succeeds.
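In shape, the fix amounts to something like the following (helper names
are invented for the sketch):
```
package main

import (
	"errors"
	"fmt"
)

// recordJobFinished mirrors the ordering described above: metrics and
// events are only emitted once the status update has succeeded.
func recordJobFinished(updateStatus func() error, recordMetricsAndEvents func()) error {
	if err := updateStatus(); err != nil {
		// The next syncJob repeats the whole path; emitting nothing here
		// is what prevents double-counting on retry.
		return err
	}
	recordMetricsAndEvents()
	return nil
}

func main() {
	err := recordJobFinished(
		func() error { return errors.New("conflict") }, // simulated failed update
		func() { fmt.Println("metrics recorded") },
	)
	fmt.Println("first attempt:", err) // nothing recorded yet

	_ = recordJobFinished(
		func() error { return nil }, // retry succeeds
		func() { fmt.Println("metrics recorded") },
	)
}
```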
Improve error logging from timed workers which are used for pod eviction
Co-authored-by: Aldo Culquicondor <1299064+alculquicondor@users.noreply.github.com>
Update the maximum sync backoff value to 1000s to match the sequence of
delays expected by the endpointslice controller when syncing Services:
Before this change the sequence was:
> 1s, 2s, 4s, 8s, 16s, 32s, 64s, 100s
Now it is:
> 1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s, 512s, 1000s
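For reference, this is roughly how such a sequence falls out of a per-item
exponential rate limiter capped at 1000s; the item key below is made up:
```
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Base delay 1s, maximum delay 1000s: repeated failures of the same
	// item produce 1s, 2s, 4s, ..., 512s and then clamp at 1000s.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, 1000*time.Second)
	for i := 0; i < 11; i++ {
		fmt.Println(limiter.When("default/my-service"))
	}
}
```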
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
Fixes instances of #98213 (to ultimately complete #98213, linting is
required).
This commit fixes a few instances of a common mistake made when writing
parallel subtests or Ginkgo tests (basically any test in which the test
closure is dynamically created in a loop and the loop doesn't wait for
the test closure to complete).
I'm developing a very specific linter that detects this kind of mistake,
and these are the only violations it found in this repo (it's not
airtight, so there may be more).
In the case of Ginkgo tests, without this fix, only the last entry the
loop iterates over is actually tested. In the case of parallel tests I
think it's the same problem, though perhaps slightly different; if I
understand correctly, it depends on execution speed.
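The general shape of the mistake and its fix, for a parallel subtest (this
is a generic example under pre-Go 1.22 loop semantics, not one of the
actual tests touched here):
```
package example

import "testing"

func TestParallelCapture(t *testing.T) {
	testCases := []struct {
		name, input string
	}{
		{"first", "a"},
		{"second", "b"},
	}

	for _, tc := range testCases {
		tc := tc // rebind the loop variable; this is the essence of the fix
		t.Run(tc.name, func(t *testing.T) {
			t.Parallel()
			// Without the rebinding above, the parallel body runs after the
			// loop has finished, so both subtests would see tc.input == "b".
			if tc.input == "" {
				t.Fatalf("unexpected empty input for %s", tc.name)
			}
		})
	}
}
```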
I'm waiting for the CI to confirm that the tests still pass after this
fix: since it's likely the first time those test cases are actually
executed, they may be buggy, or the code they test may be buggy.
Another instance of this is in `test/e2e/storage/csi_mock_volume.go`; it
is still failing, so it has been left out of this commit and will be
addressed in a separate one.
To be able to implement controllers that dynamically decide which
resources to watch, it must be possible to get rid of dedicated watches
and event handlers again. This requires the ability to remove event
handlers from SharedIndexInformers.
Stopping an informer is not sufficient, because there might be multiple
controllers in a controller manager that independently decide which
resources to watch.
Unfortunately, the ResourceEventHandler interface encourages the use of
value objects for handlers (like the ResourceEventHandlerFuncs struct,
which uses value receivers to implement the interface). Go does not
support comparison of function pointers, so comparing such structs is
not possible either. To be able to remove all kinds of handlers, and to
solve the problem of handlers being registered multiple times, a
registration handle is introduced. It is returned when adding a handler
and can later be used to remove the registration again. The handle
directly stores the created listener to simplify deletion.
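A minimal sketch of how the handle is intended to be used, assuming a
client-go release that exposes `RemoveEventHandler` on the shared
informer:
```
package main

import (
	"fmt"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/tools/cache"
)

func main() {
	factory := informers.NewSharedInformerFactory(fake.NewSimpleClientset(), 0)
	informer := factory.Core().V1().Pods().Informer()

	// The handle identifies this specific registration, even though the
	// handler itself is a non-comparable value object.
	handle, err := informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) { fmt.Println("pod added") },
	})
	if err != nil {
		panic(err)
	}

	// Later, when the controller no longer wants to watch this resource:
	if err := informer.RemoveEventHandler(handle); err != nil {
		panic(err)
	}
}
```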
The priority is determined as follows (a comparator sketch follows this list):
P0: ClusterCIDR with higher number of matching labels has highest
priority.
P1: ClusterCIDR having cidrSet with fewer allocatable Pod CIDRs has
higher priority.
P2: ClusterCIDR with a PerNodeMaskSize having fewer IPs has higher
priority.
P3: ClusterCIDR having label with lower alphanumeric value has higher
priority.
P4: ClusterCIDR with a cidrSet having a smaller IP address value has
higher priority.
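Expressed as a less-than function over a simplified stand-in type (the
field names are invented for illustration, and the real code compares IP
values numerically rather than as strings), the ordering is:
```
package main

import (
	"fmt"
	"sort"
)

type candidate struct {
	matchingLabels   int    // P0: more matching labels wins
	allocatableCIDRs int    // P1: fewer allocatable Pod CIDRs wins
	perNodeIPs       int    // P2: fewer IPs per node wins
	labelValue       string // P3: lower alphanumeric label value wins
	cidr             string // P4: smaller IP address value wins (string compare here only for the sketch)
}

func higherPriority(a, b candidate) bool {
	if a.matchingLabels != b.matchingLabels {
		return a.matchingLabels > b.matchingLabels
	}
	if a.allocatableCIDRs != b.allocatableCIDRs {
		return a.allocatableCIDRs < b.allocatableCIDRs
	}
	if a.perNodeIPs != b.perNodeIPs {
		return a.perNodeIPs < b.perNodeIPs
	}
	if a.labelValue != b.labelValue {
		return a.labelValue < b.labelValue
	}
	return a.cidr < b.cidr
}

func main() {
	cands := []candidate{
		{1, 256, 16, "rack=b", "10.1.0.0/16"},
		{2, 256, 16, "rack=a", "10.0.0.0/16"},
	}
	sort.Slice(cands, func(i, j int) bool { return higherPriority(cands[i], cands[j]) })
	fmt.Println("picked:", cands[0].cidr)
}
```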
Add a new cidrset named `multicidrset` which extends the current
cidrset mechanism to track allocatable Pod and Service CIDRs.
multicidrset stores the info about allocated CIDRs in a map, as opposed
to the current cidrset implementation, which stores it in a bitmap.
Add a new call to the VolumePlugin interface and change all its
implementations.
Kubelet's VolumeManager needs to know whether a volume supports mounting
with -o context=XYZ in order to handle SetUp() / MountDevice()
accordingly.
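A hypothetical sketch of what such an interface addition could look like;
the actual method name and signature in pkg/volume may differ:
```
package volume

// Spec is a simplified stand-in for the volume spec passed to plugins.
type Spec struct {
	Name string
}

// SELinuxContextMountSupport captures what the VolumeManager wants to know
// before deciding how to run SetUp()/MountDevice() for a volume.
type SELinuxContextMountSupport interface {
	// SupportsSELinuxContextMount reports whether the volume described by
	// spec can be mounted with -o context=<label> directly.
	SupportsSELinuxContextMount(spec *Spec) (bool, error)
}
```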
In future commits we will need this to set the user/group of supported
volumes of KEP 127 - Phase 1.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>