When a ServiceCIDR is deleted, the service CIDR controller in the
controller manager verifies that it is safe to delete before removing
the finalizer. However, since the deletion information takes time to
propagate, there can be a race where the apiserver allocators have not
yet received the deletion and assign an IP address that ends up
orphaned.
To avoid this race, the service CIDR controller waits a grace period
before removing the finalizer, to ensure the allocators do not assign
any new IP address from that range before it is completely deleted.
Change-Id: Ib34d32c0bdde91c6e84f1d056db9374589b25c0b
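As a rough illustration of the grace-period idea above (a minimal sketch with a hypothetical canRemoveFinalizer helper and an illustrative grace period; not the controller's actual code):

```go
package servicecidr

import (
	"time"

	networkingv1alpha1 "k8s.io/api/networking/v1alpha1"
)

// deletionGracePeriod is illustrative only, not the controller's real value.
const deletionGracePeriod = 10 * time.Second

// canRemoveFinalizer sketches the check: the finalizer is only removed once
// the ServiceCIDR has been marked for deletion long enough for the apiserver
// allocators to observe the deletion, and no orphan IPAddresses remain.
func canRemoveFinalizer(cidr *networkingv1alpha1.ServiceCIDR, hasOrphanIPs bool) bool {
	if cidr.DeletionTimestamp == nil || hasOrphanIPs {
		return false
	}
	return time.Since(cidr.DeletionTimestamp.Time) >= deletionGracePeriod
}
```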
Controls the lifecycle of ServiceCIDRs: adds finalizers and sets the
Ready condition in status when they are created, and removes the
finalizers once it is safe to do so (no orphan IPAddresses).
An IPAddress is orphan if there is no ServiceCIDR containing it.
Change-Id: Icbe31e1ed8525fa04df3b741c8a817e5f2a49e80
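A minimal sketch of the orphan check above, assuming the CIDR ranges are already parsed (the real controller lists ServiceCIDR objects through informers):

```go
package servicecidr

import "net/netip"

// isOrphan reports whether ip is not contained in any known ServiceCIDR range.
func isOrphan(ip netip.Addr, serviceCIDRs []netip.Prefix) bool {
	for _, cidr := range serviceCIDRs {
		if cidr.Contains(ip) {
			return false
		}
	}
	return true
}
```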
The informer was not initialized, so no assertion was performed before. Fixed this now.
Then fixed the test failure by using NewAttachDetachController to initialize adc.
update pod replacement policy feature flag comment and refactor the e2e test for pod replacement policy
minor fixes for pod replacement policy and e2e test
fix wrong assertions for pod replacement policy e2e test
more fixes to pod replacement policy e2e test
refactor PodReplacementPolicy e2e test to use finalizers
fix unit tests when pod replacement policy feature flag is promoted to beta
fix podgc controller unit tests when pod replacement feature is enabled
fix lint issue in pod replacement policy e2e test
assert no error in defer function for removing finalizer in pod replacement policy e2e test
implement test using a sh trap for pod replacement policy
reduce sleep after SIGTERM in pod replacement policy e2e test to 5s
* cleanup: refactor pod replacement policy integration test into staged assertion
* cleanup: fix typo in job_test.go
* refactor PodReplacementPolicy test and remove test for defaulting the policy
* fix issue with missing update in job controller for terminating status and refactor pod replacement policy integration test
* use t.Cleanup instead of defer in PodReplacementPolicy integration tests
* revert t.Cleanup to defer for resetting feature flag in PodReplacementPolicy integration tests
Initially this method returned the number of missed schedules, but
that turned out to be unreliable for some complex schedules, for
example those which run only during weekdays. The second approach was
to return only a boolean indicating whether too many schedules were
missed. It turns out that we need to return all three values: none
missed, few missed, and many missed, to let consumers know what to do
without leaking the wrong number out of mostRecentScheduleTime.
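In other words, the result becomes a small three-valued classification rather than a count, roughly like this (a sketch; the names may not match the actual code):

```go
package cronjob

// missedSchedulesType classifies how many schedules were missed, so callers
// of mostRecentScheduleTime can decide what to do without relying on an
// exact (and potentially wrong) count.
type missedSchedulesType int

const (
	noneMissed missedSchedulesType = iota
	fewMissed
	manyMissed
)
```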
* Job: Handle error returned from AddEventHandler function
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Use an error message similar to the CronJob one
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Clean up error messages
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Make the testing.T the second argument of the newControllerFromClient function
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Make the testing.T the second argument of the newControllerFromClientWithClock function
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Call t.Helper()
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Make the testing.TB the second argument of the createJobControllerWithSharedInformers function and call tb.Helper() there
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Make the testing.TB the second argument of the startJobControllerAndWaitForCaches function and call tb.Helper() there
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Adapt TestFinializerCleanup to the eventhandler error
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
---------
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
KEP-2593 proposed to expand the existing node-ipam controller
to be configurable via ClusterCIDR objects. However, there
were reasonable doubts in the SIG about the feature, and after
several months of discussions we decided not to move forward
with the KEP in-tree; hence, we are going to remove the existing
code, which is still in alpha.
https://groups.google.com/g/kubernetes-sig-network/c/nts1xEZ--gQ/m/2aTOUNFFAAAJ
Change-Id: Ieaf2007b0b23c296cde333247bfb672441fe6dfc
Have the HPA always update both the metrics and the replica count. This
fixes an edge-case behavior bug where the metrics would not be updated
if a custom metric was unavailable.
* Added locks on request tracker before accessing fields
The StatefulSetAutoDeletePVCEnabled unit test has been
flaking with a DATA RACE. Added a lock on the request tracker
before accessing the err field.
* Addressed review comments for the PR: Added locks on request tracker before accessing fields
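The shape of the fix, sketched with illustrative type and field names:

```go
package statefulset

import "sync"

// requestTracker sketch: embed a mutex and take it around every access to the
// shared err field so concurrent test goroutines no longer race.
type requestTracker struct {
	sync.Mutex
	err error
}

func (rt *requestTracker) getErr() error {
	rt.Lock()
	defer rt.Unlock()
	return rt.err
}

func (rt *requestTracker) setErr(err error) {
	rt.Lock()
	defer rt.Unlock()
	rt.err = err
}
```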
- this function is used by other packages and was mistakenly removed
in 397cc73dc9
- let the resource quota controller use this constructor instead of
instantiating the object directly
A volume that failed Detach() should not be marked as attached; the CSI
external-attacher is probably still trying to detach it.
Mark it uncertain instead and wait for Detach() to succeed.
This uses the generic ptr.To in k8s.io/utils to replace functions and
code constructs which only serve to return pointers to intstr
values. Other uses of the deprecated pointer package are updated in
modified files.
Signed-off-by: Stephen Kitt <skitt@redhat.com>
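A small example of the pattern (the variable is arbitrary): ptr.To works for any value, so dedicated pointer helpers for intstr are no longer needed.

```go
package example

import (
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/utils/ptr"
)

// Before, a dedicated helper existed just to return a *intstr.IntOrString.
// With the generic helper, call sites can simply write:
var maxUnavailable *intstr.IntOrString = ptr.To(intstr.FromString("25%"))
```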
When the resource claim name inside the pod had some suffix like "1a" in
"resource-1a", the generated name suffix got added directly after that, leading
to "my-pod-resource-1ax6zgt".
Adding another hyphen makes the result more readable: "my-pod-resource-1a-x6zgt".
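Illustratively, the generated name template now ends with a hyphen before the random suffix (a hypothetical helper, not the actual code):

```go
package example

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// claimObjectMeta sketches the change: the trailing hyphen in GenerateName
// yields "my-pod-resource-1a-x6zgt" instead of "my-pod-resource-1ax6zgt".
func claimObjectMeta(podName, claimName string) metav1.ObjectMeta {
	return metav1.ObjectMeta{
		GenerateName: podName + "-" + claimName + "-",
	}
}
```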
* Fix a job quota related deadlock
In case ResourceQuota is used and sets a max # of jobs, a CronJob may get
trapped in a deadlock:
1. Job quota for a namespace is reached.
2. CronJob controller can't create a new job, because quota is
reached.
3. Cleanup of jobs owned by the CronJob doesn't happen, because the
control loop iteration ends early due to the error when creating a
job.
To fix this, we stop quitting early from a control loop iteration when
CronJob reconciliation fails, and always let old jobs be cleaned up.
* Don't reorder imports
* Don't stop requeuing on reconciliation error
The previous code only logged the reconciliation error inside jm.sync() and
didn't return it to its invoker processNextWorkItem().
Add the copy-pasted code back to avoid this issue.
* Remove copy-pasted cleanupFinishedJobs()
Now we always call jm.cleanupFinishedJobs() first and then
jm.syncCronJob().
We also extract cronJobCopy and updateStatus outside the jm.syncCronJob
function and pass pointers to them in both jm.syncCronJob and
jm.cleanupFinishedJobs, to make the handling of delayed updates more
explicit and not dependent on the order in which cleanupFinishedJobs
and syncCronJob are invoked (see the sketch after these notes).
* Return updateStatus bool instead of changing the reference
* Explicitly ignore err in tests to fix linter
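A rough sketch of the resulting control flow; only the method names come from the notes above, the signatures are illustrative and the two function parameters stand in for jm.cleanupFinishedJobs and jm.syncCronJob:

```go
package cronjob

import batchv1 "k8s.io/api/batch/v1"

// syncOne sketches the fixed ordering: finished jobs are always cleaned up
// first, and a reconciliation error no longer skips that cleanup.
func syncOne(
	cronJob *batchv1.CronJob,
	jobs []*batchv1.Job,
	cleanupFinishedJobs func(*batchv1.CronJob, []*batchv1.Job) bool,
	syncCronJob func(*batchv1.CronJob, []*batchv1.Job) (bool, error),
) error {
	cronJobCopy := cronJob.DeepCopy()

	// Cleanup runs unconditionally, before reconciliation.
	updateStatus := cleanupFinishedJobs(cronJobCopy, jobs)

	// Reconcile next; the error is returned so the key gets requeued,
	// but it no longer prevents the cleanup above.
	syncUpdateStatus, syncErr := syncCronJob(cronJobCopy, jobs)
	updateStatus = updateStatus || syncUpdateStatus

	if updateStatus {
		// The delayed status update on cronJobCopy would be persisted here.
		_ = cronJobCopy
	}
	return syncErr
}
```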
PVC and containers shared the same ResourceRequirements struct to define their
API. When resource claims were added, that struct got extended, which
accidentally also changed the PVC API. To prevent such a mistake from
happening again, PVC now uses its own VolumeResourceRequirements struct.
The `Claims` field gets removed because the risk of breaking someone is
low: theoretically, YAML files which have a claims field for volumes now
get rejected when validating against the OpenAPI schema. Such files
have never made sense and should be fixed.
Code that uses the struct definitions needs to be updated.
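To my understanding, the split looks roughly like this (a simplified sketch; JSON tags and field documentation omitted):

```go
package example

import corev1 "k8s.io/api/core/v1"

// Simplified sketch of the two structs after the split.
// Containers keep Claims; PVCs now use a struct without it.
type ResourceRequirements struct {
	Limits   corev1.ResourceList
	Requests corev1.ResourceList
	Claims   []corev1.ResourceClaim
}

type VolumeResourceRequirements struct {
	Limits   corev1.ResourceList
	Requests corev1.ResourceList
}
```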