When the resource claim name inside the pod had some suffix like "1a" in
"resource-1a", the generated name suffix got added directly after that, leading
to "my-pod-resource-1ax6zgt".
Adding another hyphen makes the result more readable: "my-pod-resource-1a-x6zgt".
* Fix a job quota related deadlock
In case ResourceQuota is used and sets a max # of jobs, a CronJob may get
trapped in a deadlock:
1. Job quota for a namespace is reached.
2. CronJob controller can't create a new job, because quota is
reached.
3. Cleanup of jobs owned by a cronjob doesn't happen, because a
control loop iteration is finished because of an error to create a
job.
To fix this we stop early quitting from a control loop iteration when
cronjob reconciliation failed and always let old jobs to be cleaned up.
* Dont reorder imports
* Don't stop requeuing on reconciliation error
Previous code only logged the reconciliation error inside jm.sync() and
didn't return the reconciliation error to it's invoker
processNextWorkItem().
Adding a copy-paste back to avoid this issue.
* Remove copy-pasted cleanupFinishedJobs()
Now we always call jm.cleanupFinishedJobs() first and then
jm.syncCronJob().
We also extract cronJobCopy and updateStatus outside jm.syncCronJob
function and pass pointers to them in both jm.syncCronJob and
jm.cleanupFinishedJobs to make delayed updates handling more explicit
and not dependent on the order in which cleanupFinishedJobs and
syncCronJob are invoked.
* Return updateStatus bool instead of changing the reference
* Explicitly ignore err in tests to fix linter
PVC and containers shared the same ResourceRequirements struct to define their
API. When resource claims were added, that struct got extended, which
accidentally also changed the PVC API. To avoid such a mistake from happening
again, PVC now uses its own VolumeResourceRequirements struct.
The `Claims` field gets removed because risk of breaking someone is low:
theoretically, YAML files which have a claims field for volumes now
get rejected when validating against the OpenAPI. Such files
have never made sense and should be fixed.
Code that uses the struct definitions needs to be updated.
* [API REVIEW] ValidatingAdmissionPolicyStatucController config.
worker count.
* ValidatingAdmissionPolicyStatus controller.
* remove CEL typechecking from API server.
* fix initializer tests.
* remove type checking integration tests
from API server integration tests.
* validatingadmissionpolicy-status options.
* grant access to VAP controller.
* add defaulting unit test.
* generated: ./hack/update-codegen.sh
* add OWNERS for VAP status controller.
* type checking test case.
When someone decides that a Pod should definitely run on a specific node, they
can create the Pod with spec.nodeName already set. Some custom scheduler might
do that. Then kubelet starts to check the pod and (if DRA is enabled) will
refuse to run it, either because the claims are still waiting for the first
consumer or the pod wasn't added to reservedFor. Both are things the scheduler
normally does.
Also, if a pod got scheduled while the DRA feature was off in the
kube-scheduler, a pod can reach the same state.
The resource claim controller can handle these two cases by taking over for the
kube-scheduler when nodeName is set. Triggering an allocation is simpler than
in the scheduler because all it takes is creating the right
PodSchedulingContext with spec.selectedNode set. There's no need to list nodes
because that choice was already made, permanently. Adding the pod to
reservedFor also isn't hard.
What's currently missing is triggering de-allocation of claims to re-allocate
them for the desired node. This is not important for claims that get created
for the pod from a template and then only get used once, but it might be
worthwhile to add de-allocation in the future.
The allocation mode is relevant when clearing the reservedFor: for delayed
allocation, deallocation gets requested, for immediate allocation not. Both
should get tested.
All pre-defined claims now use delayed allocation, just as they would if
created normally.
Enabling logging is useful to track what the code is doing.
There are some functional changes:
- The pod handler checks for existence of claims. This
avoids adding pods to the work queue in more cases
when nothing needs to be done, at the cost of
making the event handlers a bit slower. This will become
more important when adding more work to the controller
- The handler for deleted ResourceClaim did not check for
cache.DeletedFinalStateUnknown.
We don't need to remember that a pod got deleted when it had no resource claims
because the code which checks the cached UIDs only checks for pods which have
resource claims.
This addresses the following bad sequence of events:
- controller creates ResourceClaim
- updating pod status fails
- pod gets retried before the informer receives
the created ResourceClaim
- another ResourceClaim gets created
Storing the generated ResourceClaim in a MutationCache ensures that the
controller knows about it during the retry.
A positive side effect is that ResourceClaims now get index by pod owner and
thus iterating over existing ones becomes a bit more efficient.
Generating the name avoids all potential name collisions. It's not clear how
much of a problem that was because users can avoid them and the deterministic
names for generic ephemeral volumes have not led to reports from users. But
using generated names is not too hard either.
What makes it relatively easy is that the new pod.status.resourceClaimStatus
map stores the generated name for kubelet and node authorizer, i.e. the
information in the pod is sufficient to determine the name of the
ResourceClaim.
The resource claim controller becomes a bit more complex and now needs
permission to modify the pod status. The new failure scenario of "ResourceClaim
created, updating pod status fails" is handled with the help of a new special
"resource.kubernetes.io/pod-claim-name" annotation that together with the owner
reference identifies exactly for what a ResourceClaim was generated, so
updating the pod status can be retried for existing ResourceClaims.
The transition from deterministic names is handled with a special case for that
recovery code path: a ResourceClaim with no annotation and a name that follows
the Kubernetes <= 1.27 naming pattern is assumed to be generated for that pod
claim and gets added to the pod status.
There's no immediate need for it, but just in case that it may become relevant,
the name of the generated ResourceClaim may also be left unset to record that
no claim was needed. Components processing such a pod can skip whatever they
normally would do for the claim. To ensure that they do and also cover other
cases properly ("no known field is set", "must check ownership"),
resourceclaim.Name gets extended.
When a pod is done, but not getting removed yet for while, then a claim that
got generated for that pod can be deleted already. This then also triggers
deallocation.
This releases the underlying resource sooner and ensures that another consumer
can get scheduled without being influenced by a decision that was made for the
previous consumer.
An alternative would have been to have the apiserver trigger the deallocation
whenever it sees the `status.reservedFor` getting reduced to zero. But that
then also triggers deallocation when kube-scheduler removes the last
reservation after a failed scheduling cycle. In that case we want to keep the
claim allocated and let the kube-scheduler decide on a case-by-case basis which
claim should get deallocated.
* Skip terminal Pods with a deletion timestamp from the Daemonset sync
Change-Id: I64a347a87c02ee2bd48be10e6fff380c8c81f742
* Review comments and fix integration test
Change-Id: I3eb5ec62bce8b4b150726a1e9b2b517c4e993713
* Include deleted terminal pods in history
Change-Id: I8b921157e6be1c809dd59f8035ec259ea4d96301
Before this change we've assumed a constant time between schedule runs,
which is not true for cases like "30 6-16/4 * * 1-5".
The fix is to calculate the potential next run using the fixed schedule
as the baseline, and then go back one schedule back and allow the cron
library to calculate the correct time.
This approach saves us from iterating multiple times between last
schedule time and now, if the cronjob for any reason wasn't running for
significant amount of time.
The GetDefaultClass() was fixed in scope of this issue:
https://github.com/kubernetes/kubernetes/issues/110514
Before this change assignDefaultStorageClass() was ignoring errors from
this function since it could mean there are multiple defaults - assign
could safely continue and do nothing.
This is no longer true because we always choose one from multiple
defaults - any errors returned from GetDefaultClass() are real errors
and should not be ignored.
Since the feature is GA and locked to true, tests can no longer set it
to false. Cleaning up by removing all references to this feature gate
from tests.
Feature gate will be removed in v1.29.
Reduce the visibility of various public identifiers that are only used
within the scope of a package.
This was originally motivated by KEP-3685 in order to reduce the public
API surface and improve supportability.
This touches cases where FromInt() is used on numeric constants, or
values which are already int32s, or int variables which are defined
close by and can be changed to int32s with little impact.
Signed-off-by: Stephen Kitt <skitt@redhat.com>
This touches cases where FromInt() is used on numeric constants, or
values which are already int32s, or int variables which are defined
close by and can be changed to int32s with little impact.
Signed-off-by: Stephen Kitt <skitt@redhat.com>
The member variable `cpuRatiosByZone` should be accessed with the lock
acquired as it could be be updated by `SetNodes` concurrently.
Signed-off-by: Quan Tian <qtian@vmware.com>
Co-authored-by: Antonio Ojea <aojea@google.com>
The topology.kubernetes.io/zone label may be added by could provider
asynchronously after the Node is created. The previous code didn't
update the topology cache after receiving the Node update event, causing
TopologyAwareHint to not work until kube-controller-manager restarts or
other Node events trigger the update.
Signed-off-by: Quan Tian <qtian@vmware.com>