The allocation mode is relevant when clearing the reservedFor: for delayed
allocation, deallocation gets requested, for immediate allocation not. Both
should get tested.
All pre-defined claims now use delayed allocation, just as they would if
created normally.
Enabling logging is useful to track what the code is doing.
There are some functional changes:
- The pod handler checks for existence of claims. This
avoids adding pods to the work queue in more cases
when nothing needs to be done, at the cost of
making the event handlers a bit slower. This will become
more important when adding more work to the controller
- The handler for deleted ResourceClaim did not check for
cache.DeletedFinalStateUnknown.
We don't need to remember that a pod got deleted when it had no resource claims
because the code which checks the cached UIDs only checks for pods which have
resource claims.
This addresses the following bad sequence of events:
- controller creates ResourceClaim
- updating pod status fails
- pod gets retried before the informer receives
the created ResourceClaim
- another ResourceClaim gets created
Storing the generated ResourceClaim in a MutationCache ensures that the
controller knows about it during the retry.
A positive side effect is that ResourceClaims now get index by pod owner and
thus iterating over existing ones becomes a bit more efficient.
Generating the name avoids all potential name collisions. It's not clear how
much of a problem that was because users can avoid them and the deterministic
names for generic ephemeral volumes have not led to reports from users. But
using generated names is not too hard either.
What makes it relatively easy is that the new pod.status.resourceClaimStatus
map stores the generated name for kubelet and node authorizer, i.e. the
information in the pod is sufficient to determine the name of the
ResourceClaim.
The resource claim controller becomes a bit more complex and now needs
permission to modify the pod status. The new failure scenario of "ResourceClaim
created, updating pod status fails" is handled with the help of a new special
"resource.kubernetes.io/pod-claim-name" annotation that together with the owner
reference identifies exactly for what a ResourceClaim was generated, so
updating the pod status can be retried for existing ResourceClaims.
The transition from deterministic names is handled with a special case for that
recovery code path: a ResourceClaim with no annotation and a name that follows
the Kubernetes <= 1.27 naming pattern is assumed to be generated for that pod
claim and gets added to the pod status.
There's no immediate need for it, but just in case that it may become relevant,
the name of the generated ResourceClaim may also be left unset to record that
no claim was needed. Components processing such a pod can skip whatever they
normally would do for the claim. To ensure that they do and also cover other
cases properly ("no known field is set", "must check ownership"),
resourceclaim.Name gets extended.
When a pod is done, but not getting removed yet for while, then a claim that
got generated for that pod can be deleted already. This then also triggers
deallocation.
This releases the underlying resource sooner and ensures that another consumer
can get scheduled without being influenced by a decision that was made for the
previous consumer.
An alternative would have been to have the apiserver trigger the deallocation
whenever it sees the `status.reservedFor` getting reduced to zero. But that
then also triggers deallocation when kube-scheduler removes the last
reservation after a failed scheduling cycle. In that case we want to keep the
claim allocated and let the kube-scheduler decide on a case-by-case basis which
claim should get deallocated.
* Skip terminal Pods with a deletion timestamp from the Daemonset sync
Change-Id: I64a347a87c02ee2bd48be10e6fff380c8c81f742
* Review comments and fix integration test
Change-Id: I3eb5ec62bce8b4b150726a1e9b2b517c4e993713
* Include deleted terminal pods in history
Change-Id: I8b921157e6be1c809dd59f8035ec259ea4d96301
Before this change we've assumed a constant time between schedule runs,
which is not true for cases like "30 6-16/4 * * 1-5".
The fix is to calculate the potential next run using the fixed schedule
as the baseline, and then go back one schedule back and allow the cron
library to calculate the correct time.
This approach saves us from iterating multiple times between last
schedule time and now, if the cronjob for any reason wasn't running for
significant amount of time.