This change leverages the new PreFilterResult to reduce the list of
eligible nodes for a pod using bound local PVs during the PreFilter
stage, so that only the node(s) whose local PV node affinity matches
will be considered in subsequent scheduling stages.
Today, the NodeAffinity check is done during Filter, which means all
nodes are considered even though a large number of them may be
ineligible because they do not match the pod's bound local PV(s)'
node affinity requirement. Reducing the node list in PreFilter
ensures that Filter only considers the reduced list, and thus we can
provide a clearer message to users when no node is available for
scheduling, since the list only contains relevant nodes.
If an error is encountered (e.g. a PV cache read error) or if the
node list cannot be reduced (e.g. the pod uses no local PVs), then we
still proceed to consider all nodes in the remaining scheduling
stages.
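A minimal sketch of this PreFilter logic, assuming a hypothetical
helper getBoundLocalPVNodeNames that resolves the pod's bound local
PVs and returns the set of node names (a sets.String) selected by
their node affinity; PreFilterResult and the plugin signature come
from the scheduler framework package:

    // Sketch only; not the actual implementation.
    func (pl *VolumeBinding) PreFilter(ctx context.Context,
        state *framework.CycleState, pod *v1.Pod) (*framework.PreFilterResult, *framework.Status) {
        nodeNames, err := getBoundLocalPVNodeNames(pod) // hypothetical helper
        if err != nil || nodeNames == nil {
            // On a PV cache read error, or when the pod uses no local
            // PVs, fall back to all nodes: a nil PreFilterResult means
            // "no restriction on eligible nodes".
            return nil, nil
        }
        // Subsequent stages (Filter etc.) only see the listed nodes.
        return &framework.PreFilterResult{NodeNames: nodeNames}, nil
    }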
Signed-off-by: Yibo Zhuang <yibzhuang@gmail.com>
The feature gate gets locked to "true", with the goal of removing it
in two releases.
All code can now assume that the feature is enabled. Tests for
"feature disabled" are no longer needed and get removed.
Some code wasn't using the new helper functions yet. That gets changed while
touching those lines.
The name concatenation and ownership check were originally considered
small enough not to warrant dedicated functions, but the intent of
the code is clearer with them (see the sketch below).
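As a rough sketch, and with details simplified, the two helpers boil
down to the following; the real versions live in
k8s.io/component-helpers/storage/ephemeral:

    // VolumeClaimName returns the name of the PVC backing a generic
    // ephemeral volume: "<pod name>-<volume name>".
    func VolumeClaimName(pod *v1.Pod, volume *v1.Volume) string {
        return pod.Name + "-" + volume.Name
    }

    // VolumeIsForPod checks that the PVC is owned by the pod. Names
    // can collide (an unrelated PVC may happen to have the right
    // name), so a PVC with the wrong owner must be rejected rather
    // than used for the pod.
    func VolumeIsForPod(pod *v1.Pod, pvc *v1.PersistentVolumeClaim) error {
        for _, ref := range pvc.OwnerReferences {
            if ref.Controller != nil && *ref.Controller &&
                ref.Kind == "Pod" && ref.Name == pod.Name && ref.UID == pod.UID {
                return nil
            }
        }
        return fmt.Errorf("PVC %s/%s was not created for pod %s/%s",
            pvc.Namespace, pvc.Name, pod.Namespace, pod.Name)
    }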
UnschedulableAndUnresolvable
This change adds an additional check in the volumebinding scheduler
plugin to handle PVCs with phase ClaimLost, which allows the
scheduler to return UnschedulableAndUnresolvable during the PreFilter
stage and skip the rest of the node evaluation, since the PVC is
bound to a PV that does not exist.
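The added check is essentially the following sketch; the surrounding
code and exact wiring may differ:

    // Inside the PreFilter PVC checks: a PVC in phase ClaimLost is
    // bound to a PV that no longer exists, so no node can ever work.
    if pvc.Status.Phase == v1.ClaimLost {
        return framework.NewStatus(framework.UnschedulableAndUnresolvable,
            fmt.Sprintf("persistentvolumeclaim %q bound to non-existent persistentvolume %q",
                pvc.Name, pvc.Spec.VolumeName))
    }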
Without this change, the FailedScheduling error message would look like:
0/10 nodes are available: 2 node(s) had taint {node/test: true},
that the pod didn't tolerate, 6 node(s) had taint {node/unhealthy: true},
that the pod didn't tolerate, 2 pvc(s) bound to non-existent pv(s)
This message shows that the scheduler still evaluated every single
node to determine that the pod cannot be scheduled because the PVC is
bound to a non-existent PV.
With this change, the FailedScheduling error message would look like:
0/10 nodes are available: 1 persistentvolumeclaim "foo" bound
to non-existent persistentvolume "bar"
Signed-off-by: Yibo Zhuang <yibzhuang@gmail.com>
These events are currently emitted for a pod using a generic ephemeral volume:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3s default-scheduler 0/1 nodes are available: 1 persistentvolumeclaim "my-csi-app-inline-volume-my-csi-volume" not found.
Warning FailedScheduling 2s default-scheduler 0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
The one about "persistentvolumeclaim not found" is potentially confusing. It
occurs because the scheduler typically checks the pod before the ephemeral
volume controller had a chance to create the PVC.
This is a bit easier to understand:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 4s default-scheduler 0/1 nodes are available: 1 waiting for ephemeral volume controller to create the persistentvolumeclaim "my-csi-app-inline-volume-my-csi-volume".
Warning FailedScheduling 2s default-scheduler 0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
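A sketch of the special case behind the new message; the helper and
variable names here are illustrative:

    claimName := ephemeral.VolumeClaimName(pod, &vol) // "<pod>-<volume>"
    pvc, err := pvcLister.PersistentVolumeClaims(pod.Namespace).Get(claimName)
    if apierrors.IsNotFound(err) {
        // The PVC of a generic ephemeral volume is created
        // asynchronously, so "not found" usually just means the
        // ephemeral volume controller has not caught up yet; report
        // that instead of a missing PVC.
        return fmt.Errorf("waiting for ephemeral volume controller to create the persistentvolumeclaim %q", claimName)
    }
    if err != nil {
        return err
    }
    // ...continue with the usual ownership and binding checks on pvc.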
All dependencies of the VolumeBinding plugin were moved from the
"k8s.io/kubernetes/pkg/controller/volume/scheduling" package to the
"k8s.io/kubernetes/pkg/scheduler/framework/plugins/volumebinding"
package:
- whole file pkg/controller/volume/scheduling/scheduler_assume_cache.go
- whole file pkg/controller/volume/scheduling/scheduler_assume_cache_test.go
- whole file pkg/controller/volume/scheduling/scheduler_binder.go
- whole file pkg/controller/volume/scheduling/scheduler_binder_fake.go
- whole file pkg/controller/volume/scheduling/scheduler_binder_test.go
Package "k8s.io/kubernetes/pkg/controller/volume/scheduling/metrics" moved
to "k8s.io/kubernetes/pkg/scheduler/framework/plugins/volumebinding/metrics"
because it only used in VolumeBinding plugin and (e2e) tests.
More described in issue #89930 and PR #102953.
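For code that referenced the moved files, only the import path
changes, along these lines:

    import (
        // old: "k8s.io/kubernetes/pkg/controller/volume/scheduling"
        "k8s.io/kubernetes/pkg/scheduler/framework/plugins/volumebinding"
    )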
Signed-off-by: Konstantin Misyutin <konstantin.misyutin@huawei.com>
This PR also loosens the check to allow zero, since the API doc
explains that a value of zero indicates no waiting.
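Assuming the check in question is the BindTimeoutSeconds validation
of the VolumeBinding plugin args (inferred from the "zero indicates
no waiting" wording of its API doc), the change is roughly:

    // Before: zero was rejected along with negative values,
    // e.g. if args.BindTimeoutSeconds <= 0 { ... }
    // After: only negative values are invalid; zero means "no waiting".
    if args.BindTimeoutSeconds < 0 {
        return fmt.Errorf("invalid BindTimeoutSeconds: %d, must be non-negative",
            args.BindTimeoutSeconds)
    }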
Signed-off-by: Dave Chen <dave.chen@arm.com>
The implementation consists of
- identifying all places where VolumeSource.PersistentVolumeClaim has
a special meaning and then ensuring that the same code path is taken
for an ephemeral volume, with the ownership check
- adding a controller that produces the PVCs for each embedded
  VolumeSource.EphemeralVolume (see the sketch after this list)
- relaxing the PVC protection controller so that it removes
  the finalizer before the pod is deleted (only if the
  GenericEphemeralVolume feature is enabled): this is needed to
  break a cycle where foreground deletion of the pod blocks on
  removing the PVC, which in turn waits for deletion of the pod
The controller was derived from the endpointslices controller.
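A sketch of what the controller creates per ephemeral volume,
assuming the released field name vol.Ephemeral for the embedded
volume source; the controller owner reference is what the ownership
check later verifies:

    pvc := &v1.PersistentVolumeClaim{
        ObjectMeta: metav1.ObjectMeta{
            Name:      pod.Name + "-" + vol.Name,
            Namespace: pod.Namespace,
            OwnerReferences: []metav1.OwnerReference{
                *metav1.NewControllerRef(pod, v1.SchemeGroupVersion.WithKind("Pod")),
            },
        },
        Spec: vol.Ephemeral.VolumeClaimTemplate.Spec,
    }
    _, err := kubeClient.CoreV1().PersistentVolumeClaims(pod.Namespace).
        Create(ctx, pvc, metav1.CreateOptions{})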
This uses the information provided by a CSI driver deployment to
check whether a node has access to enough storage to create the
currently unbound volumes, if the CSI driver opts into that checking
with CSIDriver.Spec.VolumeCapacity != false.
This resolves a TODO from commit 95b530366a.
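A simplified sketch of the per-node capacity check; the real code
matches CSIStorageCapacity objects against the node's topology
segment, which this version compresses into a label selector match
(shown with the storagev1 API group):

    // sufficientCapacity reports whether some CSIStorageCapacity
    // object covering the node has room for the requested size.
    func sufficientCapacity(capacities []*storagev1.CSIStorageCapacity,
        node *v1.Node, sizeNeeded resource.Quantity) bool {
        for _, c := range capacities {
            if c.NodeTopology == nil || c.Capacity == nil {
                continue
            }
            selector, err := metav1.LabelSelectorAsSelector(c.NodeTopology)
            if err != nil {
                continue // ignore malformed objects
            }
            if selector.Matches(labels.Set(node.Labels)) &&
                c.Capacity.Cmp(sizeNeeded) >= 0 {
                return true
            }
        }
        return false
    }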
Previously, separate interfaces were defined for Reserve and Unreserve
plugins. However, in nearly all cases, a plugin that allocates a
resource using Reserve will likely want to register itself for Unreserve
as well in order to free the allocated resource at the end of a failed
scheduling/binding cycle. Having separate plugins for Reserve and
Unreserve also adds unnecessary config toil. To that end, this patch
aims to merge the two plugins into a single interface called a
ReservePlugin that requires implementing both the Reserve and Unreserve
methods.
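The resulting interface is along these lines, shown here with
package-qualified names from the scheduler framework:

    type ReservePlugin interface {
        framework.Plugin
        // Reserve is called when the scheduler cache is updated. If it
        // returns a failed status, Unreserve is called for all enabled
        // ReservePlugins to roll back any state.
        Reserve(ctx context.Context, state *framework.CycleState,
            p *v1.Pod, nodeName string) *framework.Status
        // Unreserve rolls back a reservation, e.g. when a reserved pod
        // is rejected in a later phase. Implementations must be
        // idempotent.
        Unreserve(ctx context.Context, state *framework.CycleState,
            p *v1.Pod, nodeName string)
    }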