Previously, the situation was ignored, which might have had the effect that Pod
scheduling continued (?) even though the Pod+PVC weren't known to be in an
acceptable state.
When adding the ephemeral volume feature, the special case for
PersistentVolumeClaim volume sources in kubelet's host path and node
limits checks was overlooked. An ephemeral volume source is another
way of referencing a claim and has to be treated the same way.
UnschedulableAndUnresolvable
This change adds an additional check in the volumebinding scheduler
plugin to handle PVC with phase ClaimLost which will allow the
scheduler to return UnschedulableAndUnresolvable during the PreFilter
stage and skip the rest of the node evaluation since the PVC is
bound to a PV that does not exist.
Without this change, the FailedScheduling error message would look like:
0/10 nodes are available: 2 node(s) had taint {node/test: true},
that the pod didn't tolerate, 6 node(s) had taint {node/unhealthy: true},
that the pod didn't tolerate, 2 pvc(s) bound to non-existent pv(s)
Which is still evaluating every single node to determine that the pod
cannot be scheduled because the PVC is bound to a non-existent PV
With this change, the FailedScheduling error message would look like:
0/10 nodes are available: 1 persistentvolumeclaim "foo" bound
to non-existent persistentvolume "bar"
Signed-off By: Yibo Zhuang <yibzhuang@gmail.com>
This change will make the message more clear when there
is a case of PVC(s) bound to PV(s) that no longer exists
and scheduler does not select the node due to this issue.
Previous error message would look like:
0/2 nodes are available: 2 pvc(s) bound to non-existent pv(s)
Updated message looks like:
0/2 nodes are available: 2 node(s) unavailable due to one or more
pvc(s) bound to non-existent pv(s)
For larger clusters with many different reasons of nodes that
are not available, the current message can be very misleading for
users to think that there are many PVCs lost due to PVs deleted but
in fact it could be just a single PVC case but many nodes not selected
by the scheduler due to this case.
Signed-off By: Yibo Zhuang <yibzhuang@gmail.com>
Checking for all topology labels is not backwards compatible. Clusters were nodes don't have zone labels effectively have default spreading disabled.
Change only applies to system defaults.
Some plugins expect the new feature gate struct. We can inject that additional
parameter via a helper function instead of having to repeat the same anonymous
function for each plugin.
These events are currently emitted for a pod using a generic ephemeral volume:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3s default-scheduler 0/1 nodes are available: 1 persistentvolumeclaim "my-csi-app-inline-volume-my-csi-volume" not found.
Warning FailedScheduling 2s default-scheduler 0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
The one about "persistentvolumeclaim not found" is potentially confusing. It
occurs because the scheduler typically checks the pod before the ephemeral
volume controller had a chance to create the PVC.
This is a bit easier to understand:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 4s default-scheduler 0/1 nodes are available: 1 waiting for ephemeral volume controller to create the persistentvolumeclaim "my-csi-app-inline-volume-my-csi-volume".
Warning FailedScheduling 2s default-scheduler 0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
As part of https://github.com/kubernetes/kubernetes/pull/100720 we
backported fix on existing releases and in this commit we completely
remove the deprecated metric from master branch.
Signed-off-by: dntosas <ntosas@gmail.com>
This metrics is measured in seconds so it makes no sense starting from
1000 as init value. This breaks also the scheduler e2e metric thus make
users unable to compute, for example, their SLO for the scheduler.
Even if this metric is deprecated, it should behave correctly until it is
completely removed to avoid user confusion.
For example, for each volume created, the minimum value exposed
as a metric is 16.6min (1000sec/60) which is obviously wrong as logic.
In this commit, we migrate bucket creation to start from reasonable
numbers, copying the incrementation from the conventions that the
scheduler follows itself.
Signed-off-by: dntosas <ntosas@gmail.com>
All dependencies of VolumeBinding plugin from
"k8s.io/kubernetes/pkg/controller/volume/scheduling" package moved to
"k8s.io/kubernetes/pkg/scheduler/framework/plugins/volumebinding" package:
- whole file pkg/controller/volume/scheduling/scheduler_assume_cache.go
- whole file pkg/controller/volume/scheduling/scheduler_assume_cache_test.go
- whole file pkg/controller/volume/scheduling/scheduler_binder.go
- whole file pkg/controller/volume/scheduling/scheduler_binder_fake.go
- whole file pkg/controller/volume/scheduling/scheduler_binder_test.go
Package "k8s.io/kubernetes/pkg/controller/volume/scheduling/metrics" moved
to "k8s.io/kubernetes/pkg/scheduler/framework/plugins/volumebinding/metrics"
because it only used in VolumeBinding plugin and (e2e) tests.
More described in issue #89930 and PR #102953.
Signed-off-by: Konstantin Misyutin <konstantin.misyutin@huawei.com>
Check the PVC ref count on the node info cache to determine if a pod's
PVCs are in use. If they are and it is using ReadWriteOncePod, fail the
request.
kubernetes#60525 introduced
Balanced attached node volumes feature gate to include volume
count for prioritizing nodes. The reason for introducing this
flag was its usefulness in Red Hat OpenShift Online environment
which is not being used any more. So, removing the flag
as it helps in maintainability of the scheduler code base
as mentioned at kubernetes#101489 (comment)
Given that we give a default CPU/memory requests for containers that don't provide any, the calculated usage can exceed the allocatable.
Change-Id: I72e249652acacfbe8cea0dd6f895dabe43ff6376
Both `ErrReasonAffinityRulesNotMatch` and `ErrReasonAntiAffinityRulesNotMatch` are
more precise than `ErrReasonAffinityNotMatch`.
Signed-off-by: Dave Chen <dave.chen@arm.com>