Commit Graph

1547 Commits

Author SHA1 Message Date
Kubernetes Prow Robot
90615231a6 Merge pull request #125097 from YamasouA/ft/queuehit-csinode
volumebinding: scheduler queueing hints - CSINode
2024-07-09 17:53:05 -07:00
Kubernetes Prow Robot
4a214f6ad9 Merge pull request #125461 from mimowo/pod-disruption-conditions-ga
Graduate PodDisruptionConditions to stable
2024-07-09 11:08:13 -07:00
Kubernetes Prow Robot
b6899c5e08 Merge pull request #122251 from olderTaoist/unschedulable-plugin
register unschedulable plugin  for those plugins that PreFilter's PreFilterResult filter out some nodes
2024-07-05 05:44:26 -07:00
olderTaoist
b478621596 register unscheduable plugin when prefileter with NodeNames 2024-07-02 13:02:45 +08:00
Kubernetes Prow Robot
ac9aec9f9b Merge pull request #125116 from pohly/dra-one-of-source
DRA: remove "source" indirection from v1 Pod API
2024-06-28 12:46:45 -07:00
Michal Wozniak
780191bea6 review remarks for graduating PodDisruptionConditions 2024-06-28 17:32:27 +02:00
Michal Wozniak
bf0c9885a4 Graduate PodDisruptionConditions to stable 2024-06-28 16:36:51 +02:00
Kubernetes Prow Robot
eb66365bc4 Merge pull request #124931 from pohly/dra-scheduler-prebind-fix
DRA: fix scheduler/resource claim controller race
2024-06-28 05:57:24 -07:00
Patrick Ohly
bde9b64cdf DRA: remove "source" indirection from v1 Pod API
This makes the API nicer:

    resourceClaims:
    - name: with-template
      resourceClaimTemplateName: test-inline-claim-template
    - name: with-claim
      resourceClaimName: test-shared-claim

Previously, this was:

    resourceClaims:
    - name: with-template
      source:
        resourceClaimTemplateName: test-inline-claim-template
    - name: with-claim
      source:
        resourceClaimName: test-shared-claim

A more long-term benefit is that other, future alternatives
might not make sense under the "source" umbrella.

This is a breaking change. It's justified because DRA is still
alpha and will have several other API breaks in 1.31.
2024-06-27 17:53:24 +02:00
Patrick Ohly
4bddebc48e DRA: fix scheduler/resource claim controller race with retry
The JSON patch approach works, but it is complex. A retry loop is easier to
understand (detect conflict, get new claim, try again). There is one additional
API call (the get), but in practice this scenario is unlikely.
2024-06-27 15:03:56 +02:00
Patrick Ohly
ecbafb8de5 DRA: fix scheduler/resource claim controller race
There was a race caused by having to update claim finalizer and status in two
different operations:
- Resource claim controller removes allocation, does not yet
  get to remove the finalizer.
- Scheduler prepares an allocation, without adding the finalizer
  because it's there.
- Controller removes finalizer.
- Scheduler adds allocation.

This is an invalid state. Automatic checking found this during the execution of
the "with translated parameters on single node.*supports sharing a claim
sequentially" E2E test, but only when run stand-alone. When running in
parallel (as in the CI), the bad outcome of the race did not occur.

The fix is to check that the finalizer is still set when adding the
allocation. The apiserver doesn't check that because it doesn't know which
finalizer goes with the allocation result. It could check for "some finalizer",
but that is not guaranteed to be correct (could be some unrelated one).

Checking the finalizer can only be done with a JSON patch. Despite the
complications, having the ability to add multiple pods concurrently to
ReservedFor seems worth it (avoids expensive rescheduling or a local retry
loop).

The resource claim controller doesn't need this, it can do a normal update
which implicitly checks ResourceVersion.
2024-06-27 15:03:06 +02:00
googs1025
8ce056df84 add DefaultSelector method ut
Signed-off-by: googs1025 <googs1025@gmail.com>
2024-06-27 11:23:48 +08:00
Kubernetes Prow Robot
084d6c4968 Merge pull request #125699 from pohly/scheduler-framework-logging
scheduler: fix klog.KObjSlice when applied to []*NodeInfo
2024-06-26 01:50:23 -07:00
Patrick Ohly
719a49cc13 scheduler: fix klog.KObjSlice when applied to []*NodeInfo
The DRA plugin does that. It didn't actually work and only printed an error
message about NodeInfo not implementing klog.KMetata. That's not a compile-time
check due to limitations with Go generics and had been missed earlier.
2024-06-26 08:11:31 +02:00
Kubernetes Prow Robot
8c478a06d8 Merge pull request #124595 from pohly/dra-scheduler-assume-cache-eventhandlers
DRA: scheduler event handlers via assume cache
2024-06-25 11:56:28 -07:00
Patrick Ohly
1b63639d31 DRA scheduler: use assume cache to list claims
This finishes the transition to the assume cache as source of truth for the
current set of claims.

The tests have to be adapted. It's not enough anymore to directly put objects
into the informer store because that doesn't change the assume cache
content. Instead, normal Create/Update calls and waiting for the cache update
are needed.
2024-06-25 14:00:25 +02:00
Patrick Ohly
9a6f3b9388 scheduler: central ResourceClaim assume cache
This enables connecting the event handler for ResourceClaim to the assume
cache, which addresses a theoretic race condition.

It may also be useful for implementing the autoscaler support, because now
the autoscaler can modify the content of the cache.
2024-06-25 14:00:25 +02:00
Kubernetes Prow Robot
a008776ec9 Merge pull request #125279 from HirazawaUi/add-poddeleted-queueinghintfn
Add QueueingHintFn for pod events in VolumeRestriction plugin
2024-06-19 12:22:41 -07:00
Kubernetes Prow Robot
64355780d9 Merge pull request #125495 from pohly/dra-scheduler-fix-parameter-indexing
DRA: fix indexing of generated parameters
2024-06-18 04:10:38 -07:00
Kubernetes Prow Robot
ab8ad49b47 Merge pull request #125533 from kaisoz/sched-test-disruption-target-cond
scheduler: Test that the DisruptionTarget condition is added at preemption time
2024-06-18 01:14:28 -07:00
Tomas Tormo
8d7c113434 Test that the DisruptionTarget condition is added at preemption 2024-06-17 16:59:52 +00:00
HirazawaUi
f9693e0c0a Implement QueueingHintFn for pod deleted event 2024-06-17 22:42:04 +08:00
Patrick Ohly
e0fce54d02 DRA: fix indexing of generated parameters
The claim parameter key didn't include the namespace of the claim. In the case
where two namespaces used the exact same parameter reference, the "too many
generated parameters" case got triggered incorrectly and lookup could have
returned an object from the wrong namespace.

Found while running the E2E tests in parallel:

              message: 'running PreFilter plugin "DynamicResources": multiple generated claim
                parameters for ConfigMap. dra-8794/parameters-3 found: [dra-4729/parameters-4
                dra-7328/parameters-4 dra-8794/parameters-4 dra-3402/parameters-4 dra-6156/parameters-4
                dra-1839/parameters-4 dra-7434/parameters-4 dra-6504/parameters-4]'
2024-06-13 17:27:04 +02:00
Kubernetes Prow Robot
9c8c61aee4 Merge pull request #122234 from AxeZhan/podUpdateEvent
[Scheduler]Put pod into the correct queue during podUpdate
2024-06-12 12:28:17 -07:00
AxeZhan
d66f8f9413 schedulingQueue update pod by queueHint 2024-06-12 21:26:09 +08:00
Patrick Ohly
c339eafb76 scheduler: allow PreBind to return "Pending" and "Unschedulable"
Any error result from PreBind was treated as a pod scheduling failure. This was
overlooked when moving blocking API calls in the DRA plugin into a PreBind
implementation, leading to:

    E0604 15:45:50.980929  306340 schedule_one.go:1048] "Error scheduling pod; retrying" err="waiting for resource driver" pod="test/test-draqld28"

That's because DRA's PreBind does some updates in the apiserver, then returns
Pending to wait for the outcome.

The fix is to allow PreBind to return the same special status codes as other
extension points.
2024-06-06 15:28:08 +02:00
YamasouA
f409dedb5d Implement QHint for CSINode 2024-06-04 23:04:52 +09:00
AxeZhan
cf73c9d93c remove EvaluatedNodes field in Diagnosis struct 2024-06-04 14:20:55 +08:00
Kubernetes Prow Robot
cfe5a7d03a Merge pull request #125213 from carlory/fix-dra-flaky
fix dra flaky test on TestPlugin
2024-06-03 13:32:10 -07:00
Kubernetes Prow Robot
8bd36c60bd Merge pull request #125197 from gabesaba/prefilter_perf
[scheduler] absent key in NodeToStatusMap implies UnschedulableAndUnresolvable
2024-06-03 07:35:41 -07:00
Gabe
c8f0ea1a54 Don't fill in NodeToStatusMap with UnschedulableAndUnresolvable 2024-05-31 15:52:16 +00:00
carlory
2794baf4c0 fix dra flaky test on TestPlugin 2024-05-30 23:22:37 +08:00
Kubernetes Prow Robot
ee2c1ffa80 Merge pull request #124630 from carlory/fix-123731
DRA: scheduler: index claim and class parameters to simplify lookup
2024-05-29 14:38:14 -07:00
carlory
3072987fcc DRA: scheduler: index claim and class parameters to simplify lookup 2024-05-27 15:57:10 +08:00
NoicFank
31a4b13238 enhancement(scheduler): share waitingPods among profiles 2024-05-17 17:07:27 +08:00
Toru Komatsu
5722db7aa3 QueueingHint for CSILimit when deleting pods (#121508)
Signed-off-by: utam0k <k0ma@utam0k.jp>
2024-05-14 11:07:11 -07:00
Kubernetes Prow Robot
9d87fa215d Merge pull request #124735 from AxeZhan/evaluatedNodes
Change EvaluatedNodes to count Nodes that reach Filter phase only
2024-05-09 22:43:22 -07:00
AxeZhan
bcf1c55837 evaluated nodes only consider filter stage 2024-05-10 12:40:12 +08:00
carlory
c8e91b9bc2 CephRBD volume plugin ( ) and its csi migration support were removed in this release 2024-05-09 22:55:34 +08:00
Kubernetes Prow Robot
b27608875c Merge pull request #124287 from sanposhiho/tainttoleration
implement QueueingHint in TaintToleration
2024-05-01 00:06:16 -07:00
carlory
06d3cd33b2 use slices library instead 2024-04-29 16:50:53 +08:00
wackxu
a4bfaae8a4 implement QueueingHint in TaintToleration 2024-04-29 07:18:35 +00:00
Kubernetes Prow Robot
cffc2c0b40 Merge pull request #124102 from pohly/dra-scheduler-assume-cache
scheduler: move assume cache to utils
2024-04-26 08:49:12 -07:00
Patrick Ohly
7f54c5dfec scheduler: remove AssumeCache interface
There's no reason for having the interface because there is only one
implementation. Makes the implementation of the test functions a bit
simpler (no casting). They are still stand-alone functions instead of methods
because they should not be considered part of the "normal" API.
2024-04-25 11:46:58 +02:00
Patrick Ohly
26e0409c36 scheduler: move assume cache to utils, part 2
This is now used by both the volumebinding and dynamicresources plugin, so
promoting it to a common helper package is better.

In terms of functionality, nothing was changed. Documentation got
updated (warns about storing locally modified objects, clarifies what the Get
parameters are). Code coverage should be a bit better than before (tested with
and without indexer, exercises event handlers, more error paths).

Checking for specific errors can now be done via errors.Is.
2024-04-25 11:45:43 +02:00
Patrick Ohly
910b90fca3 scheduler: move assume cache to utils, part 1
This is a verbatim move resp. copy of the files. They don't build in their new
location yet.
2024-04-25 10:49:41 +02:00
Marek Siarkowicz
3ee8178768 Cleanup defer from SetFeatureGateDuringTest function call 2024-04-24 20:25:29 +02:00
Patrick Ohly
a66d2163f9 dra scheduler: fix data race in unit test
Clearing some irrelevant fields in objects caused a flaky data race alert
because in some cases, the objects were pointers into a shared cache. A better
solution is to treat the objects as read-only and ignore the irrelevant fields.
2024-04-19 17:14:13 +02:00
Kubernetes Prow Robot
846e282d05 Merge pull request #124055 from yangjunmyfm192085/optklogprint
Optimize klog output(Use klog.KObj(pod) instead of pod)
2024-04-18 02:11:47 -07:00
Kubernetes Prow Robot
d2ce87eb94 Merge pull request #123938 from pohly/dra-structured-parameters-tests
DRA: test for structured parameters
2024-04-18 02:10:08 -07:00