Commit Graph

291 Commits

Author SHA1 Message Date
Patrick Ohly
6b01ece580 scheduler-perf: fix perfdash display problem
perfdash expects all data items to have the same set of labels.  It then
renders drop-down buttons for each label with all values found for each
label. Previously, data items that didn't have a label didn't match any label
filter in perfdash and couldn't get selected because perfdash doesn't have
"unset" in it's drop-down menus.

To avoid that, scheduler-perf now collects all labels and then adds missing
labels with "not applicable" as value:

    {
      "data": {
        "Average": 939.7071223010004,
        "Perc50": 927.7987421383649,
        "Perc90": 2166.153846153846,
        "Perc95": 2363.076923076923,
        "Perc99": 2520.6153846153848
      },
      "unit": "ms",
      "labels": {
        "Metric": "scheduler_pod_scheduling_duration_seconds",
        "Name": "SchedulingBasic/5000Nodes/namespace-2",
        "extension_point": "not applicable",
        "result": "not applicable"
      }
    },
    ...
    {
      "data": {
        "Average": 1.1172570650000004,
        "Perc50": 1.1418367346938776,
        "Perc90": 1.5500000000000003,
        "Perc95": 1.6410256410256412,
        "Perc99": 3.7333333333333334
      },
      "unit": "ms",
      "labels": {
        "Metric": "scheduler_framework_extension_point_duration_seconds",
        "Name": "SchedulingBasic/5000Nodes/namespace-2",
        "extension_point": "Score",
        "result": "not applicable"
      }
    },
2023-07-03 21:16:53 +02:00
Patrick Ohly
29e5771aa4 scheduler-perf: shorten "Name" label in metrics
Because the JSON file gets written at the end of the top-level benchmark, all
data items had `BenchmarkPerfScheduling/` as prefix in the `Name` label. This
is redundant and makes it harder to see the actual name. Now that common prefix
gets removed.
2023-07-03 21:15:16 +02:00
Patrick Ohly
0d41d509d2 scheduler_perf: replace gomega.Eventually with wait.PollUntilContextTimeout
This is done for the sake of consistency. The failure message becomes less
useful.
2023-06-28 09:22:26 +02:00
Patrick Ohly
cecebe8ea2 scheduler_perf: add TestScheduling integration test
This runs workloads that are labeled as "integration-test". The apiserver and
scheduler are only started once per unique configuration, followed by each
workload using that configuration. This makes execution faster. In contrast to
benchmarking, we care less about starting with a clean slate for each test.
2023-06-28 09:22:25 +02:00
Patrick Ohly
dfd646e0a8 scheduler_perf: fix namespace deletion
Merely deleting the namespace is not enough:
- Workloads might rely on the garbage collector to get rid of obsolete objects,
  so we should run it to be on the safe side.
- Pods must be force-deleted because kubelet is not running.
- Finally, the namespace controller is needed to get rid of
  deleted namespaces.
2023-06-28 09:22:25 +02:00
Patrick Ohly
d9c16a1ced scheduler_perf: fix goroutine leak in runWorkload
This becomes relevant when doing more fine-grained leak checking.
2023-06-28 08:14:34 +02:00
Patrick Ohly
c91c578795 scheduler_perf: skip expensive cleanup during benchmarks
Each benchmark test case runs with a fresh etcd instance. Therefore it is not
necessary to delete objects after a run.

A future unit test might reuse etcd, therefore cleanup is optional.
2023-06-22 08:56:14 +02:00
Kubernetes Prow Robot
2057a48ee5
Merge pull request #114771 from sanposhiho/scheduling_perf_scheduler_scheduling_attempt_duration_seconds
feature(scheduler_perf): distinguish result in scheduler_scheduling_attempt_duration_seconds metric result
2023-06-07 06:18:13 -07:00
Kensei Nakada
a4ea058cc7 feature(scheduler_perf): distinguish result in scheduler_scheduling_attempt_duration_seconds metric result 2023-06-02 14:45:55 +00:00
Patrick Ohly
2db577a560 scheduler-perf: inject "benchmark" as name into JSON result filename
This is required because an empty name is no longer supported: the
perf-dashboard is run with --allow-parsers-matching-all-tests=false with causes
perfdash to skip current configuration for BenchmarkPerfResults as it does not
have name
set (4674704f45/perfdash/metrics-downloader.go (L165-L167)).

The perf-dash config needs to be updated accordingly.
2023-05-22 08:07:15 +02:00
kerthcet
0616d15712 Fix perf-test by increasing the error margin
Signed-off-by: kerthcet <kerthcet@gmail.com>
2023-05-17 12:14:06 +08:00
Kubernetes Prow Robot
f84ff3d052
Merge pull request #117813 from pohly/scheduler-perf-test-runtime
scheduler-perf: measure workload runtime and relabel workloads
2023-05-15 12:19:18 -07:00
Patrick Ohly
d85b91f343 scheduler-perf: measure workload runtime and relabel workloads
The goal is to only label workloads as "performance" which actually run long
enough to provide useful metrics. The throughput collector samples once per
second, so a workload should run at least 5, better 10 seconds to get at least
a minimal amount of samples for the percentile calculation.

For benchstat analysis of runs with sufficient repetitions to get statistically
meaningful results, each workload shouldn't run more than one minute, otherwise
before/after analysis becomes too slow.

The labels were chosen based on benchmark runs on a reasonably fast desktop. To
know how long each workload takes, a new "runtime_seconds" benchmark result
gets added.
2023-05-15 14:33:40 +02:00
Kubernetes Prow Robot
8b33eaa0a7
Merge pull request #116207 from pohly/dra-scheduler-perf
scheduler_perf: dynamic resource allocation test cases
2023-05-10 10:58:59 -07:00
Patrick Ohly
034528a9f0 scheduler perf: add DynamicResourceAllocation test cases
The default scheduler configuration must be based on the v1 API where the
plugin is enabled by default. Then if (and only if) the
DynamicResourceAllocation feature gate for a test is set, the corresponding
API group also gets enabled.

The normal dynamic resource claim controller is started if needed to create
ResourceClaims from ResourceClaimTemplates.

Without the upcoming optimizations in the scheduler, scheduling with dynamic
resources is fairly slow. The new test cases take around 15 minutes wall clock
time on my desktop.
2023-05-04 13:08:06 +02:00
Kante Yin
a7035f5459 Pass Context to StartTestServer
Signed-off-by: Kante Yin <kerthcet@gmail.com>
2023-05-04 10:25:09 +08:00
Kubernetes Prow Robot
fb93000eb5
Merge pull request #117468 from HirazawaUi/replace-test-deprecated-ioutil
Replace the deprecated ioutil methods in the test directory
2023-05-03 12:02:32 -07:00
Kubernetes Prow Robot
47f1bd9f80
Merge pull request #117649 from SataQiu/scheduler-remove-v1beta2-20230427
scheduler: remove deprecated v1beta2 KubeSchedulerConfiguration  component config
2023-05-03 09:54:41 -07:00
Kubernetes Prow Robot
aece6838e8
Merge pull request #117232 from pohly/scheduler-perf-code-cleanups
scheduler_perf: code cleanups
2023-05-03 09:54:13 -07:00
SataQiu
1f7c07f355 scheduler: remove deprecated v1beta2 KubeSchedulerConfiguration 2023-05-03 21:43:19 +08:00
Kubernetes Prow Robot
b4c6a70927
Merge pull request #117230 from pohly/scheduler-perf-throughput
scheduler_perf: update throughputCollector
2023-04-29 12:12:17 -07:00
Patrick Ohly
b3e0bc8864 scheduler_perf: let the test decide which informers are needed
This will change when adding dynamic resource allocation test cases. Instead of
changing mustSetupScheduler and StartScheduler for that, let's return the
informer factory and create informers as needed in the test.
2023-04-27 15:31:40 +02:00
Patrick Ohly
969d28b12b scheduler_perf: refactor common code 2023-04-27 15:31:37 +02:00
Kubernetes Prow Robot
dd62a53e1a
Merge pull request #117196 from pohly/scheduler-perf-labels
scheduler_perf: support test case selection via labels
2023-04-26 14:26:14 -07:00
Patrick Ohly
550d4c0074 scheduler_perf: support test case selection via labels
Entire test cases and workloads can have labels attached to them. The union of
these must match the label filter which works as in GitHub. The benchmark by
default runs the tests that are labeled "performance", which is the same as
before.
2023-04-26 21:01:31 +02:00
Patrick Ohly
78b8af9fed scheduler_perf: update throughputCollector
The previous solution had some shortcomings:

- It was based on the assumption that the goroutine gets woken up at regular
  intervals. This is not actually guaranteed. Now the code keeps track of the
  actual start and end of an interval and verifies that assumption.

- If no pod was scheduled (unlikely, but could happen), then
  "0 pods/s" got recorded. In such a case, the metric was always either
  zero or >= 1. A better solution is to extend the interval
  until some pod gets scheduled. With the larger time interval
  it is then possible to also track, for example, 0.5 pods/s.
2023-04-26 08:11:50 +02:00
HirazawaUi
a8b808ee6c Replace the deprecated ioutil methods in the test directory 2023-04-18 21:51:10 +08:00
Kubernetes Prow Robot
aa026a6b30
Merge pull request #117202 from pohly/scheduler-perf-zero-count
scheduler perf: allow creating 0 items
2023-04-11 21:18:20 -07:00
Kubernetes Prow Robot
251d4a00e6
Merge pull request #117201 from pohly/scheduler-perf-node-labels
test/integration: create nodes directly with kubernetes.io/hostname label
2023-04-11 21:18:12 -07:00
Kubernetes Prow Robot
69b59b9d42
Merge pull request #117199 from pohly/scheduler-perf-race-fix
scheduler_perf: fix race condition
2023-04-11 21:18:05 -07:00
Kubernetes Prow Robot
2bfaaf21c1
Merge pull request #117197 from pohly/scheduler-perf-cleanup
scheduler perf: remove cleanup func
2023-04-11 21:17:57 -07:00
Patrick Ohly
464edfe6f6 test/integration: create nodes directly with kubernetes.io/hostname label
By generating the unique name in advance, the label also can be set to a
matching value directly in the Create request. This makes test startup in
test/integration/scheduler_perf a bit faster because the extra patching can be
avoided.

It also leads to a better label because previously, the unique label value
didn't match the node name. This is required for simulating dynamic resource
allocation, which relies on the label to track where an allocated claim is
available.
2023-04-11 16:35:37 +02:00
Patrick Ohly
aa73f06e56 scheduler perf: allow creating 0 items
It makes sense to define a test where, depending on the parameters, some
operation creations zero pods, namespaces or nodes. The validation didn't allow
that previously due to the way how it was implemented although the underlying
code works fine with zero as count.
2023-04-11 09:59:16 +02:00
Patrick Ohly
49bbf7c268 scheduler_perf: fix race condition
collector.collect got called without ensuring that collector.run had
terminated, so it could have happened that collector.run adds another sample
while collector.collect is reading them.
2023-04-11 09:46:34 +02:00
Patrick Ohly
a869a89825 scheduler perf: remove cleanup func
b.Cleanup may as well get called inside the function instead
of leaving that to the caller.
2023-04-11 09:43:45 +02:00
sarab
8d18ae6fc2 Use the generic Set in scheduler 2023-04-09 11:34:17 +05:30
Wei Huang
c9bc2f98d0
fix: remove SchedulingMigratedInTreePVs feature gate in sched perf test 2023-03-08 08:34:44 -08:00
Patrick Ohly
cc4bcd1d8e scheduler_perf: report data items as benchmark results
This replaces the pretty useless us/op metric (useless because it includes
setup and teardown times) with the same values that also get stored in the JSON
file.

The main advantage is that benchstat can be used to analyze and compare
results.
2023-02-28 23:08:23 +01:00
Patrick Ohly
961129c5f1 scheduler_perf: add logging flags
This enables testing of different real production configurations (JSON
vs. text, different log levels, contextual logging).
2023-02-28 23:08:17 +01:00
Kante Yin
3d0894fabf
Fix failure(context canceled) in scheduler_perf benchmark (#114843)
* Fix failure in scheduler_perf benchmark

Signed-off-by: Kante Yin <kerthcet@gmail.com>

* Fatal when error in cleaning up nodes in scheduler perf tests

Signed-off-by: Kante Yin <kerthcet@gmail.com>

* Use derived context to better organize the codes

Signed-off-by: Kante Yin <kerthcet@gmail.com>

* Change log level to 2 in scheduler perf-test

Signed-off-by: Kante Yin <kerthcet@gmail.com>

---------

Signed-off-by: Kante Yin <kerthcet@gmail.com>
2023-01-30 16:21:00 -08:00
Kensei Nakada
e8092cc885 cleanup(scheduler_perf): remove all removed feature gates 2023-01-04 01:07:47 +00:00
Patrick Ohly
2f6c4f5eab e2e: use Ginkgo context
All code must use the context from Ginkgo when doing API calls or polling for a
change, otherwise the code would not return immediately when the test gets
aborted.
2022-12-16 20:14:04 +01:00
Mark Rossetti
534f052a8d
Updating pause image refernces to 3.9
Signed-off-by: Mark Rossetti <marosset@microsoft.com>
2022-11-14 10:24:54 -08:00
kerthcet
d6ffb47832 Replace klog with benchmark log in scheduler_perf
Signed-off-by: kerthcet <kerthcet@gmail.com>
2022-11-09 09:11:55 +08:00
Kubernetes Prow Robot
73f6b96f0a
Merge pull request #113615 from kerthcet/feat/add-benchmark-tests
Add nodeInclusionPolicy benchmark tests to scheduler_perf
2022-11-07 09:18:28 -08:00
kerthcet
bc15aca26d Refactor SchedulerConfigFile
Rename to SchedulerConfigPath and make it a pointer
to be consist with other fields

Signed-off-by: kerthcet <kerthcet@gmail.com>
2022-11-05 00:30:34 +08:00
kerthcet
48f2c9ec20 Add benchmark tests for nodeInclusionPolicy
Signed-off-by: kerthcet <kerthcet@gmail.com>
2022-11-05 00:13:43 +08:00
kerthcet
cfc53ee524 Refactor code and annotations for readability
Signed-off-by: kerthcet <kerthcet@gmail.com>
2022-11-01 17:44:45 +08:00
kerthcet
21e8a69a22 Use operationCode instead of string directly
Signed-off-by: kerthcet <kerthcet@gmail.com>
2022-11-01 17:01:22 +08:00
Patrick Ohly
41619ace15 stop using deprecated klog flags
Some scripts and tools still relied on the deprecated flags, the ones
which are about to be removed.

This is intentionally not a complete removal of all those flags in the entire
repo. This would lead to much more code churn also in places where commands
still accept the flags because they use klog directly.
2022-09-04 21:02:43 +02:00