Automated performance testing #74730

Open
opened 2020-03-13 16:07:50 +01:00 by Brecht Van Lommel · 16 comments

See here for documentation on the performance testing framework:
https://wiki.blender.org/wiki/Tools/Tests/Performance

This system is separate from Open Data by design, for developers rather than users.

Tasks

- [ ] Automate running on buildbot
  - [ ] Nightly builds tracking performance over time
  - [ ] GPU support
  - [ ] Publish HTML results on builder.blender.org/download
- [ ] Tests
  - [x] Cycles
  - [ ] Animation playback
  - [x] Geometry nodes
  - [ ] Object and mesh editing operators
  - [ ] I/O
Author
Owner

Changed status from 'Needs Triage' to: 'Confirmed'

Author
Owner

Added subscriber: @brecht

Author
Owner
Added subscribers: @SemMulder, @Jeroen-Bakker, @Sergey, @ideasman42, @mont29, @jesterking
Author
Owner

We don't immediately have to build out complicated infrastructure for this. But I thought it would be a good time to create this task now that we are starting work on performance projects.

The first step can just be gathering test files in `lib/benchmarks`.

Member

Looking at rust-lang: they have a mechanism that stores the test results on the machine it was tested on when a test was changed, run for the first time, or forced by a flag.

The test would fail when the test input was the same as the previous run (hash based?) and the time was not within the allowed bandwidth. For the animation tests it would help if blender was able to run in the foreground.

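A minimal sketch of how such a stored-baseline check could work, assuming results are kept in a local JSON file keyed by test name; the file name, tolerance, and function names here are illustrative only, not an existing Blender or rust-lang API:

```
# Minimal sketch of a stored-baseline check; file name, tolerance and
# function names are illustrative, not an existing Blender or rust-lang API.
import hashlib
import json
from pathlib import Path

BASELINE_FILE = Path("benchmark_baseline.json")  # stored on the test machine
TOLERANCE = 0.10  # allowed deviation ("bandwidth") from the stored time


def input_hash(blend_path: Path) -> str:
    # Hash the test input so a changed file invalidates the stored baseline.
    return hashlib.sha256(blend_path.read_bytes()).hexdigest()


def check_timing(test_name: str, blend_path: Path, seconds: float, force: bool = False) -> bool:
    baselines = json.loads(BASELINE_FILE.read_text()) if BASELINE_FILE.exists() else {}
    entry = baselines.get(test_name)
    digest = input_hash(blend_path)

    # First run, changed input, or forced: store a new baseline and pass.
    if force or entry is None or entry["hash"] != digest:
        baselines[test_name] = {"hash": digest, "seconds": seconds}
        BASELINE_FILE.write_text(json.dumps(baselines, indent=2))
        return True

    # Same input as before: fail when the time falls outside the bandwidth.
    return abs(seconds - entry["seconds"]) <= TOLERANCE * entry["seconds"]
```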
Author
Owner

Some experience from creating the Cycles benchmarking scripts:

  • Test should run with a dedicated Blender build that has all the proper build flags. Running performance tests with the build used for development means having to switch options too often and stops you from working while the tests run. Building should be handled by the test script.
  • It should be easy to run more tests with a different device, or different benchmark files, add extra revisions to bisect an issue, re-run a failed test, etc. Tests should be queued to run by another script, rather than manually having to manage when a test runs.
  • For Cycles I run each test 3 times, interleaved, and display the variance in graphs to detect tests with unpredictable performance. Disable ASLR and Turbo Boost to get more predictable performance on the CPU. For Cycles renders, test times are usually within 0.1% for different runs. (A rough sketch of this interleaved-timing setup follows below.)
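
A rough sketch of the interleaved-run approach described above, under stated assumptions: the command lines in `TESTS` are placeholders rather than the real benchmark scripts, `setarch -R` is used to launch each process with ASLR disabled on Linux, and disabling Turbo Boost is left as a system-level step (for example `/sys/devices/system/cpu/intel_pstate/no_turbo` on Intel CPUs with the intel_pstate driver):

```
# Sketch only: run every test several times, interleaved across tests,
# and report the spread to spot benchmarks with unpredictable timings.
# The command lines below are placeholders; `setarch -R` disables ASLR
# for the launched process on Linux.
import statistics
import subprocess
import time

TESTS = {
    "cycles_wdas_cloud": ["setarch", "-R", "blender", "-b", "wdas_cloud.blend", "-f", "1"],
    "undo_translation": ["setarch", "-R", "blender", "-b", "-P", "undo_test.py"],
}
ROUNDS = 3  # run every test multiple times

timings = {name: [] for name in TESTS}
for _ in range(ROUNDS):
    # Interleave: one round-robin pass over all tests per round.
    for name, cmd in TESTS.items():
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        timings[name].append(time.perf_counter() - start)

for name, samples in timings.items():
    mean = statistics.mean(samples)
    spread = statistics.stdev(samples) / mean * 100.0
    print(f"{name:20s} mean {mean:.3f}s  spread {spread:.2f}%")
```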
Author
Owner

> In #74730#890732, @Jeroen-Bakker wrote:
> Looking at rust-lang: they have a mechanism that stores the test results on the machine it was tested on when a test was changed, run for the first time, or forced by a flag.
>
> The test would fail when the test input was the same as the previous run (hash based?) and the time was not within the allowed bandwidth.

I think that's more difficult in our case. We will have more complicated tests that you probably wouldn't run locally unless you were specifically working on performance. At least I'm not imagining these to be part of our ctests.

> For the animation tests it would help if blender was able to run in the foreground.

If we have dedicated machines for performance testing, they should have OpenGL to run such tests. For Blender itself, it's possible to run tests in the foreground; `WITH_OPENGL_DRAW_TESTS` does it, for example.


This issue was referenced by dc3f46d96b780260d982954578cac3bff74efd83
Member

Added subscriber: @zazizizou

Member

If I understood correctly, a performance test could look like the following (a rough Python sketch of this loop is included after the steps):

1) Build blender
2) For each test case
        i. run init script (e.g. go to edit mode, get correct context etc..)
        ii. start clock
        iii. <run relevant script> 
        iv. stop clock
3) export results in Open Data .json format
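
A minimal sketch of steps 2i–2iv and 3, written as a script Blender could run with `blender --background <file.blend> --python <script.py>`; the init/run callables, the output path, and the JSON fields are placeholders and only loosely follow an Open Data style layout:

```
# Minimal sketch: run with `blender --background <file.blend> --python <this script>`.
# Assumes the active object in the .blend file is a mesh; the init/run callables,
# output path and JSON fields are placeholders, not an Open Data schema.
import json
import time

import bpy


def init_edit_mode():
    # i. Prepare the context (not timed): enter edit mode on the active object.
    bpy.ops.object.mode_set(mode='EDIT')


def run_subdivide():
    # iii. The measured operation: a mesh operator as an example.
    bpy.ops.mesh.subdivide(number_cuts=2)


TEST_CASES = [("subdivide_active_mesh", init_edit_mode, run_subdivide)]

results = []
for name, init, run in TEST_CASES:
    init()
    start = time.perf_counter()            # ii. start clock
    run()
    elapsed = time.perf_counter() - start  # iv. stop clock
    results.append({
        "name": name,
        "time": elapsed,
        "blender_version": bpy.app.version_string,
    })

# 3. Export the results as JSON (placeholder path and layout).
with open("/tmp/results.json", "w") as f:
    json.dump(results, f, indent=2)
```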

Animation:
We probably need a relative time measurement here, e.g. `number of frames / fps - actual time`

Object/Mesh Operators:
For me this one is straightforward as long as no human input is needed. We could use a similar approach to the modifiers regression testing but use large meshes with large selection and only restrict the time measurement to the actual operator. Is this also what you were thinking of? Or should we take a similar approach to the Cycles tests and have a complex scene with large objects and meshes, apply many modifiers and operators and see how long the whole thing takes?

Cycles:
Not sure what should be done there, the Cycles benchmark seems to do exactly what you want already?

Author
Owner

Yes, that's all correct. Some of the tests might be artificial cases, but real-world scenes are definitely what I'm thinking of. We already have a system for Cycles; the purpose would just be to make all the performance tests use a single system that developers can run on their computer and that we can also run every night on the buildbot. Beyond that, maybe include some in the Blender Benchmark.

Looking at Open Data, it's difficult to re-use a lot. There are about 500 lines of Python code, and the device detection in particular is something we can use. But part of it is also implemented in Go, things like running Blender with particular command-line options and parsing the Blender output to find the render time. That I think we should have in Python; each type of test needs to work a bit differently there, and the abstraction should be higher.

I think perhaps the best way forward would be to include this code in the Blender repo, in a way that there is a simple API that the Blender Benchmark can use and provide a nice UI for, but that developers can also use to test locally on their machines.

I prototyped something last weekend, based on my Cycles benchmarking code but cleaner. It's incomplete of course, no support for GPU devices, no graphs, no proper JSON data format, test implementations don't really measure the right thing, etc.
https://developer.blender.org/diffusion/B/browse/performance-test/tests/performance/

Example output from that:

```
$ ./tests/performance/benchmark
usage: benchmark <command> [<args>]

Commands:
  init                   Set up git worktree and build in ../benchmark

  list                   List available tests
  devices                List available devices

  run                    Execute benchmarks for current revision
  add                    Queue current revision to be benchmarked
  remove                 Removed current revision
  clear                  Removed all queued and completed benchmarks

  status                 List queued and completed tests

  server                 Run as server, executing queued revisions

Arguments for run, add, remove and status:
  --test <pattern>       Pattern to match test name, may include wildcards
  --device <device>      Use only specified device
  --revision <revision>  Use specified instead of current revision

$ ./tests/performance/benchmark list
cycles_wdas_cloud    CPU
undo_translation     CPU

$ ./tests/performance/benchmark run --test undo*
f9d8640              undo_translation     CPU        [done]     1.1524s
$ git checkout other-branch
$ ./tests/performance/benchmark run --test undo*
87c825e              undo_translation     CPU        [done]     1.1555s

$ ./tests/performance/benchmark status
f9d8640              undo_translation     CPU        [done]     1.1524s
87c825e              undo_translation     CPU        [done]     1.1555s
```
Member

That actually looks nice. I just saw that the code is in the branch `performance-test`. I would very much like to contribute to it. What I would like to do next is:

  • Put a layer of abstraction between `environment.Test` and the actual test case, e.g. `AnimationTest`, such that adding a new test can be done by just adding a new spec, e.g. a list of parameters for a modifier or a blend file path for animation or rendering.
  • Implement an interface class for the following areas:
    • Animation
    • Modifiers
    • Operators (object and edit mode)
    • Compositor
    • Cycles
    • Custom script maybe?

I still have to look at the Open Data benchmark more closely to understand the format and how a GUI would interact with the benchmark and see how much I can re-use for the input/output as well.

Author
Owner

I was expecting `AnimationTest` itself to be that abstraction layer; multiple instances with different .blend files can already be generated. If there is more abstraction needed for particular types of tests that's fine, I'm just not sure what the concrete cases would be. "Custom script" I don't understand; that's what is intended to be possible already.

I wonder if we wouldn't be duplicating the regression test code too much; maybe we can share code. I think creating modifier regression tests should be simplified, so that the input is only a .blend file with a bunch of test mesh objects, and tests are defined fully by a line of Python code. Creating collections and expect objects should not be done manually. With that type of setup it's easier to also reuse it for performance tests.

Mainly I was thinking of actual production files and not so much synthetic tests, so I haven't thought about that design much.

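As a purely hypothetical illustration of the "tests defined fully by a line of Python code" idea (`ModifierSpecTest` and its fields are invented names, not an existing Blender test API):

```
# Hypothetical illustration only: ModifierSpecTest and its fields are invented
# names, not an existing Blender test API. The input would be one .blend file
# with test meshes; collections and expected objects would be generated.
from dataclasses import dataclass, field


@dataclass
class ModifierSpecTest:
    name: str            # test name, also used to look up the expected result
    object_name: str     # test mesh object in the shared .blend file
    modifier_type: str   # e.g. 'SUBSURF', 'DECIMATE'
    parameters: dict = field(default_factory=dict)


# Adding a test is then a single line, reusable for regression and performance runs.
TESTS = [
    ModifierSpecTest("subsurf_suzanne", "Suzanne", 'SUBSURF', {"levels": 2}),
    ModifierSpecTest("decimate_cube", "Cube", 'DECIMATE', {"ratio": 0.5}),
]
```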
Member

Added subscriber: @Calra

Member

> In #74730#897649, @brecht wrote:
> Creating collections and expect objects should not be done manually. With that type of setup it's easier to also reuse it for performance tests.

Actually I was discussing just that with @Calra. But we might need a few more adjustments to make the framework usable as a performance test.

> If there is more abstraction needed for particular types of tests that's fine, just not sure what the concrete cases would be

I was thinking about simplifying adding more tests. It might be simple for animation or Cycles, but operators need operator-specific parameters/selection and I want to avoid creating a new class for each new (type of) operator.

> I wonder if we wouldn't be duplicating the regression test code too much, maybe we can share code

Sure, I was thinking about creating the interface only. The implementation can use the code from regression tests.

Member

Yes, I have added it as my first deliverable for GSoC; for regression testing it would be better if the user has a choice of doing it in Blender as well as with just a line of Python.

Philipp Oeser removed the Interest: Platforms, Builds & Tests label 2023-02-10 08:58:25 +01:00