Automated performance testing #74730

Open
opened 2020-03-13 16:07:50 +01:00 by Brecht Van Lommel · 16 comments

See here for documentation on the performance testing framework:
https://wiki.blender.org/wiki/Tools/Tests/Performance

This system is separate from Open Data by design, for developers rather than users.

Tasks

- [ ] Automate running on buildbot
  - [ ] Nightly builds tracking performance over time
  - [ ] GPU support
  - [ ] Publish HTML results on builder.blender.org/download
- [ ] Tests
  - [x] Cycles
  - [ ] Animation playback
  - [x] Geometry nodes
  - [ ] Object and mesh editing operators
  - [ ] I/O
Author
Owner

Changed status from 'Needs Triage' to: 'Confirmed'

Author
Owner

Added subscriber: @brecht

Author
Owner
Added subscribers: @SemMulder, @Jeroen-Bakker, @Sergey, @ideasman42, @mont29, @jesterking
Author
Owner

We don't immediately have to build out complicated infrastructure for this. But I thought it would be a good time to create this task now that we are starting work on performance projects.

The first step can just be gathering test files in `lib/benchmarks`.

Member

Looking at rust-lang: they have a mechanism that stores the test results on the machine it was tested on when a test was changed, run for the first time, or forced by a flag.

The test would fail when the test input was the same as the previous run (hash based?) and the time was not within the allowed bandwidth. For the animation tests it would help if blender was able to run in the foreground.

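A minimal sketch of how such a stored-baseline check could work, assuming results are kept in a local JSON file keyed by test name; the file name, tolerance, and function names here are illustrative only, not an existing Blender or rust-lang API:

```
# Minimal sketch of a stored-baseline check; file name, tolerance and
# function names are illustrative, not an existing Blender or rust-lang API.
import hashlib
import json
from pathlib import Path

BASELINE_FILE = Path("benchmark_baseline.json")  # stored on the test machine
TOLERANCE = 0.10  # allowed deviation ("bandwidth") from the stored time


def input_hash(blend_path: Path) -> str:
    # Hash the test input so a changed file invalidates the stored baseline.
    return hashlib.sha256(blend_path.read_bytes()).hexdigest()


def check_timing(test_name: str, blend_path: Path, seconds: float, force: bool = False) -> bool:
    baselines = json.loads(BASELINE_FILE.read_text()) if BASELINE_FILE.exists() else {}
    entry = baselines.get(test_name)
    digest = input_hash(blend_path)

    # First run, changed input, or forced: store a new baseline and pass.
    if force or entry is None or entry["hash"] != digest:
        baselines[test_name] = {"hash": digest, "seconds": seconds}
        BASELINE_FILE.write_text(json.dumps(baselines, indent=2))
        return True

    # Same input as before: fail when the time falls outside the bandwidth.
    return abs(seconds - entry["seconds"]) <= TOLERANCE * entry["seconds"]
```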
Author
Owner

Some experience from creating the Cycles benchmarking scripts:

  • Test should run with a dedicated Blender build that has all the proper build flags. Running performance tests with the build used for development means having to switch options too often and stops you from working while the tests run. Building should be handled by the test script.
  • It should be easy to run more tests with a different device, or different benchmark files, add extra revisions to bisect an issue, re-run a failed test, etc. Tests should be queued to run by another script, rather than manually having to manage when a test runs.
  • For Cycles I run each test 3 times, interleaved, and display the variance in graphs to detect tests with unpredictable performance. Disable ASLR and Turbo Boost to get more predictable performance on the CPU. For Cycles renders, test times are usually within 0.1% for different runs. (A rough sketch of this interleaved-timing setup follows below.)
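
A rough sketch of the interleaved-run approach described above, under stated assumptions: the command lines in `TESTS` are placeholders rather than the real benchmark scripts, `setarch -R` is used to launch each process with ASLR disabled on Linux, and disabling Turbo Boost is left as a system-level step (for example `/sys/devices/system/cpu/intel_pstate/no_turbo` on Intel CPUs with the intel_pstate driver):

```
# Sketch only: run every test several times, interleaved across tests,
# and report the spread to spot benchmarks with unpredictable timings.
# The command lines below are placeholders; `setarch -R` disables ASLR
# for the launched process on Linux.
import statistics
import subprocess
import time

TESTS = {
    "cycles_wdas_cloud": ["setarch", "-R", "blender", "-b", "wdas_cloud.blend", "-f", "1"],
    "undo_translation": ["setarch", "-R", "blender", "-b", "-P", "undo_test.py"],
}
ROUNDS = 3  # run every test multiple times

timings = {name: [] for name in TESTS}
for _ in range(ROUNDS):
    # Interleave: one round-robin pass over all tests per round.
    for name, cmd in TESTS.items():
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        timings[name].append(time.perf_counter() - start)

for name, samples in timings.items():
    mean = statistics.mean(samples)
    spread = statistics.stdev(samples) / mean * 100.0
    print(f"{name:20s} mean {mean:.3f}s  spread {spread:.2f}%")
```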
Author
Owner

> In #74730#890732, @Jeroen-Bakker wrote:
> Looking at rust-lang: they have a mechanism that stores the test results on the machine it was tested on when a test was changed, run for the first time, or forced by a flag.
>
> The test would fail when the test input was the same as the previous run (hash based?) and the time was not within the allowed bandwidth.

I think that's more difficult in our case. We will have more complicated tests that you probably wouldn't run locally unless you were specifically working on performance. At least I'm not imagining these to be part of our ctests.

> For the animation tests it would help if blender was able to run in the foreground.

If we have dedicated machines for performance testing, they should have OpenGL to run such tests. For Blender itself, it's possible to run tests in the foreground; `WITH_OPENGL_DRAW_TESTS` does it, for example.


This issue was referenced by dc3f46d96b780260d982954578cac3bff74efd83
Member

Added subscriber: @zazizizou

Member

If I understood correctly, a performance test could look like the following (a rough Python sketch of this loop is included after the steps):

1) Build blender
2) For each test case
        i. run init script (e.g. go to edit mode, get correct context etc..)
        ii. start clock
        iii. <run relevant script> 
        iv. stop clock
3) export results in Open Data .json format
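
A minimal sketch of steps 2i–2iv and 3, written as a script Blender could run with `blender --background <file.blend> --python <script.py>`; the init/run callables, the output path, and the JSON fields are placeholders and only loosely follow an Open Data style layout:

```
# Minimal sketch: run with `blender --background <file.blend> --python <this script>`.
# Assumes the active object in the .blend file is a mesh; the init/run callables,
# output path and JSON fields are placeholders, not an Open Data schema.
import json
import time

import bpy


def init_edit_mode():
    # i. Prepare the context (not timed): enter edit mode on the active object.
    bpy.ops.object.mode_set(mode='EDIT')


def run_subdivide():
    # iii. The measured operation: a mesh operator as an example.
    bpy.ops.mesh.subdivide(number_cuts=2)


TEST_CASES = [("subdivide_active_mesh", init_edit_mode, run_subdivide)]

results = []
for name, init, run in TEST_CASES:
    init()
    start = time.perf_counter()            # ii. start clock
    run()
    elapsed = time.perf_counter() - start  # iv. stop clock
    results.append({
        "name": name,
        "time": elapsed,
        "blender_version": bpy.app.version_string,
    })

# 3. Export the results as JSON (placeholder path and layout).
with open("/tmp/results.json", "w") as f:
    json.dump(results, f, indent=2)
```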

Animation:
We probably need a relative time measurement here, e.g. `number of frames / fps - actual time`

Object/Mesh Operators:
For me this one is straightforward as long as no human input is needed. We could use a similar approach to the modifiers regression testing but use large meshes with large selection and only restrict the time measurement to the actual operator. Is this also what you were thinking of? Or should we take a similar approach to the Cycles tests and have a complex scene with large objects and meshes, apply many modifiers and operators and see how long the whole thing takes?

Cycles:
Not sure what should be done there, the Cycles benchmark seems to do exactly what you want already?

Author
Owner

Yes, that's all correct. Some of the tests might be artificial cases, but real-world scenes are definitely what I'm thinking of. We already have a system for Cycles; the purpose would just be to make all the performance tests use a single system that developers can run on their computer and that we can also run every night on the buildbot. Beyond that, maybe include some in the Blender Benchmark.

Looking at Open Data, it's difficult to re-use a lot. There are about 500 lines of Python code, and the device detection in particular is something we can use. But part of it is also implemented in Go, things like running Blender with particular command-line options and parsing the Blender output to find the render time. That I think we should have in Python; each type of test needs to work a bit differently there, and the abstraction should be higher.

I think perhaps the best way forward would be to include this code in the Blender repo, in a way that there is a simple API that the Blender Benchmark can use and provide a nice UI for, but that developers can also use to test locally on their machines.

I prototyped something last weekend, based on my Cycles benchmarking code but cleaner. It's incomplete of course, no support for GPU devices, no graphs, no proper JSON data format, test implementations don't really measure the right thing, etc.
https://developer.blender.org/diffusion/B/browse/performance-test/tests/performance/

Example output from that:

```
$ ./tests/performance/benchmark
usage: benchmark <command> [<args>]

Commands:
  init                   Set up git worktree and build in ../benchmark

  list                   List available tests
  devices                List available devices

  run                    Execute benchmarks for current revision
  add                    Queue current revision to be benchmarked
  remove                 Removed current revision
  clear                  Removed all queued and completed benchmarks

  status                 List queued and completed tests

  server                 Run as server, executing queued revisions

Arguments for run, add, remove and status:
  --test <pattern>       Pattern to match test name, may include wildcards
  --device <device>      Use only specified device
  --revision <revision>  Use specified instead of current revision

$ ./tests/performance/benchmark list
cycles_wdas_cloud    CPU
undo_translation     CPU

$ ./tests/performance/benchmark run --test undo*
f9d8640              undo_translation     CPU        [done]     1.1524s
$ git checkout other-branch
$ ./tests/performance/benchmark run --test undo*
87c825e              undo_translation     CPU        [done]     1.1555s

$ ./tests/performance/benchmark status
f9d8640              undo_translation     CPU        [done]     1.1524s
87c825e              undo_translation     CPU        [done]     1.1555s
```
Member

That actually looks nice. I just saw that the code is in the branch `performance-test`. I would very much like to contribute to it. What I would like to do next is:

  • Put a layer of abstraction between `environment.Test` and the actual test case, e.g. `AnimationTest`, such that adding a new test can be done by just adding a new spec, e.g. a list of parameters for a modifier or a blend file path for animation or rendering.
  • Implement an interface class for the following areas:
    • Animation
    • Modifiers
    • Operators (object and edit mode)
    • Compositor
    • Cycles
    • Custom script maybe?

I still have to look at the Open Data benchmark more closely to understand the format and how a GUI would interact with the benchmark and see how much I can re-use for the input/output as well.

Author
Owner

I was expecting `AnimationTest` itself to be that abstraction layer; multiple instances with different .blend files can already be generated. If there is more abstraction needed for particular types of tests that's fine, I'm just not sure what the concrete cases would be. "Custom script" I don't understand; that's what is intended to be possible already.

I wonder if we wouldn't be duplicating the regression test code too much; maybe we can share code. I think creating modifier regression tests should be simplified, so that the input is only a .blend file with a bunch of test mesh objects, and tests are defined fully by a line of Python code. Creating collections and expect objects should not be done manually. With that type of setup it's easier to also reuse it for performance tests.

Mainly I was thinking of actual production files and not so much synthetic tests, so I haven't thought about that design much.

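As a purely hypothetical illustration of the "tests defined fully by a line of Python code" idea (`ModifierSpecTest` and its fields are invented names, not an existing Blender test API):

```
# Hypothetical illustration only: ModifierSpecTest and its fields are invented
# names, not an existing Blender test API. The input would be one .blend file
# with test meshes; collections and expected objects would be generated.
from dataclasses import dataclass, field


@dataclass
class ModifierSpecTest:
    name: str            # test name, also used to look up the expected result
    object_name: str     # test mesh object in the shared .blend file
    modifier_type: str   # e.g. 'SUBSURF', 'DECIMATE'
    parameters: dict = field(default_factory=dict)


# Adding a test is then a single line, reusable for regression and performance runs.
TESTS = [
    ModifierSpecTest("subsurf_suzanne", "Suzanne", 'SUBSURF', {"levels": 2}),
    ModifierSpecTest("decimate_cube", "Cube", 'DECIMATE', {"ratio": 0.5}),
]
```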
Member

Added subscriber: @Calra

Member

> In #74730#897649, @brecht wrote:
> Creating collections and expect objects should not be done manually. With that type of setup it's easier to also reuse it for performance tests.

Actually I was discussing just that with @Calra. But we might need a few more adjustments to make the framework usable as a performance test.

> If there is more abstraction needed for particular types of tests that's fine, just not sure what the concrete cases would be

I was thinking about simplifying adding more tests. It might be simple for animation or Cycles, but operators need operator-specific parameters/selection and I want to avoid creating a new class for each new (type of) operator.

> I wonder if we wouldn't be duplicating the regression test code too much, maybe we can share code

Sure, I was thinking about creating the interface only. The implementation can use the code from regression tests.

Member

Yes, I have added it as my first deliverable for GSoC; for regression testing it would be better if the user has a choice of doing it in Blender as well as with just a line of Python.

Philipp Oeser removed the Interest: Platforms, Builds & Tests label 2023-02-10 08:58:25 +01:00