WIP: Rewrite Graph Editor drawing code #118662

Christoph Lendenfeld · 2024-02-23T14:21:00+01:00

2024-02-23 14:21:00 +01:00

Not meant for merging. This is an exploration of rewriting the GE drawing code to make it scale better.

Issue

The issue I am trying to solve is the massive slowdown of the Graph Editor when displaying large amounts of keys and handles.
Profiling showed this is mainly a CPU bottleneck when sifting through all FCurves and adding those to the IMM buffer.

Solution of this PR

Run the processes that prepare data for the GPU in parallel, then upload later.

The root function is graph_draw_curves.
Within that I run a threading::parallel_for over each of the FCurves to generate the data needed to feed the GPU.
After all that data is passed to draw_fcurve_render_data which handles the actual drawing.

Questions

Currently the code in the PR copies data twice, once in the build, then later into the GPU vertex array.
Is there a way to remove the second copy? The data is already exactly in the format that the GPU wants, so I thought we could just give it a pointer and upload directly?

Performance

TODO

Known Issues

drawing order isn't as before. Selected verts are not guaranteed to be on top
graph_draw_driver_debug is never called
curves only draw on the second refresh. no idea why
sampled curves will not work
curves with modifiers will not work
handles points are drawn in the same shape as keys
unselected curves are drawn too thick

Test file

https://drive.google.com/file/d/11wQhuYmhCtQgjT4uJWORTwKlWit1irp2/view?usp=sharing
Imported and extended from: https://sites.google.com/a/cgspeed.com/cgspeed/motion-capture/the-motionbuilder-friendly-bvh-conversion-release-of-cmus-motion-capture-database

Not meant for merging. This is an exploration of rewriting the GE drawing code to make it scale better. ## Issue The issue I am trying to solve is the massive slowdown of the Graph Editor when displaying large amounts of keys and handles. Profiling showed this is mainly a CPU bottleneck when sifting through all `FCurves` and adding those to the IMM buffer. ![image](/attachments/2e8a5ecd-dce7-4c15-be05-f7820f9d4525) ## Solution of this PR Run the processes that prepare data for the GPU in parallel, then upload later. The root function is `graph_draw_curves`. Within that I run a `threading::parallel_for` over each of the `FCurves` to generate the data needed to feed the GPU. After all that data is passed to `draw_fcurve_render_data` which handles the actual drawing. ## Questions Currently the code in the PR copies data twice, once in the build, then later into the GPU vertex array. Is there a way to remove the second copy? The data is already exactly in the format that the GPU wants, so I thought we could just give it a pointer and upload directly? ## Performance TODO ## Known Issues * drawing order isn't as before. Selected verts are not guaranteed to be on top * `graph_draw_driver_debug` is never called * curves only draw on the second refresh. no idea why * sampled curves will not work * curves with modifiers will not work * handles points are drawn in the same shape as keys * unselected curves are drawn too thick ## Test file https://drive.google.com/file/d/11wQhuYmhCtQgjT4uJWORTwKlWit1irp2/view?usp=sharing Imported and extended from: https://sites.google.com/a/cgspeed.com/cgspeed/motion-capture/the-motionbuilder-friendly-bvh-conversion-release-of-cmus-motion-capture-database

image.png

69 KiB

Christoph Lendenfeld added the

Module

Animation & Rigging

label 2024-02-23 14:21:00 +01:00

Christoph Lendenfeld added 3 commits 2024-02-23 14:21:05 +01:00

a wip b003bcde9c

wip but kinda working e0b60d12fb

Merge branch 'main' into ge_gpu_batch e5cdaee7d7

Hans Goudey reviewed 2024-02-23 15:15:17 +01:00

Hans Goudey left a comment

Just reading this out of curiosity and left some comments. Sorry if some of them are too preemptive!

I wonder if skipping the temporary arrays and uploading the data directly to the vertex buffer would be faster?

Just reading this out of curiosity and left some comments. Sorry if some of them are too preemptive! I wonder if skipping the temporary arrays and uploading the data directly to the vertex buffer would be faster?

source/blender/editors/space_graph/graph_draw.cc Outdated

						
				@ -1479,0 +1516,4 @@

				  ColorTheme4b color_sel = get_fcurve_vertex_color(fcu, true);

				  ColorTheme4b color_desel = get_fcurve_vertex_color(fcu, false);

				  const blender::int2 bounding_indices = get_bounding_bezt_indices(fcu, range[0], range[1]);

Hans Goudey commented

2024-02-23 15:13:18 +01:00

Might be nicer if this returned an IndexRange, which could then be iterated like for (const int i : bounding_indices)

Might be nicer if this returned an `IndexRange`, which could then be iterated like `for (const int i : bounding_indices)`

source/blender/editors/space_graph/graph_draw.cc Outdated

						
				@ -1479,0 +1523,4 @@

				  int array_index = 0;

				  for (int i = bounding_indices[0]; i <= bounding_indices[1]; i++) {

				    BezTriple *bezt = &fcu->bezt[i];

Hans Goudey commented

2024-02-23 15:15:05 +01:00

const BezTriple &bezt = fcu.bezt[i] is a bit nicer

`const BezTriple &bezt = fcu.bezt[i]` is a bit nicer

source/blender/editors/space_graph/graph_draw.cc Outdated

						
				@ -1491,3 +1616,1 @@

				   * draw curve, then handle-lines, and finally vertices in this order so that

				   * the data will be layered correctly

				   */

				  Array<bAnimListElem *> anim_list_elements = Array<bAnimListElem *>(fcurve_count);

Hans Goudey commented

2024-02-23 15:10:53 +01:00

This can be written like

Array<bAnimListElem *> anim_list_elements(fcurve_count);

This can be written like ``` Array<bAnimListElem *> anim_list_elements(fcurve_count);

source/blender/editors/space_graph/graph_draw.cc Outdated

						
				@ -1494,0 +1624,4 @@

				  const float2 v2d_range = {v2d->cur.xmin, v2d->cur.xmax};

				  Array<FCurveRenderData> render_data = Array<FCurveRenderData>(fcurve_count);

				  for (int i = 0; i < anim_list_elements.size(); i++) {

Hans Goudey commented

2024-02-23 15:11:17 +01:00

How about for (const int i : anim_list_elements.index_range()) { :)

How about `for (const int i : anim_list_elements.index_range()) {` :)

Hans Goudey referenced this pull request

2024-02-23 15:20:23 +01:00

WIP: Rewrite Graph Editor drawing code #118661

Christoph Lendenfeld commented

2024-02-23 15:55:45 +01:00

Just reading this out of curiosity and left some comments. Sorry if some of them are too preemptive!

I wonder if skipping the temporary arrays and uploading the data directly to the vertex buffer would be faster?

It's a bit preemptive but I am always open to suggestions.
I was thinking I could run the data composition in parallel and then just copy to the vertex buffer after.

What I was actually hoping is that I could give the batch a new vertex buffer pointer to upload instead of copying, but that doesn't seem to be supported.

Anyway, this PR is me learning the GPUBatch thing while at the same time removing IMM from the Graph Editor drawing code

> Just reading this out of curiosity and left some comments. Sorry if some of them are too preemptive! > > I wonder if skipping the temporary arrays and uploading the data directly to the vertex buffer would be faster? It's a bit preemptive but I am always open to suggestions. I was thinking I could run the data composition in parallel and then just copy to the vertex buffer after. What I was actually hoping is that I could give the batch a new vertex buffer pointer to upload instead of copying, but that doesn't seem to be supported. Anyway, this PR is me learning the GPUBatch thing while at the same time removing IMM from the Graph Editor drawing code

Christoph Lendenfeld added 2 commits 2024-02-23 17:03:34 +01:00

line drawing and threading b1a201643f

fix memory leak 72f2f8e81f

Christoph Lendenfeld added 1 commit 2024-02-23 17:14:42 +01:00

fix glitches 51ec4115ab

Christoph Lendenfeld added 3 commits 2024-02-27 12:49:13 +01:00

build handle data 4a88dcdb4e

Merge branch 'main' into ge_gpu_batch 01629e6f6f

implement handle drawing 385a6e6cd4

Christoph Lendenfeld added 2 commits 2024-02-27 13:31:00 +01:00

simplify handle vert drawing 32a8cc6ee7

remove unused code 963f627f6b

Christoph Lendenfeld added 3 commits 2024-02-27 15:35:31 +01:00

clean up things 8f5521378e

fix for no handles visible ff69050e3f

respect user pref setting 6ae71de722

Jeroen Bakker reviewed 2024-02-27 16:07:33 +01:00

Jeroen Bakker left a comment

It is good to actually use index buffers and vertex buffers.
It would be even better when you can reuse them between frames and tag them when they need to be updated.
This way the data transfer can be removed when nothing special happens on the screen.

Now you're creating, building, uploading, drawing and freeing every frame. Even when nothing is changed.
imm has some optimizations that the creating and freeing isn't done and the buffers are tagged in such a way that the best memory is chosen to reduce time spent in uploading.

What we do for the overlay engine (and other editors as well) is to use a layered approach. Eg when selected keys should always be on top. Create a different set of batches only containing the selected keys. and those keys will not be present in the other batches. Render the set of batches one at a time.

Curve drawing is a very researched topic. See https://www.youtube.com/watch?v=M-IMQT7yKVI for example. I don't think this PR should aim for that, but there are other ways to look at the problem.

My approach would be to use matrix transformation and static batches. Perhaps a specific fragment shader to draw lines, outlines etc. But that would become closer to an actual overlay engine and we might want to move it then inside the draw module as there more tools exist for optimizing complex drawing code.

It is good to actually use index buffers and vertex buffers. It would be even better when you can reuse them between frames and tag them when they need to be updated. This way the data transfer can be removed when nothing special happens on the screen. Now you're creating, building, uploading, drawing and freeing every frame. Even when nothing is changed. `imm` has some optimizations that the creating and freeing isn't done and the buffers are tagged in such a way that the best memory is chosen to reduce time spent in uploading. What we do for the overlay engine (and other editors as well) is to use a layered approach. Eg when selected keys should always be on top. Create a different set of batches only containing the selected keys. and those keys will not be present in the other batches. Render the set of batches one at a time. Curve drawing is a very researched topic. See https://www.youtube.com/watch?v=M-IMQT7yKVI for example. I don't think this PR should aim for that, but there are other ways to look at the problem. My approach would be to use matrix transformation and static batches. Perhaps a specific fragment shader to draw lines, outlines etc. But that would become closer to an actual overlay engine and we might want to move it then inside the draw module as there more tools exist for optimizing complex drawing code.

👍 1

Christoph Lendenfeld commented

2024-02-27 17:16:09 +01:00

What we do for the overlay engine (and other editors as well) is to use a layered approach. Eg when selected keys should always be on top. Create a different set of batches only containing the selected keys. and those keys will not be present in the other batches. Render the set of batches one at a time.

If I understand this right, then I need to rebuild the vertex buffer whenever anything changes. (because I need selected verts in a different buffer)
But as long as nothing changes I can just keep reusing it (probably stored on the SpaceGraph struct)

How do I deal with the optimization of only looking at keys that are within the viewport extents.
That means the vertex buffers would need updating every time the viewport moves. I think this is just not going to work when I keep the vertex buffers around.

My approach would be to use matrix transformation and static batches

That means I would upload the whole data to the GPU and just give it an extra matrix so it moves around accordingly when panning the view?

Perhaps a specific fragment shader to draw lines, outlines

I thought about that as well but wasn't sure if that's the way to go. If you say that's a good thing to do I can give it a try

> What we do for the overlay engine (and other editors as well) is to use a layered approach. Eg when selected keys should always be on top. Create a different set of batches only containing the selected keys. and those keys will not be present in the other batches. Render the set of batches one at a time. If I understand this right, then I need to rebuild the vertex buffer whenever anything changes. (because I need selected verts in a different buffer) But as long as nothing changes I can just keep reusing it (probably stored on the `SpaceGraph` struct) How do I deal with the optimization of only looking at keys that are within the viewport extents. That means the vertex buffers would need updating every time the viewport moves. I think this is just not going to work when I keep the vertex buffers around. > My approach would be to use matrix transformation and static batches That means I would upload the whole data to the GPU and just give it an extra matrix so it moves around accordingly when panning the view? > Perhaps a specific fragment shader to draw lines, outlines I thought about that as well but wasn't sure if that's the way to go. If you say that's a good thing to do I can give it a try

Jeroen Bakker commented

2024-02-27 18:58:48 +01:00

GPU automatically removes any geometry that isn't in the viewport, so would not be my first concern. I rather have a single buffer that contains everything, compared to updating the buffer for every tweak. If a bottleneck appears, we can always make the data set smaller.

Perhaps it is better to first do some prototyping. Some of these tasks are hard to implement, or might need to be implemented in a different area for more efficiency.

Rebuilding when something changes and split batches based on usage would be a first step. You can use a single vertex buffer, but have 3 index buffers to indicate which part of the vertex buffer is selected, active or inactive.

GPU automatically removes any geometry that isn't in the viewport, so would not be my first concern. I rather have a single buffer that contains everything, compared to updating the buffer for every tweak. If a bottleneck appears, we can always make the data set smaller. Perhaps it is better to first do some prototyping. Some of these tasks are hard to implement, or might need to be implemented in a different area for more efficiency. Rebuilding when something changes and split batches based on usage would be a first step. You can use a single vertex buffer, but have 3 index buffers to indicate which part of the vertex buffer is selected, active or inactive.

Hans Goudey commented

2024-02-27 22:59:49 +01:00

I think Jeroen has a great point about not re-uploading vertex buffers for panning. Using index buffers for selection and active state makes sense too.

The other intuitive thing I'd suggest is avoiding the intermediate buffers by first counting the number of points that need to be uploaded, allocating vertex buffers of exactly that size, then filling them in directly. There may be smaller portions of memory you can collect serially in the first pass to help multithread the last VBO filling pass.

I think Jeroen has a great point about not re-uploading vertex buffers for panning. Using index buffers for selection and active state makes sense too. The other intuitive thing I'd suggest is avoiding the intermediate buffers by first counting the number of points that need to be uploaded, allocating vertex buffers of exactly that size, then filling them in directly. There may be smaller portions of memory you can collect serially in the first pass to help multithread the last VBO filling pass.

Jeroen Bakker commented

2024-02-29 08:29:27 +01:00

Other solution would be to store these buffers per fcurve in the graph editor as runtime data. Only updating the buffers that are altered. This saves unneeded cycles updating buffers that wasn't touched and uploads smaller sets of data to the GPU. This will lead to more draw calls, but that should be optimized by the GPU backends.

Our GPU module doesn't support uploading partial buffers. It could be added, but the hard part is always on the caller side keeping track of offsets and sizes, so we never added support for it.

Other solution would be to store these buffers per fcurve in the graph editor as runtime data. Only updating the buffers that are altered. This saves unneeded cycles updating buffers that wasn't touched and uploads smaller sets of data to the GPU. This will lead to more draw calls, but that should be optimized by the GPU backends. Our GPU module doesn't support uploading partial buffers. It could be added, but the hard part is always on the caller side keeping track of offsets and sizes, so we never added support for it.

Christoph Lendenfeld commented

2024-02-29 12:41:33 +01:00

GPU automatically removes any geometry that isn't in the viewport, so would not be my first concern. I rather have a single buffer that contains everything, compared to updating the buffer for every tweak. If a bottleneck appears, we can always make the data set smaller.

Thanks got you. That was an optimization for IMM because the fewer things we add to IMM the faster. If we keep the buffers around this no longer applies.

Perhaps it is better to first do some prototyping. Some of these tasks are hard to implement, or might need to be implemented in a different area for more efficiency.

I treat this PR as a prototype where I can try things out :)

You can use a single vertex buffer, but have 3 index buffers to indicate which part of the vertex buffer is selected, active or inactive.

Thanks that makes sense, I didn't realize you can have different index buffers for the same vertex buffer.

The other intuitive thing I'd suggest is avoiding the intermediate buffers

Yes I will try that next.

Other solution would be to store these buffers per fcurve in the graph editor as runtime data

I will also give that a try. Just need to figure out how to know which data has been modified.

@Jeroen-Bakker on the topic of curve drawing. Do you think it's feasable to move the curve drawing to the GPU completely?
We already upload the keys and handles to the GPU. Instead of calculating the bezier interpolation on the CPU and uploading a line strip it might be possible to do that all on the GPU.
The reason I am asking is because we optimized the curve drawing to be view dependent. So we'd need to rebuild it on every draw call anyway since zooming in and out changes the points

> GPU automatically removes any geometry that isn't in the viewport, so would not be my first concern. I rather have a single buffer that contains everything, compared to updating the buffer for every tweak. If a bottleneck appears, we can always make the data set smaller. Thanks got you. That was an optimization for IMM because the fewer things we add to IMM the faster. If we keep the buffers around this no longer applies. > Perhaps it is better to first do some prototyping. Some of these tasks are hard to implement, or might need to be implemented in a different area for more efficiency. I treat this PR as a prototype where I can try things out :) > You can use a single vertex buffer, but have 3 index buffers to indicate which part of the vertex buffer is selected, active or inactive. Thanks that makes sense, I didn't realize you can have different index buffers for the same vertex buffer. > The other intuitive thing I'd suggest is avoiding the intermediate buffers Yes I will try that next. > Other solution would be to store these buffers per fcurve in the graph editor as runtime data I will also give that a try. Just need to figure out how to know which data has been modified. @Jeroen-Bakker on the topic of curve drawing. Do you think it's feasable to move the curve drawing to the GPU completely? We already upload the keys and handles to the GPU. Instead of calculating the bezier interpolation on the CPU and uploading a line strip it might be possible to do that all on the GPU. The reason I am asking is because we optimized the curve drawing to be view dependent. So we'd need to rebuild it on every draw call anyway since zooming in and out changes the points

Jeroen Bakker commented

2024-02-29 16:36:21 +01:00

GPU based interpolation is straight forward for OpenGL and vulkan. Issue might be metal. Would first start without gpu based interpolation, but keep in mind that this is a good optimization as a future step.

Thinking of it I would use compute shaders to move the interpolation to the GPU in that future goal

GPU based interpolation is straight forward for OpenGL and vulkan. Issue might be metal. Would first start without gpu based interpolation, but keep in mind that this is a good optimization as a future step. Thinking of it I would use compute shaders to move the interpolation to the GPU in that future goal

👍 1

Christoph Lendenfeld added 2 commits 2024-03-07 14:49:07 +01:00

Merge branch 'main' into ge_gpu_batch b8f8c92b7d

use bezier::evaluate_segment 94a5a0812d

Christoph Lendenfeld added 2 commits 2024-03-08 16:40:23 +01:00

Merge branch 'main' into ge_gpu_batch 17a5e2e76c

graph_intern.h to c++ c42d492a99

This pull request has changes conflicting with the target branch.