DRW: Curves: Indexbuf optimization for large numbers of curves #116617

Merged
Clément Foucault merged 7 commits from Eugene-Kuznetsov/blender:ek_curves_2d_draw into main 2024-02-25 17:23:06 +01:00
Contributor

This optimizes a few loops that become significant bottlenecks during
viewport rendering of scenes with large numbers of curves.

To render a curves object, Blender needs to generate a potentially
very large (but trivial) index buffer. As previously implemented,
this index buffer is generated in an extremely inefficient manner,
with a single-threaded loop and an explicit function call per entry.
The buffer then needs to be pushed onto the GPU, which is also a fairly
slow task.

The PR generates the index buffer directly on the GPU with compute
shader.

Test case attached. Branch with timing code: https://projects.blender.org/Eugene-Kuznetsov/blender/src/branch/ek_curves_2d_draw_timing

(Separated from #116545.)

Version fps, 5955WX, AMD RX7900, Linux fps, Ryzen 7 5800, AMD RX6900, Windows 10
Main 2.7 2.0
efd457aef5cce6187fabb6e5ac51b5083a816ee0 4.5 2.3
5d06fed12755993130c91cb02fe7b2398da628cb 4.5 3.0
d8e639bfbbae012d558490955ff2386ae2407d98 5.2 3.8
This optimizes a few loops that become significant bottlenecks during viewport rendering of scenes with large numbers of curves. To render a curves object, Blender needs to generate a potentially very large (but trivial) index buffer. As previously implemented, this index buffer is generated in an extremely inefficient manner, with a single-threaded loop and an explicit function call per entry. The buffer then needs to be pushed onto the GPU, which is also a fairly slow task. The PR generates the index buffer directly on the GPU with compute shader. Test case attached. Branch with timing code: https://projects.blender.org/Eugene-Kuznetsov/blender/src/branch/ek_curves_2d_draw_timing (Separated from https://projects.blender.org/blender/blender/pulls/116545.) Version | fps, 5955WX, AMD RX7900, Linux | fps, Ryzen 7 5800, AMD RX6900, Windows 10 --- | --- | --- Main | 2.7 | 2.0 efd457aef5cce6187fabb6e5ac51b5083a816ee0 | 4.5 | 2.3 5d06fed12755993130c91cb02fe7b2398da628cb | 4.5 | 3.0 d8e639bfbbae012d558490955ff2386ae2407d98 | 5.2 | 3.8
1.7 MiB
Eugene-Kuznetsov force-pushed ek_curves_2d_draw from 60b7e90ddb to b08777f001 2023-12-29 01:52:05 +01:00 Compare
Iliya Katushenock added this to the EEVEE & Viewport project 2023-12-29 09:15:03 +01:00
Eugene-Kuznetsov force-pushed ek_curves_2d_draw from b08777f001 to 85275516d7 2024-01-01 04:35:20 +01:00 Compare
Eugene-Kuznetsov force-pushed ek_curves_2d_draw from 85275516d7 to f44868e93a 2024-01-09 09:38:35 +01:00 Compare
Eugene-Kuznetsov force-pushed ek_curves_2d_draw from f44868e93a to d8e639bfbb 2024-01-09 10:27:12 +01:00 Compare
Brecht Van Lommel requested review from Jeroen Bakker 2024-01-12 15:37:31 +01:00
Brecht Van Lommel requested review from Clément Foucault 2024-01-12 15:37:32 +01:00
Brecht Van Lommel requested review from Miguel Pozo 2024-01-12 15:37:32 +01:00
Member

I think @HooglyBoogly might want to take a look at this one too.

I think @HooglyBoogly might want to take a look at this one too.
Member

Overall, nice improvements! I have a couple concerns though:

  • This curve drawing code has significant changes planned. See #96455 for details. I don't think most of this code would remain at that point.
  • The performance issues with GPU_indexbuf_add_tri_verts and GPU_indexbuf_add_line_verts and other similar functions are widespread. Mesh draw extraction has the same issue. A more general solution might be adding a different API that exposes the raw data of the index buffer to the caller.
  • Moving some index buffer creation to a shader might make sense (others would know better), but given this will change a lot as this code is improved in the nearish future, it might not be worth doing right now.
Overall, nice improvements! I have a couple concerns though: - This curve drawing code has significant changes planned. See #96455 for details. I don't think most of this code would remain at that point. - The performance issues with `GPU_indexbuf_add_tri_verts` and `GPU_indexbuf_add_line_verts` and other similar functions are widespread. Mesh draw extraction has the same issue. A more general solution might be adding a different API that exposes the raw data of the index buffer to the caller. - Moving some index buffer creation to a shader might make sense (others would know better), but given this will change a lot as this code is improved in the nearish future, it might not be worth doing right now.
Author
Contributor

Overall, nice improvements! I have a couple concerns though:

  • This curve drawing code has significant changes planned. See #96455 for details. I don't think most of this code would remain at that point.
  • The performance issues with GPU_indexbuf_add_tri_verts and GPU_indexbuf_add_line_verts and other similar functions are widespread. Mesh draw extraction has the same issue. A more general solution might be adding a different API that exposes the raw data of the index buffer to the caller.
  • Moving some index buffer creation to a shader might make sense (others would know better), but given this will change a lot as this code is improved in the nearish future, it might not be worth doing right now.

What's "nearish future"? #96455 is 2 years old and it does not reference any ongoing development. If someone is working on this now, that's one thing. If it's something that's probably going happen in another 2 years, that's something else entirely.

> Overall, nice improvements! I have a couple concerns though: > - This curve drawing code has significant changes planned. See #96455 for details. I don't think most of this code would remain at that point. > - The performance issues with `GPU_indexbuf_add_tri_verts` and `GPU_indexbuf_add_line_verts` and other similar functions are widespread. Mesh draw extraction has the same issue. A more general solution might be adding a different API that exposes the raw data of the index buffer to the caller. > - Moving some index buffer creation to a shader might make sense (others would know better), but given this will change a lot as this code is improved in the nearish future, it might not be worth doing right now. What's "nearish future"? #96455 is 2 years old and it does not reference any ongoing development. If someone is working on this now, that's one thing. If it's something that's probably going happen in another 2 years, that's something else entirely.
Member

That's probably for @fclem to answer. Anyway, for now I'd suggest focusing on adding a different API for filling index buffers that can be used elsewhere too (for example extract_tris_iter_face_mesh). But others may have different opinions.

That's probably for @fclem to answer. Anyway, for now I'd suggest focusing on adding a different API for filling index buffers that can be used elsewhere too (for example `extract_tris_iter_face_mesh`). But others may have different opinions.
Jeroen Bakker reviewed 2024-01-15 09:35:03 +01:00
Jeroen Bakker left a comment
Member

Hi thanks for this patch. I do believe in the long run generating index buffers using shaders for generative topologies would be desired, independent on other performance improvements we are planning. So +1 for me on this patch.

The approach is quite similar to what we do for curves/hair drawing and think we should separate the main part of the code to something that would also become reusable for other parts as well.

We also need to check how to handle Metal pipeline as due to their architecture it might be more wise to use a graphics shader for it.

Hi thanks for this patch. I do believe in the long run generating index buffers using shaders for generative topologies would be desired, independent on other performance improvements we are planning. So +1 for me on this patch. The approach is quite similar to what we do for curves/hair drawing and think we should separate the main part of the code to something that would also become reusable for other parts as well. We also need to check how to handle Metal pipeline as due to their architecture it might be more wise to use a graphics shader for it.
@ -485,20 +463,22 @@ static void curves_batch_cache_ensure_procedural_indices(const bke::CurvesGeomet
int verts_per_curve;
int element_count;
GPUPrimType prim_type;
int x_dim;
Member

dispatch_x_size seems a better description

`dispatch_x_size` seems a better description
Eugene-Kuznetsov marked this conversation as resolved
@ -516,0 +492,4 @@
GPUIndexBuf *ib;
if(prim_type==GPU_PRIM_LINE_STRIP || prim_type==GPU_PRIM_TRI_STRIP
|| prim_type==GPU_PRIM_TRIS) {
bool tris = (prim_type==GPU_PRIM_TRIS);
Member

Would rather make this a backend responsibility.

Using compute shaders in a graphics pipeline triggers two context switches. The metal therefore prefers to use graphics pipeline.

See draw_curves.cc (although this is still done inside the draw manager). In the future we it is the plan to make the draw manager part of the gpu module so you can also add it to the draw manager similar as draw_curves.cc

Depends on your solution we should go over the location (draw vs gpu) and the naming convention that we use. https://developer.blender.org/docs/handbook/guidelines/glsl/

Would rather make this a backend responsibility. Using compute shaders in a graphics pipeline triggers two context switches. The metal therefore prefers to use graphics pipeline. See `draw_curves.cc` (although this is still done inside the draw manager). In the future we it is the plan to make the draw manager part of the gpu module so you can also add it to the draw manager similar as `draw_curves.cc` Depends on your solution we should go over the location (draw vs gpu) and the naming convention that we use. https://developer.blender.org/docs/handbook/guidelines/glsl/
Author
Contributor

I'm not sure what you mean by this. Do you want the whole operation moved into draw_curves.cc? Or do you want an option to switch between a compute shader and a "transform feedback" shader?

I do not have a Mac so I can't test behavior with metal. l can test this with strip primitives manually disabled, but it might be safer to disable the whole pipeline until it can be properly tested on that hardware.

I'm not sure what you mean by this. Do you want the whole operation moved into draw_curves.cc? Or do you want an option to switch between a compute shader and a "transform feedback" shader? I do not have a Mac so I can't test behavior with metal. l can test this with strip primitives manually disabled, but it might be safer to disable the whole pipeline until it can be properly tested on that hardware.
Member

add a new function to GPU_index_buffer.h/cc

void GPU_indexbuf_build_curves_on_device(index_buf, prim_type, curves_num, verts_per_curve)

add a new function to `GPU_index_buffer.h/cc` `void GPU_indexbuf_build_curves_on_device(index_buf, prim_type, curves_num, verts_per_curve)`
Eugene-Kuznetsov marked this conversation as resolved
@ -516,0 +498,4 @@
: DRW_shader_curves_2d_points_get();
GPU_shader_bind(shader);
GPU_memory_barrier(GPU_BARRIER_BUFFER_UPDATE);
auto tstart = std::chrono::high_resolution_clock::now();
Member

This is debug code and should eventually be removed. Note that Blender has its own developer debug timing functions in PIL_time_utildefines.h

This is debug code and should eventually be removed. Note that Blender has its own developer debug timing functions in `PIL_time_utildefines.h`
Eugene-Kuznetsov marked this conversation as resolved
@ -516,1 +507,4 @@
GPU_indexbuf_bind_as_ssbo(ib, GPU_shader_get_ssbo_binding(shader, "out_indices"));
/* Maximum workgroup dimension is possibly platform-dependent, but OpenGL promises
* at least 65535 on each axis. We have to divide across 2 axes, because we may have
Member

You can use GPU_max_work_group_size(...) to find actual device supported workgroup sizes.

You can use `GPU_max_work_group_size(...)` to find actual device supported workgroup sizes.
Eugene-Kuznetsov marked this conversation as resolved
@ -49,2 +49,4 @@
/** True if buffer only contains restart indices. */
bool is_empty_ = false;
/** GPUUsageType from GPU_vertex_buffer.h; defaults to GPU_USAGE_STATIC */
uint8_t usage_hint_ = 1;
Member

Why not use GPUUsageType and GPU_USAGE_STATIC here.

Why not use `GPUUsageType` and `GPU_USAGE_STATIC` here.
Author
Contributor

I would have to move GPUUsageType out of GPU_vertex_buffer.h and into some common header, maybe to GPU_common_types.h

I would have to move GPUUsageType out of GPU_vertex_buffer.h and into some common header, maybe to GPU_common_types.h
Member

You can do the same what GPU_storage_buffer does. Just include the GPU_vertex_buffer.h in GPU_index_buffer.h. I didn't test, but that should be sufficient.

You can do the same what `GPU_storage_buffer` does. Just include the `GPU_vertex_buffer.h` in `GPU_index_buffer.h`. I didn't test, but that should be sufficient.
Eugene-Kuznetsov marked this conversation as resolved
Jeroen Bakker reviewed 2024-01-18 11:49:33 +01:00
@ -516,0 +502,4 @@
GPUIndexBuf* arg = GPU_indexbuf_build_on_device(2);
uint size[2] = {curves.curves_num(), (uint)x_dim};
ib = GPU_indexbuf_build_on_device(element_count);
GPU_indexbuf_bind_as_ssbo(arg, GPU_shader_get_ssbo_binding(shader, "ncurves"));
Member

Should use push_constants/or uniforms to send parameters to the shader.

Should use push_constants/or uniforms to send parameters to the shader.
Eugene-Kuznetsov marked this conversation as resolved
@ -148,6 +150,22 @@ GPUShader *DRW_shader_draw_command_generate_get()
return e_data.draw_command_generate_sh;
}
GPUShader *DRW_shader_curves_2d_points_get()
Member

These should become a builtin shaders in the gpu module.
See eGPUBuiltinShader

These should become a builtin shaders in the gpu module. See `eGPUBuiltinShader`
Eugene-Kuznetsov marked this conversation as resolved
@ -0,0 +10,4 @@
#include "gpu_shader_create_info.hh"
GPU_SHADER_CREATE_INFO(gpu_compute_2d_array_points)
.local_group_size(1,1,1)
Member

Have you experiment with different local_group_sizes. IMO this should have a performance impact as the shaders are small.

Have you experiment with different local_group_sizes. IMO this should have a performance impact as the shaders are small.
Author
Contributor

In my tests, even for largest curves objects my system can handle (e.g 1 million curves and 64 vertexes per curve), this shader takes a fraction of a millisecond to execute as it is.

EDIT: looks like my timing information was wrong, because neither GPU_memory_barrier((eGPUBarrier)65535) nor GPU_fence_signal/GPU_fence_wait were actually forcing synchronization. I had to replace glWaitSync with glClientWaitSync in GLFence::wait() to get good timing.

When I measure it properly, it's not quite as fast as that, and raising local_group_size does help quite a bit.

In my tests, even for largest curves objects my system can handle (e.g 1 million curves and 64 vertexes per curve), this shader takes a fraction of a millisecond to execute as it is. EDIT: looks like my timing information was wrong, because neither GPU_memory_barrier((eGPUBarrier)65535) nor GPU_fence_signal/GPU_fence_wait were actually forcing synchronization. I had to replace glWaitSync with glClientWaitSync in GLFence::wait() to get good timing. When I measure it properly, it's not quite as fast as that, and raising local_group_size does help quite a bit.
Eugene-Kuznetsov marked this conversation as resolved
@ -0,0 +11,4 @@
GPU_SHADER_CREATE_INFO(gpu_compute_2d_array_points)
.local_group_size(1,1,1)
.storage_buf(0, Qualifier::READ, "uint", "ncurves[]")
Member

This should be two push_constants. No need to setup a buffer to transfer parameters to a shader. Push constants are faster as they are stored in more performant memory areas, then buffers.

Similar for the other shader

This should be two push_constants. No need to setup a buffer to transfer parameters to a shader. Push constants are faster as they are stored in more performant memory areas, then buffers. Similar for the other shader
Author
Contributor

It was my impression from https://www.khronos.org/opengl/wiki/Compute_Shader that I have to use buffers to transfer data to compute shaders. (Though, be honest, my knowledge of shaders is extremely limited, I'm more familiar with CUDA.) If that is incorrect, I'll try to figure out how to make this work with uniforms.

It was my impression from https://www.khronos.org/opengl/wiki/Compute_Shader that I have to use buffers to transfer data to compute shaders. (Though, be honest, my knowledge of shaders is extremely limited, I'm more familiar with CUDA.) If that is incorrect, I'll try to figure out how to make this work with uniforms.
Eugene-Kuznetsov marked this conversation as resolved
Eugene-Kuznetsov force-pushed ek_curves_2d_draw from d8e639bfbb to 2cf50c1a3f 2024-01-24 03:23:12 +01:00 Compare
Eugene-Kuznetsov force-pushed ek_curves_2d_draw from 2cf50c1a3f to 54fa49d50d 2024-01-24 04:58:00 +01:00 Compare
Clément Foucault requested changes 2024-01-24 08:44:40 +01:00
Dismissed
Clément Foucault left a comment
Member

Given that 4.0 already requires compute shader support, I would remove the CPU fallback.

I would also remove GPU_indexbuf_set_usage_hint (this is a state modification function that is not easily supported on other backend) and just use GPU_indexbuf_init_build_on_device with the compute shader.

The patch is also missing formatting.

Given that 4.0 already requires compute shader support, I would remove the CPU fallback. I would also remove `GPU_indexbuf_set_usage_hint` (this is a state modification function that is not easily supported on other backend) and just use `GPU_indexbuf_init_build_on_device` with the compute shader. The patch is also missing formatting.
@ -244,0 +330,4 @@
if(grid_y <= max_grid_y) {
grid_z = 1;
} else {
grid_y = grid_z = (uint64_t)ceil(sqrt(double(grid_y)));

Use functional cast in C++.

Use functional cast in C++.
Eugene-Kuznetsov marked this conversation as resolved
@ -36,2 +36,2 @@
/* Sends data to GPU. */
glBufferData(GL_ELEMENT_ARRAY_BUFFER, size, data_, GL_STATIC_DRAW);
glBufferData(GL_ELEMENT_ARRAY_BUFFER, size, data_,
usage_hint_==0 ? GL_STREAM_DRAW : GL_STATIC_DRAW);

Run make format in blender git directory.

Run `make format` in blender git directory.
Eugene-Kuznetsov marked this conversation as resolved
@ -0,0 +2,4 @@
*
* SPDX-License-Identifier: GPL-2.0-or-later */
void main()

Would be nice if the shader files had a comment explaining what the output is (like the shape of the array of point and how it relates to the dispatch).

Would be nice if the shader files had a comment explaining what the output is (like the shape of the array of point and how it relates to the dispatch).
Eugene-Kuznetsov marked this conversation as resolved
@ -0,0 +4,4 @@
void main()
{
/*

Remove pseudo code.

Add comments throughout the code if that helps the understanding of it.

Remove pseudo code. Add comments throughout the code if that helps the understanding of it.
Eugene-Kuznetsov marked this conversation as resolved
@ -0,0 +18,4 @@
int y = int(gl_GlobalInvocationID.y) + int(gl_GlobalInvocationID.z)*int(gl_NumWorkGroups.y)*int(gl_WorkGroupSize.y);
int store_index = x + y*w;
if((x < w) && (y < max_curve)) {
if(x + 1 < w)

Use brackets even on single line if statements. See https://developer.blender.org/docs/handbook/guidelines/c_cpp/ .
Alternatively, you can use a ternary expression here since the code is quite small.

Use brackets even on single line if statements. See https://developer.blender.org/docs/handbook/guidelines/c_cpp/ . Alternatively, you can use a ternary expression here since the code is quite small.
Eugene-Kuznetsov marked this conversation as resolved
@ -0,0 +11,4 @@
GPU_SHADER_CREATE_INFO(gpu_compute_2d_array_points)
.local_group_size(16,16,1)
.push_constant(Type::IVEC2, "shape")

shape is not very descriptive for constant. Even more so that you split the vector inside the shader afterwards. So better just split the push constant into curves_num and verts_per_curve.

`shape` is not very descriptive for constant. Even more so that you split the vector inside the shader afterwards. So better just split the push constant into `curves_num` and `verts_per_curve`.
Eugene-Kuznetsov marked this conversation as resolved
Eugene-Kuznetsov force-pushed ek_curves_2d_draw from 54fa49d50d to 23d1af2ff5 2024-01-24 22:14:56 +01:00 Compare
Eugene-Kuznetsov force-pushed ek_curves_2d_draw from 23d1af2ff5 to d6b0bffbef 2024-01-24 22:15:48 +01:00 Compare
Eugene-Kuznetsov force-pushed ek_curves_2d_draw from d6b0bffbef to d1b217f1b6 2024-01-25 01:21:26 +01:00 Compare
Eugene-Kuznetsov force-pushed ek_curves_2d_draw from d1b217f1b6 to b8a62567a9 2024-01-26 23:27:59 +01:00 Compare
Eugene-Kuznetsov force-pushed ek_curves_2d_draw from b8a62567a9 to d27953c452 2024-01-26 23:38:31 +01:00 Compare
Eugene-Kuznetsov force-pushed ek_curves_2d_draw from d27953c452 to 96ca6ca967 2024-01-26 23:40:23 +01:00 Compare
Clément Foucault added 6 commits 2024-02-25 16:09:00 +01:00
Clément Foucault approved these changes 2024-02-25 17:19:36 +01:00
Clément Foucault left a comment
Member

I had to patch it for Metal, as it wasn't building, and took the opportunity to fix a few cosmetic stuff.

On my MacBook Pro M1 Max I get a speedup from 0.96 fps to 5.41fps which is quite impressive.

Huge props to you!

I had to patch it for Metal, as it wasn't building, and took the opportunity to fix a few cosmetic stuff. On my MacBook Pro M1 Max I get a speedup from 0.96 fps to 5.41fps which is quite impressive. Huge props to you!
Clément Foucault changed title from GPU_indexbuf optimization for large numbers of curves to DRW: Curves: Indexbuf optimization for large numbers of curves 2024-02-25 17:20:23 +01:00
Clément Foucault merged commit 7f43699ebf into main 2024-02-25 17:23:06 +01:00
Sign in to join this conversation.
No Label
Interest
Alembic
Interest
Animation & Rigging
Interest
Asset Browser
Interest
Asset Browser Project Overview
Interest
Audio
Interest
Automated Testing
Interest
Blender Asset Bundle
Interest
BlendFile
Interest
Collada
Interest
Compatibility
Interest
Compositing
Interest
Core
Interest
Cycles
Interest
Dependency Graph
Interest
Development Management
Interest
EEVEE
Interest
EEVEE & Viewport
Interest
Freestyle
Interest
Geometry Nodes
Interest
Grease Pencil
Interest
ID Management
Interest
Images & Movies
Interest
Import Export
Interest
Line Art
Interest
Masking
Interest
Metal
Interest
Modeling
Interest
Modifiers
Interest
Motion Tracking
Interest
Nodes & Physics
Interest
OpenGL
Interest
Overlay
Interest
Overrides
Interest
Performance
Interest
Physics
Interest
Pipeline, Assets & IO
Interest
Platforms, Builds & Tests
Interest
Python API
Interest
Render & Cycles
Interest
Render Pipeline
Interest
Sculpt, Paint & Texture
Interest
Text Editor
Interest
Translations
Interest
Triaging
Interest
Undo
Interest
USD
Interest
User Interface
Interest
UV Editing
Interest
VFX & Video
Interest
Video Sequencer
Interest
Virtual Reality
Interest
Vulkan
Interest
Wayland
Interest
Workbench
Interest: X11
Legacy
Blender 2.8 Project
Legacy
Milestone 1: Basic, Local Asset Browser
Legacy
OpenGL Error
Meta
Good First Issue
Meta
Papercut
Meta
Retrospective
Meta
Security
Module
Animation & Rigging
Module
Core
Module
Development Management
Module
EEVEE & Viewport
Module
Grease Pencil
Module
Modeling
Module
Nodes & Physics
Module
Pipeline, Assets & IO
Module
Platforms, Builds & Tests
Module
Python API
Module
Render & Cycles
Module
Sculpt, Paint & Texture
Module
Triaging
Module
User Interface
Module
VFX & Video
Platform
FreeBSD
Platform
Linux
Platform
macOS
Platform
Windows
Priority
High
Priority
Low
Priority
Normal
Priority
Unbreak Now!
Status
Archived
Status
Confirmed
Status
Duplicate
Status
Needs Info from Developers
Status
Needs Information from User
Status
Needs Triage
Status
Resolved
Type
Bug
Type
Design
Type
Known Issue
Type
Patch
Type
Report
Type
To Do
No Milestone
No Assignees
5 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: blender/blender#116617
No description provided.