DRW: Curves: Indexbuf optimization for large numbers of curves #116617

Eugene-Kuznetsov · 2023-12-29T01:50:22+01:00

Eugene-Kuznetsov commented

2023-12-29 01:50:22 +01:00

This optimizes a few loops that become significant bottlenecks during
viewport rendering of scenes with large numbers of curves.

To render a curves object, Blender needs to generate a potentially
very large (but trivial) index buffer. As previously implemented,
this index buffer is generated in an extremely inefficient manner,
with a single-threaded loop and an explicit function call per entry.
The buffer then needs to be pushed onto the GPU, which is also a fairly
slow task.

The PR generates the index buffer directly on the GPU with compute
shader.

Test case attached. Branch with timing code: https://projects.blender.org/Eugene-Kuznetsov/blender/src/branch/ek_curves_2d_draw_timing

(Separated from #116545.)

Version	fps, 5955WX, AMD RX7900, Linux	fps, Ryzen 7 5800, AMD RX6900, Windows 10
Main	2.7	2.0
efd457aef5cce6187fabb6e5ac51b5083a816ee0	4.5	2.3
5d06fed12755993130c91cb02fe7b2398da628cb	4.5	3.0
d8e639bfbbae012d558490955ff2386ae2407d98	5.2	3.8

This optimizes a few loops that become significant bottlenecks during viewport rendering of scenes with large numbers of curves. To render a curves object, Blender needs to generate a potentially very large (but trivial) index buffer. As previously implemented, this index buffer is generated in an extremely inefficient manner, with a single-threaded loop and an explicit function call per entry. The buffer then needs to be pushed onto the GPU, which is also a fairly slow task. The PR generates the index buffer directly on the GPU with compute shader. Test case attached. Branch with timing code: https://projects.blender.org/Eugene-Kuznetsov/blender/src/branch/ek_curves_2d_draw_timing (Separated from https://projects.blender.org/blender/blender/pulls/116545.) Version | fps, 5955WX, AMD RX7900, Linux | fps, Ryzen 7 5800, AMD RX6900, Windows 10 --- | --- | --- Main | 2.7 | 2.0 efd457aef5cce6187fabb6e5ac51b5083a816ee0 | 4.5 | 2.3 5d06fed12755993130c91cb02fe7b2398da628cb | 4.5 | 3.0 d8e639bfbbae012d558490955ff2386ae2407d98 | 5.2 | 3.8

rug.blend

1.7 MiB

👍 1 ❤️ 2

Eugene-Kuznetsov force-pushed ek_curves_2d_draw from 60b7e90ddb to b08777f001

2023-12-29 01:52:05 +01:00

Compare

Iliya Katushenock added this to the EEVEE & Viewport project 2023-12-29 09:15:03 +01:00

Eugene-Kuznetsov force-pushed ek_curves_2d_draw from b08777f001 to 85275516d7

2024-01-01 04:35:20 +01:00

Compare

Eugene-Kuznetsov force-pushed ek_curves_2d_draw from 85275516d7 to f44868e93a

2024-01-09 09:38:35 +01:00

Compare

Eugene-Kuznetsov force-pushed ek_curves_2d_draw from f44868e93a to d8e639bfbb

2024-01-09 10:27:12 +01:00

Compare

Eugene-Kuznetsov referenced this pull request

2024-01-10 06:35:16 +01:00

Improved Surface Deform performance #116545

Brecht Van Lommel requested review from Jeroen Bakker 2024-01-12 15:37:31 +01:00

Brecht Van Lommel requested review from Clément Foucault 2024-01-12 15:37:32 +01:00

Brecht Van Lommel requested review from Miguel Pozo 2024-01-12 15:37:32 +01:00

Miguel Pozo commented

2024-01-12 17:31:28 +01:00

I think @HooglyBoogly might want to take a look at this one too.

Hans Goudey commented

2024-01-12 17:39:47 +01:00

Overall, nice improvements! I have a couple concerns though:

This curve drawing code has significant changes planned. See #96455 for details. I don't think most of this code would remain at that point.
The performance issues with GPU_indexbuf_add_tri_verts and GPU_indexbuf_add_line_verts and other similar functions are widespread. Mesh draw extraction has the same issue. A more general solution might be adding a different API that exposes the raw data of the index buffer to the caller.
Moving some index buffer creation to a shader might make sense (others would know better), but given this will change a lot as this code is improved in the nearish future, it might not be worth doing right now.

Overall, nice improvements! I have a couple concerns though: - This curve drawing code has significant changes planned. See #96455 for details. I don't think most of this code would remain at that point. - The performance issues with `GPU_indexbuf_add_tri_verts` and `GPU_indexbuf_add_line_verts` and other similar functions are widespread. Mesh draw extraction has the same issue. A more general solution might be adding a different API that exposes the raw data of the index buffer to the caller. - Moving some index buffer creation to a shader might make sense (others would know better), but given this will change a lot as this code is improved in the nearish future, it might not be worth doing right now.

Eugene-Kuznetsov commented

2024-01-12 17:51:55 +01:00

Overall, nice improvements! I have a couple concerns though:

This curve drawing code has significant changes planned. See #96455 for details. I don't think most of this code would remain at that point.

The performance issues with GPU_indexbuf_add_tri_verts and GPU_indexbuf_add_line_verts and other similar functions are widespread. Mesh draw extraction has the same issue. A more general solution might be adding a different API that exposes the raw data of the index buffer to the caller.

Moving some index buffer creation to a shader might make sense (others would know better), but given this will change a lot as this code is improved in the nearish future, it might not be worth doing right now.

What's "nearish future"? #96455 is 2 years old and it does not reference any ongoing development. If someone is working on this now, that's one thing. If it's something that's probably going happen in another 2 years, that's something else entirely.

> Overall, nice improvements! I have a couple concerns though: > - This curve drawing code has significant changes planned. See #96455 for details. I don't think most of this code would remain at that point. > - The performance issues with `GPU_indexbuf_add_tri_verts` and `GPU_indexbuf_add_line_verts` and other similar functions are widespread. Mesh draw extraction has the same issue. A more general solution might be adding a different API that exposes the raw data of the index buffer to the caller. > - Moving some index buffer creation to a shader might make sense (others would know better), but given this will change a lot as this code is improved in the nearish future, it might not be worth doing right now. What's "nearish future"? #96455 is 2 years old and it does not reference any ongoing development. If someone is working on this now, that's one thing. If it's something that's probably going happen in another 2 years, that's something else entirely.

Hans Goudey commented

2024-01-12 18:35:56 +01:00

That's probably for @fclem to answer. Anyway, for now I'd suggest focusing on adding a different API for filling index buffers that can be used elsewhere too (for example extract_tris_iter_face_mesh). But others may have different opinions.

That's probably for @fclem to answer. Anyway, for now I'd suggest focusing on adding a different API for filling index buffers that can be used elsewhere too (for example `extract_tris_iter_face_mesh`). But others may have different opinions.

Jeroen Bakker reviewed 2024-01-15 09:35:03 +01:00

Jeroen Bakker left a comment

Hi thanks for this patch. I do believe in the long run generating index buffers using shaders for generative topologies would be desired, independent on other performance improvements we are planning. So +1 for me on this patch.

The approach is quite similar to what we do for curves/hair drawing and think we should separate the main part of the code to something that would also become reusable for other parts as well.

We also need to check how to handle Metal pipeline as due to their architecture it might be more wise to use a graphics shader for it.

Hi thanks for this patch. I do believe in the long run generating index buffers using shaders for generative topologies would be desired, independent on other performance improvements we are planning. So +1 for me on this patch. The approach is quite similar to what we do for curves/hair drawing and think we should separate the main part of the code to something that would also become reusable for other parts as well. We also need to check how to handle Metal pipeline as due to their architecture it might be more wise to use a graphics shader for it.

source/blender/draw/intern/draw_cache_impl_curves.cc Outdated

						
				@ -485,20 +463,22 @@ static void curves_batch_cache_ensure_procedural_indices(const bke::CurvesGeomet

				  int verts_per_curve;

				  int element_count;

				  GPUPrimType prim_type;

				  int x_dim;

Jeroen Bakker commented

2024-01-15 09:14:00 +01:00

dispatch_x_size seems a better description

`dispatch_x_size` seems a better description

Eugene-Kuznetsov marked this conversation as resolved

source/blender/draw/intern/draw_cache_impl_curves.cc Outdated

						
				@ -516,0 +492,4 @@

				  GPUIndexBuf *ib;

				  if(prim_type==GPU_PRIM_LINE_STRIP || prim_type==GPU_PRIM_TRI_STRIP

				                                    || prim_type==GPU_PRIM_TRIS) {

				      bool tris = (prim_type==GPU_PRIM_TRIS);

Jeroen Bakker commented

2024-01-15 09:22:51 +01:00

Would rather make this a backend responsibility.

Using compute shaders in a graphics pipeline triggers two context switches. The metal therefore prefers to use graphics pipeline.

See draw_curves.cc (although this is still done inside the draw manager). In the future we it is the plan to make the draw manager part of the gpu module so you can also add it to the draw manager similar as draw_curves.cc

Depends on your solution we should go over the location (draw vs gpu) and the naming convention that we use. https://developer.blender.org/docs/handbook/guidelines/glsl/

Would rather make this a backend responsibility. Using compute shaders in a graphics pipeline triggers two context switches. The metal therefore prefers to use graphics pipeline. See `draw_curves.cc` (although this is still done inside the draw manager). In the future we it is the plan to make the draw manager part of the gpu module so you can also add it to the draw manager similar as `draw_curves.cc` Depends on your solution we should go over the location (draw vs gpu) and the naming convention that we use. https://developer.blender.org/docs/handbook/guidelines/glsl/

Eugene-Kuznetsov commented

2024-01-15 16:18:33 +01:00

I'm not sure what you mean by this. Do you want the whole operation moved into draw_curves.cc? Or do you want an option to switch between a compute shader and a "transform feedback" shader?

I do not have a Mac so I can't test behavior with metal. l can test this with strip primitives manually disabled, but it might be safer to disable the whole pipeline until it can be properly tested on that hardware.

I'm not sure what you mean by this. Do you want the whole operation moved into draw_curves.cc? Or do you want an option to switch between a compute shader and a "transform feedback" shader? I do not have a Mac so I can't test behavior with metal. l can test this with strip primitives manually disabled, but it might be safer to disable the whole pipeline until it can be properly tested on that hardware.

Jeroen Bakker commented

2024-01-18 11:49:21 +01:00

add a new function to GPU_index_buffer.h/cc

void GPU_indexbuf_build_curves_on_device(index_buf, prim_type, curves_num, verts_per_curve)

add a new function to `GPU_index_buffer.h/cc` `void GPU_indexbuf_build_curves_on_device(index_buf, prim_type, curves_num, verts_per_curve)`

Eugene-Kuznetsov marked this conversation as resolved

source/blender/draw/intern/draw_cache_impl_curves.cc Outdated

						
				@ -516,0 +498,4 @@

				          : DRW_shader_curves_2d_points_get();

				      GPU_shader_bind(shader);

				      GPU_memory_barrier(GPU_BARRIER_BUFFER_UPDATE);

				      auto tstart = std::chrono::high_resolution_clock::now();

Jeroen Bakker commented

2024-01-15 09:15:19 +01:00

This is debug code and should eventually be removed. Note that Blender has its own developer debug timing functions in PIL_time_utildefines.h

This is debug code and should eventually be removed. Note that Blender has its own developer debug timing functions in `PIL_time_utildefines.h`

Eugene-Kuznetsov marked this conversation as resolved

source/blender/draw/intern/draw_cache_impl_curves.cc Outdated

						
				@ -516,1 +507,4 @@

				      GPU_indexbuf_bind_as_ssbo(ib, GPU_shader_get_ssbo_binding(shader, "out_indices"));

				      /* Maximum workgroup dimension is possibly platform-dependent, but OpenGL promises

				       * at least 65535 on each axis. We have to divide across 2 axes, because we may have

Jeroen Bakker commented

2024-01-15 09:16:51 +01:00

You can use GPU_max_work_group_size(...) to find actual device supported workgroup sizes.

You can use `GPU_max_work_group_size(...)` to find actual device supported workgroup sizes.

Eugene-Kuznetsov marked this conversation as resolved

source/blender/gpu/intern/gpu_index_buffer_private.hh Outdated

						
				@ -49,2 +49,4 @@

				  /** True if buffer only contains restart indices. */

				  bool is_empty_ = false;

				  /** GPUUsageType from GPU_vertex_buffer.h; defaults to GPU_USAGE_STATIC */

				  uint8_t usage_hint_ = 1;

Jeroen Bakker commented

2024-01-15 09:18:25 +01:00

Why not use GPUUsageType and GPU_USAGE_STATIC here.

Why not use `GPUUsageType` and `GPU_USAGE_STATIC` here.

Eugene-Kuznetsov commented

2024-01-15 16:50:17 +01:00

I would have to move GPUUsageType out of GPU_vertex_buffer.h and into some common header, maybe to GPU_common_types.h

Jeroen Bakker commented

2024-01-18 11:20:32 +01:00

You can do the same what GPU_storage_buffer does. Just include the GPU_vertex_buffer.h in GPU_index_buffer.h. I didn't test, but that should be sufficient.

You can do the same what `GPU_storage_buffer` does. Just include the `GPU_vertex_buffer.h` in `GPU_index_buffer.h`. I didn't test, but that should be sufficient.

Eugene-Kuznetsov marked this conversation as resolved

Jeroen Bakker reviewed 2024-01-18 11:49:33 +01:00

source/blender/draw/intern/draw_cache_impl_curves.cc Outdated

						
				@ -516,0 +502,4 @@

				      GPUIndexBuf* arg = GPU_indexbuf_build_on_device(2);

				      uint size[2] = {curves.curves_num(), (uint)x_dim};

				      ib = GPU_indexbuf_build_on_device(element_count);

				      GPU_indexbuf_bind_as_ssbo(arg, GPU_shader_get_ssbo_binding(shader, "ncurves"));

Jeroen Bakker commented

2024-01-18 11:32:24 +01:00

Should use push_constants/or uniforms to send parameters to the shader.

Eugene-Kuznetsov marked this conversation as resolved

source/blender/draw/intern/draw_shader.cc Outdated

						
				@ -148,6 +150,22 @@ GPUShader *DRW_shader_draw_command_generate_get()

				  return e_data.draw_command_generate_sh;

				}

				GPUShader *DRW_shader_curves_2d_points_get()

Jeroen Bakker commented

2024-01-18 11:39:59 +01:00

These should become a builtin shaders in the gpu module.
See eGPUBuiltinShader

These should become a builtin shaders in the gpu module. See `eGPUBuiltinShader`

Eugene-Kuznetsov marked this conversation as resolved

source/blender/gpu/shaders/infos/gpu_compute_shaders_info.hh Outdated

						
				@ -0,0 +10,4 @@

				#include "gpu_shader_create_info.hh"

				GPU_SHADER_CREATE_INFO(gpu_compute_2d_array_points)

				    .local_group_size(1,1,1)

Jeroen Bakker commented

2024-01-18 11:34:07 +01:00

Have you experiment with different local_group_sizes. IMO this should have a performance impact as the shaders are small.

Eugene-Kuznetsov commented

2024-01-23 22:45:01 +01:00

In my tests, even for largest curves objects my system can handle (e.g 1 million curves and 64 vertexes per curve), this shader takes a fraction of a millisecond to execute as it is.

EDIT: looks like my timing information was wrong, because neither GPU_memory_barrier((eGPUBarrier)65535) nor GPU_fence_signal/GPU_fence_wait were actually forcing synchronization. I had to replace glWaitSync with glClientWaitSync in GLFence::wait() to get good timing.

When I measure it properly, it's not quite as fast as that, and raising local_group_size does help quite a bit.

In my tests, even for largest curves objects my system can handle (e.g 1 million curves and 64 vertexes per curve), this shader takes a fraction of a millisecond to execute as it is. EDIT: looks like my timing information was wrong, because neither GPU_memory_barrier((eGPUBarrier)65535) nor GPU_fence_signal/GPU_fence_wait were actually forcing synchronization. I had to replace glWaitSync with glClientWaitSync in GLFence::wait() to get good timing. When I measure it properly, it's not quite as fast as that, and raising local_group_size does help quite a bit.

👍 1

Eugene-Kuznetsov marked this conversation as resolved

source/blender/gpu/shaders/infos/gpu_compute_shaders_info.hh Outdated

						
				@ -0,0 +11,4 @@

				GPU_SHADER_CREATE_INFO(gpu_compute_2d_array_points)

				    .local_group_size(1,1,1)

				    .storage_buf(0, Qualifier::READ, "uint", "ncurves[]")

Jeroen Bakker commented

2024-01-18 11:35:57 +01:00

This should be two push_constants. No need to setup a buffer to transfer parameters to a shader. Push constants are faster as they are stored in more performant memory areas, then buffers.

Similar for the other shader

This should be two push_constants. No need to setup a buffer to transfer parameters to a shader. Push constants are faster as they are stored in more performant memory areas, then buffers. Similar for the other shader

Eugene-Kuznetsov commented

2024-01-23 22:52:46 +01:00

It was my impression from https://www.khronos.org/opengl/wiki/Compute_Shader that I have to use buffers to transfer data to compute shaders. (Though, be honest, my knowledge of shaders is extremely limited, I'm more familiar with CUDA.) If that is incorrect, I'll try to figure out how to make this work with uniforms.

Eugene-Kuznetsov marked this conversation as resolved

Eugene-Kuznetsov force-pushed ek_curves_2d_draw from d8e639bfbb to 2cf50c1a3f

2024-01-24 03:23:12 +01:00

Compare

Eugene-Kuznetsov force-pushed ek_curves_2d_draw from 2cf50c1a3f to 54fa49d50d

2024-01-24 04:58:00 +01:00

Compare

Clément Foucault requested changes 2024-01-24 08:44:40 +01:00

Dismissed

Clément Foucault left a comment

Given that 4.0 already requires compute shader support, I would remove the CPU fallback.

I would also remove GPU_indexbuf_set_usage_hint (this is a state modification function that is not easily supported on other backend) and just use GPU_indexbuf_init_build_on_device with the compute shader.

The patch is also missing formatting.

Given that 4.0 already requires compute shader support, I would remove the CPU fallback. I would also remove `GPU_indexbuf_set_usage_hint` (this is a state modification function that is not easily supported on other backend) and just use `GPU_indexbuf_init_build_on_device` with the compute shader. The patch is also missing formatting.

source/blender/gpu/intern/gpu_index_buffer.cc Outdated

						
				@ -244,0 +330,4 @@

				  if(grid_y <= max_grid_y) {

				    grid_z = 1;

				  } else {

				    grid_y = grid_z = (uint64_t)ceil(sqrt(double(grid_y)));

Clément Foucault commented

2024-01-24 08:39:29 +01:00

Use functional cast in C++.

Eugene-Kuznetsov marked this conversation as resolved

source/blender/gpu/opengl/gl_index_buffer.cc Outdated

						
				@ -36,2 +36,2 @@

				    /* Sends data to GPU. */

				    glBufferData(GL_ELEMENT_ARRAY_BUFFER, size, data_, GL_STATIC_DRAW);

				    glBufferData(GL_ELEMENT_ARRAY_BUFFER, size, data_, 

				      usage_hint_==0 ? GL_STREAM_DRAW : GL_STATIC_DRAW);

Clément Foucault commented

2024-01-24 08:23:09 +01:00

Run make format in blender git directory.

Run `make format` in blender git directory.

Eugene-Kuznetsov marked this conversation as resolved

source/blender/gpu/shaders/gpu_compute_2d_array_points.glsl Outdated

						
				@ -0,0 +2,4 @@

				 *

				 * SPDX-License-Identifier: GPL-2.0-or-later */

				void main()

Clément Foucault commented

2024-01-24 08:38:19 +01:00

Would be nice if the shader files had a comment explaining what the output is (like the shape of the array of point and how it relates to the dispatch).

Eugene-Kuznetsov marked this conversation as resolved

source/blender/gpu/shaders/gpu_compute_2d_array_points.glsl Outdated

						
				@ -0,0 +4,4 @@

				void main()

				{

				/*

Clément Foucault commented

2024-01-24 08:31:18 +01:00

Remove pseudo code.

Add comments throughout the code if that helps the understanding of it.

Remove pseudo code. Add comments throughout the code if that helps the understanding of it.

Eugene-Kuznetsov marked this conversation as resolved

source/blender/gpu/shaders/gpu_compute_2d_array_points.glsl Outdated

						
				@ -0,0 +18,4 @@

				  int y = int(gl_GlobalInvocationID.y) + int(gl_GlobalInvocationID.z)*int(gl_NumWorkGroups.y)*int(gl_WorkGroupSize.y);

				  int store_index = x + y*w;

				  if((x < w) && (y < max_curve)) {

				    if(x + 1 < w)

Clément Foucault commented

2024-01-24 08:29:08 +01:00

Use brackets even on single line if statements. See https://developer.blender.org/docs/handbook/guidelines/c_cpp/ .
Alternatively, you can use a ternary expression here since the code is quite small.

Use brackets even on single line if statements. See https://developer.blender.org/docs/handbook/guidelines/c_cpp/ . Alternatively, you can use a ternary expression here since the code is quite small.

Eugene-Kuznetsov marked this conversation as resolved

source/blender/gpu/shaders/infos/gpu_compute_shaders_info.hh Outdated

						
				@ -0,0 +11,4 @@

				GPU_SHADER_CREATE_INFO(gpu_compute_2d_array_points)

				    .local_group_size(16,16,1)

				    .push_constant(Type::IVEC2, "shape")

Clément Foucault commented

2024-01-24 08:44:37 +01:00

shape is not very descriptive for constant. Even more so that you split the vector inside the shader afterwards. So better just split the push constant into curves_num and verts_per_curve.

`shape` is not very descriptive for constant. Even more so that you split the vector inside the shader afterwards. So better just split the push constant into `curves_num` and `verts_per_curve`.

Eugene-Kuznetsov marked this conversation as resolved

Eugene-Kuznetsov force-pushed ek_curves_2d_draw from 54fa49d50d to 23d1af2ff5

2024-01-24 22:14:56 +01:00

Compare

Eugene-Kuznetsov force-pushed ek_curves_2d_draw from 23d1af2ff5 to d6b0bffbef

2024-01-24 22:15:48 +01:00

Compare

Eugene-Kuznetsov force-pushed ek_curves_2d_draw from d6b0bffbef to d1b217f1b6

2024-01-25 01:21:26 +01:00

Compare

Eugene-Kuznetsov force-pushed ek_curves_2d_draw from d1b217f1b6 to b8a62567a9

2024-01-26 23:27:59 +01:00

Compare

Eugene-Kuznetsov force-pushed ek_curves_2d_draw from b8a62567a9 to d27953c452

2024-01-26 23:38:31 +01:00

Compare

Eugene-Kuznetsov force-pushed ek_curves_2d_draw from d27953c452 to 96ca6ca967

2024-01-26 23:40:23 +01:00

Compare

Clément Foucault added 6 commits 2024-02-25 16:09:00 +01:00

fca61cd1de Merge branch 'main' into ek_curves_2d_draw

f662fffe45 Fix compilation on Metal and avoid mixed int type casts

8548a04868 Cleanups

74b1faa8ac Fix memory barrier types

6bf37ab5ec Renaming

3415a625c9 Cleanup codestyle discrepancy

Clément Foucault approved these changes 2024-02-25 17:19:36 +01:00

Clément Foucault left a comment

I had to patch it for Metal, as it wasn't building, and took the opportunity to fix a few cosmetic stuff.

On my MacBook Pro M1 Max I get a speedup from 0.96 fps to 5.41fps which is quite impressive.

Huge props to you!

I had to patch it for Metal, as it wasn't building, and took the opportunity to fix a few cosmetic stuff. On my MacBook Pro M1 Max I get a speedup from 0.96 fps to 5.41fps which is quite impressive. Huge props to you!

Clément Foucault changed title from ~~GPU_indexbuf optimization for large numbers of curves~~ to DRW: Curves: Indexbuf optimization for large numbers of curves

2024-02-25 17:20:23 +01:00

Clément Foucault merged commit 7f43699ebf into main

2024-02-25 17:23:06 +01:00

Clément Foucault referenced this issue from a commit

2024-02-25 17:23:07 +01:00

DRW: Curves: Indexbuf optimization for large numbers of curves

Laurynas Duburas referenced this pull request

2024-03-01 09:28:23 +01:00

Fix: GPU: Ensures length of curves GPUIndexBuf to be multiple of 4 #118951

Clément Foucault referenced this issue from a commit

2024-03-03 16:39:21 +01:00

Fix: GPU: Ensures length of curves GPUIndexBuf to be multiple of 4