DRW: Curves: Indexbuf optimization for large numbers of curves #116617
No reviewers
Labels
No Label
Interest
Alembic
Interest
Animation & Rigging
Interest
Asset Browser
Interest
Asset Browser Project Overview
Interest
Audio
Interest
Automated Testing
Interest
Blender Asset Bundle
Interest
BlendFile
Interest
Collada
Interest
Compatibility
Interest
Compositing
Interest
Core
Interest
Cycles
Interest
Dependency Graph
Interest
Development Management
Interest
EEVEE
Interest
EEVEE & Viewport
Interest
Freestyle
Interest
Geometry Nodes
Interest
Grease Pencil
Interest
ID Management
Interest
Images & Movies
Interest
Import Export
Interest
Line Art
Interest
Masking
Interest
Metal
Interest
Modeling
Interest
Modifiers
Interest
Motion Tracking
Interest
Nodes & Physics
Interest
OpenGL
Interest
Overlay
Interest
Overrides
Interest
Performance
Interest
Physics
Interest
Pipeline, Assets & IO
Interest
Platforms, Builds & Tests
Interest
Python API
Interest
Render & Cycles
Interest
Render Pipeline
Interest
Sculpt, Paint & Texture
Interest
Text Editor
Interest
Translations
Interest
Triaging
Interest
Undo
Interest
USD
Interest
User Interface
Interest
UV Editing
Interest
VFX & Video
Interest
Video Sequencer
Interest
Virtual Reality
Interest
Vulkan
Interest
Wayland
Interest
Workbench
Interest: X11
Legacy
Blender 2.8 Project
Legacy
Milestone 1: Basic, Local Asset Browser
Legacy
OpenGL Error
Meta
Good First Issue
Meta
Papercut
Meta
Retrospective
Meta
Security
Module
Animation & Rigging
Module
Core
Module
Development Management
Module
EEVEE & Viewport
Module
Grease Pencil
Module
Modeling
Module
Nodes & Physics
Module
Pipeline, Assets & IO
Module
Platforms, Builds & Tests
Module
Python API
Module
Render & Cycles
Module
Sculpt, Paint & Texture
Module
Triaging
Module
User Interface
Module
VFX & Video
Platform
FreeBSD
Platform
Linux
Platform
macOS
Platform
Windows
Priority
High
Priority
Low
Priority
Normal
Priority
Unbreak Now!
Status
Archived
Status
Confirmed
Status
Duplicate
Status
Needs Info from Developers
Status
Needs Information from User
Status
Needs Triage
Status
Resolved
Type
Bug
Type
Design
Type
Known Issue
Type
Patch
Type
Report
Type
To Do
No Milestone
No project
No Assignees
5 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: blender/blender#116617
Loading…
Reference in New Issue
No description provided.
Delete Branch "Eugene-Kuznetsov/blender:ek_curves_2d_draw"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
This optimizes a few loops that become significant bottlenecks during
viewport rendering of scenes with large numbers of curves.
To render a curves object, Blender needs to generate a potentially
very large (but trivial) index buffer. As previously implemented,
this index buffer is generated in an extremely inefficient manner,
with a single-threaded loop and an explicit function call per entry.
The buffer then needs to be pushed onto the GPU, which is also a fairly
slow task.
The PR generates the index buffer directly on the GPU with compute
shader.
Test case attached. Branch with timing code: https://projects.blender.org/Eugene-Kuznetsov/blender/src/branch/ek_curves_2d_draw_timing
(Separated from #116545.)
60b7e90ddb
tob08777f001
b08777f001
to85275516d7
85275516d7
tof44868e93a
f44868e93a
tod8e639bfbb
I think @HooglyBoogly might want to take a look at this one too.
Overall, nice improvements! I have a couple concerns though:
GPU_indexbuf_add_tri_verts
andGPU_indexbuf_add_line_verts
and other similar functions are widespread. Mesh draw extraction has the same issue. A more general solution might be adding a different API that exposes the raw data of the index buffer to the caller.What's "nearish future"? #96455 is 2 years old and it does not reference any ongoing development. If someone is working on this now, that's one thing. If it's something that's probably going happen in another 2 years, that's something else entirely.
That's probably for @fclem to answer. Anyway, for now I'd suggest focusing on adding a different API for filling index buffers that can be used elsewhere too (for example
extract_tris_iter_face_mesh
). But others may have different opinions.Hi thanks for this patch. I do believe in the long run generating index buffers using shaders for generative topologies would be desired, independent on other performance improvements we are planning. So +1 for me on this patch.
The approach is quite similar to what we do for curves/hair drawing and think we should separate the main part of the code to something that would also become reusable for other parts as well.
We also need to check how to handle Metal pipeline as due to their architecture it might be more wise to use a graphics shader for it.
@ -485,20 +463,22 @@ static void curves_batch_cache_ensure_procedural_indices(const bke::CurvesGeomet
int verts_per_curve;
int element_count;
GPUPrimType prim_type;
int x_dim;
dispatch_x_size
seems a better description@ -516,0 +492,4 @@
GPUIndexBuf *ib;
if(prim_type==GPU_PRIM_LINE_STRIP || prim_type==GPU_PRIM_TRI_STRIP
|| prim_type==GPU_PRIM_TRIS) {
bool tris = (prim_type==GPU_PRIM_TRIS);
Would rather make this a backend responsibility.
Using compute shaders in a graphics pipeline triggers two context switches. The metal therefore prefers to use graphics pipeline.
See
draw_curves.cc
(although this is still done inside the draw manager). In the future we it is the plan to make the draw manager part of the gpu module so you can also add it to the draw manager similar asdraw_curves.cc
Depends on your solution we should go over the location (draw vs gpu) and the naming convention that we use. https://developer.blender.org/docs/handbook/guidelines/glsl/
I'm not sure what you mean by this. Do you want the whole operation moved into draw_curves.cc? Or do you want an option to switch between a compute shader and a "transform feedback" shader?
I do not have a Mac so I can't test behavior with metal. l can test this with strip primitives manually disabled, but it might be safer to disable the whole pipeline until it can be properly tested on that hardware.
add a new function to
GPU_index_buffer.h/cc
void GPU_indexbuf_build_curves_on_device(index_buf, prim_type, curves_num, verts_per_curve)
@ -516,0 +498,4 @@
: DRW_shader_curves_2d_points_get();
GPU_shader_bind(shader);
GPU_memory_barrier(GPU_BARRIER_BUFFER_UPDATE);
auto tstart = std::chrono::high_resolution_clock::now();
This is debug code and should eventually be removed. Note that Blender has its own developer debug timing functions in
PIL_time_utildefines.h
@ -516,1 +507,4 @@
GPU_indexbuf_bind_as_ssbo(ib, GPU_shader_get_ssbo_binding(shader, "out_indices"));
/* Maximum workgroup dimension is possibly platform-dependent, but OpenGL promises
* at least 65535 on each axis. We have to divide across 2 axes, because we may have
You can use
GPU_max_work_group_size(...)
to find actual device supported workgroup sizes.@ -49,2 +49,4 @@
/** True if buffer only contains restart indices. */
bool is_empty_ = false;
/** GPUUsageType from GPU_vertex_buffer.h; defaults to GPU_USAGE_STATIC */
uint8_t usage_hint_ = 1;
Why not use
GPUUsageType
andGPU_USAGE_STATIC
here.I would have to move GPUUsageType out of GPU_vertex_buffer.h and into some common header, maybe to GPU_common_types.h
You can do the same what
GPU_storage_buffer
does. Just include theGPU_vertex_buffer.h
inGPU_index_buffer.h
. I didn't test, but that should be sufficient.@ -516,0 +502,4 @@
GPUIndexBuf* arg = GPU_indexbuf_build_on_device(2);
uint size[2] = {curves.curves_num(), (uint)x_dim};
ib = GPU_indexbuf_build_on_device(element_count);
GPU_indexbuf_bind_as_ssbo(arg, GPU_shader_get_ssbo_binding(shader, "ncurves"));
Should use push_constants/or uniforms to send parameters to the shader.
@ -148,6 +150,22 @@ GPUShader *DRW_shader_draw_command_generate_get()
return e_data.draw_command_generate_sh;
}
GPUShader *DRW_shader_curves_2d_points_get()
These should become a builtin shaders in the gpu module.
See
eGPUBuiltinShader
@ -0,0 +10,4 @@
#include "gpu_shader_create_info.hh"
GPU_SHADER_CREATE_INFO(gpu_compute_2d_array_points)
.local_group_size(1,1,1)
Have you experiment with different local_group_sizes. IMO this should have a performance impact as the shaders are small.
In my tests, even for largest curves objects my system can handle (e.g 1 million curves and 64 vertexes per curve), this shader takes a fraction of a millisecond to execute as it is.
EDIT: looks like my timing information was wrong, because neither GPU_memory_barrier((eGPUBarrier)65535) nor GPU_fence_signal/GPU_fence_wait were actually forcing synchronization. I had to replace glWaitSync with glClientWaitSync in GLFence::wait() to get good timing.
When I measure it properly, it's not quite as fast as that, and raising local_group_size does help quite a bit.
@ -0,0 +11,4 @@
GPU_SHADER_CREATE_INFO(gpu_compute_2d_array_points)
.local_group_size(1,1,1)
.storage_buf(0, Qualifier::READ, "uint", "ncurves[]")
This should be two push_constants. No need to setup a buffer to transfer parameters to a shader. Push constants are faster as they are stored in more performant memory areas, then buffers.
Similar for the other shader
It was my impression from https://www.khronos.org/opengl/wiki/Compute_Shader that I have to use buffers to transfer data to compute shaders. (Though, be honest, my knowledge of shaders is extremely limited, I'm more familiar with CUDA.) If that is incorrect, I'll try to figure out how to make this work with uniforms.
d8e639bfbb
to2cf50c1a3f
2cf50c1a3f
to54fa49d50d
Given that 4.0 already requires compute shader support, I would remove the CPU fallback.
I would also remove
GPU_indexbuf_set_usage_hint
(this is a state modification function that is not easily supported on other backend) and just useGPU_indexbuf_init_build_on_device
with the compute shader.The patch is also missing formatting.
@ -244,0 +330,4 @@
if(grid_y <= max_grid_y) {
grid_z = 1;
} else {
grid_y = grid_z = (uint64_t)ceil(sqrt(double(grid_y)));
Use functional cast in C++.
@ -36,2 +36,2 @@
/* Sends data to GPU. */
glBufferData(GL_ELEMENT_ARRAY_BUFFER, size, data_, GL_STATIC_DRAW);
glBufferData(GL_ELEMENT_ARRAY_BUFFER, size, data_,
usage_hint_==0 ? GL_STREAM_DRAW : GL_STATIC_DRAW);
Run
make format
in blender git directory.@ -0,0 +2,4 @@
*
* SPDX-License-Identifier: GPL-2.0-or-later */
void main()
Would be nice if the shader files had a comment explaining what the output is (like the shape of the array of point and how it relates to the dispatch).
@ -0,0 +4,4 @@
void main()
{
/*
Remove pseudo code.
Add comments throughout the code if that helps the understanding of it.
@ -0,0 +18,4 @@
int y = int(gl_GlobalInvocationID.y) + int(gl_GlobalInvocationID.z)*int(gl_NumWorkGroups.y)*int(gl_WorkGroupSize.y);
int store_index = x + y*w;
if((x < w) && (y < max_curve)) {
if(x + 1 < w)
Use brackets even on single line if statements. See https://developer.blender.org/docs/handbook/guidelines/c_cpp/ .
Alternatively, you can use a ternary expression here since the code is quite small.
@ -0,0 +11,4 @@
GPU_SHADER_CREATE_INFO(gpu_compute_2d_array_points)
.local_group_size(16,16,1)
.push_constant(Type::IVEC2, "shape")
shape
is not very descriptive for constant. Even more so that you split the vector inside the shader afterwards. So better just split the push constant intocurves_num
andverts_per_curve
.54fa49d50d
to23d1af2ff5
23d1af2ff5
tod6b0bffbef
d6b0bffbef
tod1b217f1b6
d1b217f1b6
tob8a62567a9
b8a62567a9
tod27953c452
d27953c452
to96ca6ca967
I had to patch it for Metal, as it wasn't building, and took the opportunity to fix a few cosmetic stuff.
On my MacBook Pro M1 Max I get a speedup from 0.96 fps to 5.41fps which is quite impressive.
Huge props to you!
GPU_indexbuf optimization for large numbers of curvesto DRW: Curves: Indexbuf optimization for large numbers of curves