Highpoly mesh sculpting performance #68873

Pablo Dobarro · 2019-08-20T17:03:21+02:00

Pablo Dobarro commented

2019-08-20 17:03:21 +02:00

Updating smooth normals

BKE_pbvh_update_normals takes up 50% of the time in some cases. The use of atomics in pbvh_update_normals_accum_task_cb to add normals from faces to vertices is problematic. Ideally those should be avoided entirely, but it's not simple to do so. Possibilities:

Figure out which vertices are shared with other nodes, and only use atomics for those.
Store adjacency info for the entire mesh, and gather normals per vertex.
Store adjacent faces of all border vertices of a node, and avoid a global vertex normal array and atomics.
Use a vertex normal buffer per node with duplicated vertices, and somehow merge the border vertices in a second step.
...?

Smoothing tools may also be able to benefit from this, to work without the overhead of storing adjacency info of the entire mesh.

Depending on whether the brush tool needs normals, we could delay updating normals for nodes outside of the viewport.
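
As an illustration of the second possibility (store adjacency and gather per vertex), here is a minimal sketch. The `TriMesh` and `VertTriMap` types and all function names are hypothetical and are not Blender's data structures. It accumulates unnormalized, area-weighted triangle normals, which is a simplification of whatever weighting the real code uses.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical minimal triangle mesh, not Blender's Mesh struct. */
typedef struct TriMesh {
  int totvert, tottri;
  const float *co; /* 3 floats per vertex */
  const int *tri;  /* 3 vertex indices per triangle */
} TriMesh;

/* Vertex -> triangle adjacency in CSR form: the triangles using vertex v
 * are tris[offset[v]] .. tris[offset[v + 1] - 1]. */
typedef struct VertTriMap {
  int *offset;
  int *tris;
} VertTriMap;

static VertTriMap vert_tri_map_build(const TriMesh *m)
{
  VertTriMap map;
  map.offset = calloc((size_t)m->totvert + 1, sizeof(int));
  map.tris = malloc(sizeof(int) * 3 * (size_t)m->tottri);
  int *cursor = malloc(sizeof(int) * (size_t)m->totvert);
  assert(map.offset && map.tris && cursor);

  /* Count triangles per vertex, then prefix-sum into start offsets. */
  for (int i = 0; i < 3 * m->tottri; i++) {
    map.offset[m->tri[i] + 1]++;
  }
  for (int v = 0; v < m->totvert; v++) {
    map.offset[v + 1] += map.offset[v];
    cursor[v] = map.offset[v];
  }
  /* Scatter triangle indices into each vertex's slot range. */
  for (int i = 0; i < 3 * m->tottri; i++) {
    map.tris[cursor[m->tri[i]]++] = i / 3;
  }
  free(cursor);
  return map;
}

/* Unnormalized triangle normal (cross product of two edges). */
static void tri_normal(const TriMesh *m, int t, float r_no[3])
{
  const float *a = &m->co[3 * m->tri[3 * t]];
  const float *b = &m->co[3 * m->tri[3 * t + 1]];
  const float *c = &m->co[3 * m->tri[3 * t + 2]];
  const float e1[3] = {b[0] - a[0], b[1] - a[1], b[2] - a[2]};
  const float e2[3] = {c[0] - a[0], c[1] - a[1], c[2] - a[2]};
  r_no[0] = e1[1] * e2[2] - e1[2] * e2[1];
  r_no[1] = e1[2] * e2[0] - e1[0] * e2[2];
  r_no[2] = e1[0] * e2[1] - e1[1] * e2[0];
}

/* Gather: every vertex in [start, end) sums the normals of its own
 * triangles, so each output element has exactly one writer and no atomics
 * are needed. Normalization is left to a later step. */
static void vert_normals_gather(
    const TriMesh *m, const VertTriMap *map, int start, int end, float *r_no)
{
  for (int v = start; v < end; v++) {
    float sum[3] = {0.0f, 0.0f, 0.0f};
    for (int i = map->offset[v]; i < map->offset[v + 1]; i++) {
      float no[3];
      tri_normal(m, map->tris[i], no);
      sum[0] += no[0];
      sum[1] += no[1];
      sum[2] += no[2];
    }
    r_no[3 * v] = sum[0];
    r_no[3 * v + 1] = sum[1];
    r_no[3 * v + 2] = sum[2];
  }
}
```

The trade-off is the one noted above: the adjacency map costs memory for the whole mesh, and each triangle normal is recomputed once per corner unless face normals are cached first.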

Coherent memory access

Each PBVH node contains a subset of the mesh vertices and faces. These are not contiguous, so iterating over all vertices in a node leads to incoherent memory access.

Two things we can do here:

Ensure vertex and face indices in a node are at least sorted.
Reorder vertex and face indices in the original mesh, so that all vertices and faces unique to a node are stored next to each other in the global mesh array. This could also reduce PBVH memory usage, since not all indices would need to be stored then, only a range of unique vertices plus the indices of border vertices owned by other nodes.

For multires this is less of an issue since all vertices within a grid are in one block, though it might help a little bit to not allocate every grid individually and instead have one allocation per node.
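
The reordering idea can be sketched as a counting sort of vertices by owning node. This is only an illustration: `vert_node` (the index of the node that owns each vertex) and both function names are assumptions, not existing PBVH data.

```c
#include <assert.h>
#include <stdlib.h>

/* Group vertices by owning node with a counting sort.
 * vert_node[v] is the (hypothetical) index of the node that owns vertex v.
 * On return, r_old_to_new[v] is the new index of vertex v, and node n owns
 * the contiguous range r_node_start[n] .. r_node_start[n + 1] - 1.
 * r_node_start must have room for totnode + 1 entries. */
static void verts_group_by_node(const int *vert_node,
                                int totvert,
                                int totnode,
                                int *r_old_to_new,
                                int *r_node_start)
{
  assert(totnode > 0);
  int *cursor = malloc(sizeof(int) * (size_t)totnode);
  assert(cursor);

  for (int n = 0; n <= totnode; n++) {
    r_node_start[n] = 0;
  }
  /* Count vertices per node. */
  for (int v = 0; v < totvert; v++) {
    assert(vert_node[v] >= 0 && vert_node[v] < totnode);
    r_node_start[vert_node[v] + 1]++;
  }
  /* Prefix sum turns counts into start offsets. */
  for (int n = 0; n < totnode; n++) {
    r_node_start[n + 1] += r_node_start[n];
    cursor[n] = r_node_start[n];
  }
  /* Stable assignment, so indices within a node also stay sorted. */
  for (int v = 0; v < totvert; v++) {
    r_old_to_new[v] = cursor[vert_node[v]]++;
  }
  free(cursor);
}

/* Face corner indices have to be remapped to the new vertex order. */
static void indices_remap(int *indices, int totindex, const int *old_to_new)
{
  for (int i = 0; i < totindex; i++) {
    indices[i] = old_to_new[indices[i]];
  }
}
```

With this layout a node would only need to store its start and end offsets for unique vertices, plus an explicit index list for the border vertices owned by other nodes.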

Partial redraw

#70295 (Sculpt partial redraw not working)
For symmetry, the region covered by partial redraw can become arbitrarily big. We could add a test to see if one side is offscreen entirely. Other than that, we'd need to define multiple regions, which could then be used for culling the PBVH nodes. Drawing the viewport one time for each region is likely slow though, so it could be rendered at full viewport resolution but blit multiple separate regions.

Draw buffers

D5926: Sculpt: multithread GPU draw buffer filling (done, https://archive.blender.org/developer/D5926)
D5922: Sculpt: only update GPU buffers of PBVH nodes inside the viewport (done, https://archive.blender.org/developer/D5922)
Use GPU_vertbuf_raw_step to reduce the overhead of creating buffers in gpu_buffers.c; it is only used in a few places now (to do)
Submitting vertex buffers to the GPU has some overhead. It may be possible to do those copies asynchronously in the driver (to do)

Masks

Tagging PBVH nodes as fully masked would let us skip iterating over their vertices for sculpt tools. Drawing code could also avoid storing a buffer in this case, though the overhead of allocating/freeing that often may not be worth it.

Tagging PBVH nodes as fully unmasked would let us quickly skip drawing them as part of the overlay.

Masks are currently drawn in a separate pass as part of the overlays. It would be more efficient to draw them along with the original faces, so we can draw faces just once.
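
A minimal sketch of how such tags could be computed for one node. The flag names and the function are hypothetical, and it assumes a per-vertex float mask where 0 means unmasked and 1 means fully masked.

```c
#include <assert.h>

/* Hypothetical flag names, not existing PBVH node flags. */
enum {
  NODE_FULLY_MASKED = 1 << 0,
  NODE_FULLY_UNMASKED = 1 << 1,
};

/* Scan the vertices of one node and return which tags apply.
 * Assumes mask values in [0, 1], with 1 meaning fully masked. */
static int node_mask_flags(const float *mask, const int *vert_indices, int totvert)
{
  assert(mask && vert_indices);
  int flags = NODE_FULLY_MASKED | NODE_FULLY_UNMASKED;
  for (int i = 0; i < totvert; i++) {
    const float m = mask[vert_indices[i]];
    if (m < 1.0f) {
      flags &= ~NODE_FULLY_MASKED;
    }
    if (m > 0.0f) {
      flags &= ~NODE_FULLY_UNMASKED;
    }
    if (flags == 0) {
      break; /* Mixed node, nothing left to rule out. */
    }
  }
  return flags;
}
```

Sculpt tools would skip nodes tagged as fully masked, and the overlay would skip nodes tagged as fully unmasked. The tags would need to be recomputed whenever a mask tool touches the node.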

Consolidate vertex loops

There are various operations that loop over all vertices or faces. The sculpt brush operation, merging results for symmetry, bounding box updates, normal updates, draw buffer updates, etc.

Some of these may be possible to merge together, to reduce the overhead of threading and the cost of memory access and cache misses.

Bounding box frustum tests

Sculpt tools that take the frustum into account only use 4 clipping planes; we should add another plane to clip nodes behind the camera. Unlike drawing, this should not use clip end, and clip start should always be equal to 0.

Frustum - AABB intersection tests do not appear to be a bottleneck currently. But some possible optimizations here:

For inner nodes detected to be fully contained in the frustum, skip tests for all child nodes
Multithreaded tree traversal, these tests are single threaded in most cases now
Cache visibility of nodes per viewport
Use center + half size instead of min + max for storing bounding boxes
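
The first and last items can be sketched together: with a center + half size box, each plane test is one dot product plus a projected radius, and the result distinguishes fully inside from intersecting. All names below are hypothetical. Planes are assumed to be stored as 4 floats (normal, d) with the normal pointing into the frustum, so a point is inside when dot(normal, p) + d >= 0.

```c
#include <assert.h>

typedef enum FrustumIsect {
  ISECT_OUTSIDE = 0,
  ISECT_INTERSECT = 1,
  ISECT_INSIDE = 2,
} FrustumIsect;

/* Hypothetical bounding box layout: center + half size. */
typedef struct AABB {
  float center[3];
  float half_size[3];
} AABB;

static float absf(float x)
{
  return (x < 0.0f) ? -x : x;
}

/* planes: totplane * 4 floats, inward-facing normals (assumed convention). */
static FrustumIsect aabb_frustum_test(const AABB *bb, const float *planes, int totplane)
{
  assert(bb && planes && totplane >= 0);
  FrustumIsect result = ISECT_INSIDE;
  for (int p = 0; p < totplane; p++) {
    const float *pl = &planes[4 * p];
    /* Signed distance of the box center to the plane. */
    const float dist = pl[0] * bb->center[0] + pl[1] * bb->center[1] +
                       pl[2] * bb->center[2] + pl[3];
    /* Extent of the box projected onto the plane normal. */
    const float radius = bb->half_size[0] * absf(pl[0]) +
                         bb->half_size[1] * absf(pl[1]) +
                         bb->half_size[2] * absf(pl[2]);
    if (dist < -radius) {
      return ISECT_OUTSIDE; /* Entirely behind one plane. */
    }
    if (dist < radius) {
      result = ISECT_INTERSECT; /* Straddles this plane. */
    }
  }
  return result;
}
```

When an inner node returns `ISECT_INSIDE`, traversal could accept all of its children without testing them. The test is conservative: a box near a frustum corner can be reported as intersecting even though it is outside.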

Threading

It may be worth testing if the current settings for BLI_parallel_range_settings_defaults are still optimal. Maybe the node limit can be removed, the chunk size could be reduced or increased, or scheduling could be dynamic instead of static.

This has now been changed to remove the node limit and use dynamic scheduling with chunk size 1, which gave about a 10% performance improvement. For a high number of nodes it may be worth increasing the chunk size.
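
To make the chunk size trade-off concrete, here is a minimal sketch of dynamic scheduling with a shared atomic cursor. This is not the BLI task scheduler; `RangeScheduler` and its functions are hypothetical, and it uses C11 `<stdatomic.h>`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical dynamic range scheduler: worker threads repeatedly claim
 * the next chunk of node indices from a shared atomic cursor. */
typedef struct RangeScheduler {
  atomic_int next;
  int total;
  int chunk_size;
} RangeScheduler;

static void range_scheduler_init(RangeScheduler *s, int total, int chunk_size)
{
  assert(total >= 0 && chunk_size > 0);
  atomic_init(&s->next, 0);
  s->total = total;
  s->chunk_size = chunk_size;
}

/* Each worker calls this in a loop until it returns false.
 * On success, [*r_start, *r_end) is a range owned by the caller alone. */
static bool range_scheduler_next(RangeScheduler *s, int *r_start, int *r_end)
{
  const int start = atomic_fetch_add(&s->next, s->chunk_size);
  if (start >= s->total) {
    return false;
  }
  const int end = start + s->chunk_size;
  *r_start = start;
  *r_end = (end < s->total) ? end : s->total;
  return true;
}
```

With chunk size 1 every node costs one atomic operation, which balances load well when nodes have uneven cost. A larger chunk size reduces contention on the cursor when there are many cheap nodes, at the price of coarser balancing.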

Symmetry

For X symmetry we currently do 2 loops over all vertices, and then do another loop to merge them. These 3 could perhaps be merged into one loop, though the code might become significantly more complicated, as every brush tool may need code to handle symmetry.

Low level optimizations

Overall, this kind of optimization requires carefully analyzing code that runs per mesh element, and trying to make it faster.

Sculpt tools support many settings, and the number of function calls, conditionals and pointer dereferences adds up. It can be worth testing what happens when most of the code is removed, to see how much overhead there is.

It can help to copy some commonly used variables onto the stack in functions, so that they can stay in registers and pointer aliasing is avoided. Tests that check multiple variables could be precomputed and the result stored in a bitflag.
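
A small sketch of both ideas, using a hypothetical `BrushSettings` struct and flag names rather than the real brush and sculpt session structs:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical settings, standing in for data that the real code reaches
 * through several levels of pointers. */
typedef struct BrushSettings {
  bool use_mask;
  bool invert;
  float strength;
} BrushSettings;

enum {
  BRUSH_TEST_MASK = 1 << 0,
  BRUSH_TEST_INVERT = 1 << 1,
};

/* Evaluate the per-stroke conditions once and pack them into a bitflag. */
static unsigned int brush_test_flags(const BrushSettings *settings)
{
  unsigned int flags = 0;
  if (settings->use_mask) {
    flags |= BRUSH_TEST_MASK;
  }
  if (settings->invert) {
    flags |= BRUSH_TEST_INVERT;
  }
  return flags;
}

static void brush_apply(const BrushSettings *settings, const float *mask, float *value, int totvert)
{
  /* Copy to locals before the loop: the compiler no longer has to assume
   * that writes to value[] might change the settings. */
  const unsigned int flags = brush_test_flags(settings);
  const float strength = settings->strength;
  assert(value && (!(flags & BRUSH_TEST_MASK) || mask));

  for (int i = 0; i < totvert; i++) {
    float fade = strength;
    if (flags & BRUSH_TEST_MASK) {
      fade *= 1.0f - mask[i];
    }
    if (flags & BRUSH_TEST_INVERT) {
      fade = -fade;
    }
    value[i] += fade;
  }
}
```

Whether this helps in practice depends on what the compiler already hoists out of the loop, so it is worth checking with a profiler.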

More functions can be inlined in some cases. For example bmesh iterators used for dyntopo go through function pointers and function calls, while they really can be a simple double loop over chunks and the elements within the chunks.

PBVH building

Building the PBVH is not the most performance critical since it only happens when entering sculpt mode, but there is room for optimization anyway. The most obvious one is multithreading.

Brush radius bounds

Culling of nodes outside the brush radius is disabled for 2D Falloff:

bool sculpt_search_circle_cb(PBVHNode *node, void *data_v)
{
  ...
  return dist_sq < data->radius_squared || 1;
}

Elastic Deform has no bounds, but it may be possible to compute some even if they are bigger than the brush radius.

Memory allocations for all vertices

Some sculpt tools allocate arrays the size of all vertices for temporary data. For operations that are local, it would be better to allocate arrays per PBVH node when possible.

In some cases this might make little difference, since virtual memory pages may only be mapped on demand once there are actual reads/writes (though this is not obviously guaranteed for all allocators and operating systems?).

Also regarding coherent memory access, this could improve performance, if vertices are grouped per node as described above.

Undo

Undo pushes all nodes whose bounding boxes are within the brush radius. However, that doesn't mean any vertices in that node are actually affected by the brush. In a simple test painting on a sphere, it pushed e.g. 18 nodes but only actually modified 7.

We can reduce undo memory by delaying the undo push until we know any vertices within the node are about to be modified, though this may have a small performance impact. Ideally this would take into account both the brush radius test and masking/textures.

Similarly, we also sometimes call BKE_pbvh_node_mark_redraw or BKE_pbvh_node_mark_normals_update for nodes without checking if any vertices within have actually been modified.
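
A minimal sketch of the delayed push, with a hypothetical `Node` struct and a counter standing in for the real undo push. The brush here just offsets vertices along Z inside a sphere.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical node, not the real PBVHNode. */
typedef struct Node {
  const int *vert_indices;
  int totvert;
  bool undo_pushed;
} Node;

/* Stand-in for the real undo push, which would copy the node's original
 * data. Here it only counts how many nodes were pushed. */
static void node_undo_push(Node *node, int *r_push_count)
{
  node->undo_pushed = true;
  (*r_push_count)++;
}

/* Apply a toy brush to one node. The undo push is delayed until the first
 * vertex passes the radius test, and happens before that vertex is written.
 * Returns true if any vertex was modified. */
static bool brush_apply_to_node(Node *node,
                                float *co, /* 3 floats per vertex */
                                const float center[3],
                                float radius,
                                float offset_z,
                                int *r_push_count)
{
  assert(node && co && r_push_count);
  const float radius_sq = radius * radius;
  bool modified = false;
  for (int i = 0; i < node->totvert; i++) {
    float *v = &co[3 * node->vert_indices[i]];
    const float d[3] = {v[0] - center[0], v[1] - center[1], v[2] - center[2]};
    if (d[0] * d[0] + d[1] * d[1] + d[2] * d[2] >= radius_sq) {
      continue;
    }
    if (!modified) {
      node_undo_push(node, r_push_count);
      modified = true;
    }
    v[2] += offset_z;
  }
  return modified;
}
```

The return value could also gate the calls that mark a node for redraw or normals update, so untouched nodes are skipped there as well. A real version would extend the per-vertex test with masking and textures.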

Pablo Dobarro commented

2019-08-20 17:03:21 +02:00

Added subscriber: @PabloDobarro

Brecht Van Lommel commented

2019-08-20 17:21:01 +02:00

Added subscriber: @brecht

Brecht Van Lommel commented

2019-08-20 17:21:01 +02:00

The point of the PBVH is to be able to do partial updates quickly. If doing many partial updates is somehow significantly slower than updating the mesh as a whole, that is something to be fixed. There is no good reason for it to be slower.

The solution should not be to take some separate code path that updates the mesh as a whole, but rather fixing the bottleneck in the partial updates.

Erick Tukuniata commented

2019-08-20 22:42:30 +02:00

Added subscriber: @ErickNyanduKabongo

item412 commented

2019-08-26 21:34:24 +02:00

Added subscriber: @item412

Christina McKay commented

2019-09-04 18:51:12 +02:00

Added subscriber: @CMC

Tiago Cruz commented

2019-09-04 19:29:17 +02:00

Added subscriber: @tiagoffcruz

blender-admin commented

2019-09-29 01:38:44 +02:00

This issue was referenced by c931a0057ffea26175a2dc111718e5f3590b00f8

reza sanjaya commented

2019-09-29 14:52:08 +02:00

Added subscriber: @ReguzaEi

Brecht Van Lommel commented

2019-09-29 16:24:56 +02:00

Some profiles from a 3 million poly mesh after the latest optimizations.

Running single threaded with -t 1. The multithreaded one is not as readable as a screenshot, but the hotspots are similar.

Large draw brush. Bottleneck is mainly the sculpting itself, with symmetry here.

Mesh filter. Clearly the normal update is the problem here. Not using atomics there makes it 2x faster overall, but can then give wrong results.

The impact of incoherent memory access is not possible to see in profiles like this, but it's probably worth trying to hack together some code for that and evaluate how much it helps, and then see if it's worth implementing properly.

Attached profiles: sculpt_perf_large_brush.png (https://archive.blender.org/developer/F7779796/sculpt_perf_large_brush.png), sculpt_perf_filter.png (https://archive.blender.org/developer/F7779793/sculpt_perf_filter.png)

Alberto Velázquez commented

2019-10-04 23:50:23 +02:00

Added subscriber: @AlbertoVelazquez

Joseph Brandenburg commented

2019-10-10 04:18:10 +02:00

Added subscriber: @Josephbburg

s12a commented

2019-10-26 13:47:47 +02:00

Added subscriber: @s12a

Paweł Łyczkowski commented

2020-01-03 14:44:40 +01:00

Added subscriber: @PawelLyczkowski-1

clinton oragwu commented

2020-01-24 15:38:28 +01:00

Added subscriber: @ClinToch

Tom Musgrove commented

2020-01-31 18:10:21 +01:00

Added subscriber: @TomMusgrove

Tom Musgrove commented

2020-01-31 18:10:21 +01:00

@PabloDobarro - another performance suggestion for sculpt/paint is to maintain a lower resolution version of what you are working on that is updated and rendered immediately as the stroke occurs; then the stroke is applied to the higher resolution version of the mesh/image in separate threads, and the results are rendered and replace the low res rendering as they are completed. This can reduce the amount of mesh and image data kept in memory, or allow meshes/images that would greatly exceed memory, and allow compression of the parts of the mesh/image not in use.

The lower res object and image data can use about 1/4 to 1/8 the memory of the full object and images (or even drastically less for large images that are zoomed out), and then only the chunks of mesh data and image data that are actively being changed need to be kept in memory. Which chunks are needed is fairly predictable based on stroke direction, so loading and unloading them shouldn't introduce lag.

Bataev Artem commented

2020-02-01 17:43:59 +01:00

Added subscriber: @ArtemBataev