GPU: Mesh Drawing Performance #87835
Blender performance when working with huge meshes can be improved.
Here are some ideas and reasoning.
- Selection: how much work is wasted when rebuilding the selection data? Could we rearrange the VBOs to reduce this overhead?
- Profile with meshes of different sizes and common tasks. Add performance test cases for the common tasks.
Move the display normals to the draw module:
After reevaluating the modifier stack, the display normals are updated (depsgraph update). The drawing code could make better decisions if this happened as part of the draw module. To calculate the display normals, a reverse lookup structure is built. This structure isn't kept around; performance could be improved when the geometry doesn't change between recalculations.
Other buffers could also use this data (the adjacency IBO, for example). Does Cycles use the display normals? If not, we could eliminate them from the DNA/RNA.
Use data structures optimized for streaming in MeshRenderData (do not look up polys inside a loop). Reduce cache misses by storing data in arrays and only allowing sequential access.
Normals are precalculated, but this uses additional memory, which can hurt performance (L2 cache pressure). Check whether calculating them in the inner loop is faster.
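A minimal sketch of the sequential-access idea (hypothetical names, not the actual MeshRenderData layout): the hot extraction loop reads two flat arrays front to back, with no per-corner polygon lookup, so the hardware prefetcher can keep up:

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Streaming-friendly VBO extraction: corner -> vertex index and
// vertex -> position are flat arrays walked strictly sequentially.
std::vector<float> extract_corner_positions(
    const std::vector<uint32_t> &corner_verts,  // one entry per face corner
    const std::vector<std::array<float, 3>> &positions)
{
  std::vector<float> vbo;
  vbo.reserve(corner_verts.size() * 3);
  for (const uint32_t v : corner_verts) {  // sequential read, one indirection
    const std::array<float, 3> &p = positions[v];
    vbo.push_back(p[0]);
    vbo.push_back(p[1]);
    vbo.push_back(p[2]);
  }
  return vbo;
}
```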
Split edit mode/object mode cache: currently the edit mode cache and the object mode cache reuse the same memory location, and the logic branches off when constructing the VBO/IBO. Splitting them should improve code quality as well as the inner loops. Expected tiny speedup; less branching between Mesh and BMesh evaluation.
Migrate to C++ and reduce branching by using template functions and classes.
Can we use compute shaders to convert the MeshRenderData? That way we wouldn't need to upload all the data from RAM to the GPU.
We need to research the data transfer before and after such a change. The hair IBO is actually a simple formula; there is no need to build it on the CPU.
From my own tests, uploading data to the GPU is currently the main bottleneck when transforming geometry, see: #88021.
Split edit mode/object mode cache
This seems worth doing early on, it should make code more maintainable.
Partial updates might be worth exploring:
- Only update modified data types: initially this could be limited to deforming vertices and changing selection & UVs.
- Only update modified data: e.g. only send the positions of vertices that are being transformed. This could increase code complexity significantly, so it may not be worth doing early on.
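A minimal sketch of a coarse variant of this idea (hypothetical names; assumes a tightly packed 3-float position VBO): accumulate the modified vertices into one contiguous range and upload only that slice, e.g. via glBufferSubData, instead of re-sending the whole buffer:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Tracks the contiguous range of vertices touched since the last upload.
struct DirtyVertRange {
  uint32_t first = UINT32_MAX;
  uint32_t last = 0;  // inclusive

  void mark(uint32_t vert_index)
  {
    first = std::min(first, vert_index);
    last = std::max(last, vert_index);
  }

  bool empty() const { return first > last; }

  // Byte offset/size of the slice to re-upload, assuming a tightly
  // packed 3-float position attribute.
  size_t upload_offset() const { return size_t(first) * 3 * sizeof(float); }
  size_t upload_size() const { return size_t(last - first + 1) * 3 * sizeof(float); }
};
```

A real implementation would likely track multiple ranges (or per-segment flags) so that two far-apart edits don't force re-uploading everything in between.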
The data layout could also be optimized; I recall @fclem mentioning we could avoid uploading vertex coordinates multiple times, for example.
Hello, I don't know if this is the right place for my comment, but Blender is massively slow when trying to select an object in the viewport.
The more objects there are in the scene, and the more polygons they have, the slower it gets. Our typical scenes have >10,000 objects and >20 million polygons. That is not extreme.
We use a very old 3D program that hasn't been updated since 2012 to select and shade all the objects in our scenes, because in Blender it is not possible: you wait 5-10 seconds after clicking on an object in the viewport until it gets selected.
In other packages selection is instant. That program handles more than 100 million polygons with instant selection with ease.
I hope this behavior can be fixed now that you are improving the mesh drawing performance!
When entering edit mode and transforming a single vertex, the following batches are recalculated:
ibo.tris, ibo.points, vbo.pos_nor, vbo.lnor, vbo.edit_data
- ibo.tris: sorts the triangles by material by looping twice, once to count and a second time to assign. Hidden faces are not added. The implementation is single-threaded.
- ibo.points: a single loop; hidden verts are not added. The implementation is single-threaded.
- vbo.edit_data: updates flags (vert, edge, crease and weight).
Unchanged buffers are:
Assuming a high poly count, num_tris * 3 * sizeof(int) + num_vert * sizeof(int) + num_vert * sizeof(int) bytes are recalculated and resent.
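As a worked example of the formula above (hypothetical mesh sizes, 4-byte ints assumed), a 1M-triangle, 500k-vertex mesh resends roughly 16 MB per edit:

```cpp
#include <cstddef>

// Bytes recalculated and re-sent per edit, per the formula above.
size_t resend_bytes(size_t num_tris, size_t num_vert)
{
  return num_tris * 3 * sizeof(int)  // ibo.tris
       + num_vert * sizeof(int)      // ibo.points
       + num_vert * sizeof(int);     // vbo.edit_data flags
}

// resend_bytes(1'000'000, 500'000) == 16'000'000 with 4-byte ints,
// i.e. ~16 MB uploaded for moving a single vertex.
```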
vbo.edit_data uses threading.
Hi there, random question: you mention in the description that "To calculate the display normals a reverse lookup structure is built. This structure isn't kept around. Performance could be improved when geometry doesn't change between recalc". Where is this exactly?
For selection buffers: we in the sculpt module are planning to make the BMesh PBVH available as a general-purpose API (separate from sculpt mode and DynTopo). It could potentially replace GPU selection picking entirely while providing a pretty significant boost to drawing performance (the PBVH stores drawing buffers in its leaves, which can be selectively updated, so you're not sending an entire mesh to the GPU just because you moved a single vertex).
Btw, if you don't want to use the PBVH for drawing, pretty much any mesh segmentation method will work. You can assign triangles to drawing buffers randomly, or even in index order, and it'll still be a lot faster. The idea is to associate mesh elements with segments so you only update the segments that have changed.
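A minimal sketch of that idea (a hypothetical structure, not the PBVH API): triangles are bucketed into fixed-size segments by index order, each segment owning its own draw buffer, and moving a vertex only dirties the segments whose triangles reference it:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct SegmentedMesh {
  uint32_t tris_per_segment;
  std::vector<std::array<uint32_t, 3>> tris;  // vertex indices per triangle
  std::vector<bool> segment_dirty;            // one re-upload flag per segment

  SegmentedMesh(std::vector<std::array<uint32_t, 3>> t, uint32_t per_seg)
      : tris_per_segment(per_seg),
        tris(std::move(t)),
        segment_dirty((tris.size() + per_seg - 1) / per_seg, false)
  {
  }

  // Flag only the segments containing triangles that use vertex `v`.
  // (A real implementation would use a vertex -> segment map instead
  // of scanning all triangles.)
  void vertex_moved(uint32_t v)
  {
    for (size_t t = 0; t < tris.size(); t++) {
      if (tris[t][0] == v || tris[t][1] == v || tris[t][2] == v) {
        segment_dirty[t / tris_per_segment] = true;
      }
    }
  }
};
```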
Has any thought been given to occlusion culling or LOD/adaptive tessellation?
Uploading geometry to the GPU is expensive, but so is drawing all of it even when it isn't contributing much, if anything.
Render the screen smaller / use AMD's FidelityFX Super Resolution (FSR) now that it's open source?
LODs need to be preprocessed and would not work with GPU selection.
Even with hardware tessellation support (OpenGL 4.0+), how would adaptive tessellation work with complex objects like the ones Blender supports? Preprocessing this would be a CPU job and would slow things down exactly in the areas where the user is working.
As you already say, uploading geometry is the issue, not rendering the uploaded geometry; FSR would only help with the latter.
I have an idea to experiment with FSR, but that is to increase the detail in icon/button preview rendering.
What about distance and occlusion culling for the viewport?
Sometimes very dense mesh objects are drawn when they don't need to be. Such as a 100k polygon diamond ring, sitting on the finger of a human character, roughly 1km away from the camera.
Or a finely detailed piano sitting in the next room over behind a wall (as part of a sweeping camera move where it will eventually come into view but hasn't yet).
A viewport optimisation setting could let users cull any object whose distance and bounding sphere diameter mean it would not cover more than Y pixels on screen. A default of 2, for example, could be enough to cull anything that would be virtually invisible anyway. Users with more aggressive culling needs could raise that number as high as they need.
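The screen-coverage test could look like this (a sketch, assuming a perspective projection and a bounding-sphere diameter in world units; all names are hypothetical):

```cpp
#include <cmath>

// Approximate on-screen pixel diameter of an object's bounding sphere.
// `fov_y` is the vertical field of view in radians, `viewport_h` the
// viewport height in pixels. The object becomes a culling candidate
// when the result drops below the user threshold (e.g. 2 pixels).
double projected_pixel_diameter(double sphere_diameter,
                                double distance,
                                double fov_y,
                                double viewport_h)
{
  // Pixels per world unit at `distance` for a perspective projection.
  const double pixels_per_unit =
      viewport_h / (2.0 * distance * std::tan(fov_y / 2.0));
  return sphere_diameter * pixels_per_unit;
}
```

For the diamond-ring example: a 2 cm object 1 km away with a 90° FOV and a 1080-pixel-high viewport covers far less than one pixel, so it would be culled at any reasonable threshold.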
Occlusion culling could be implemented with the technique "Hierarchical-Z map based occlusion culling".
For reference, description of the approach here: https://rastergrid.com/blog/2010/10/hierarchical-z-map-based-occlusion-culling/
It has the benefit of requiring very little preprocessing or additional work, it's GPU based, it can in fact be almost automatic with the right approach, or manually tweaked for even better performance.
In short: render a depth-only, low-resolution pass of only the 'large' objects in the scene, build from it a tile map of the most distant occluder depth per tile, and use that to conservatively reject occluded objects before rendering them.
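A CPU-side sketch of the final test (hypothetical structure; assumes a conventional depth buffer where larger depth means farther from the camera, and one "farthest occluder depth" value per tile):

```cpp
#include <vector>

// Coarse tile map built from the low-resolution occluder depth pass.
struct TileDepthMap {
  int tiles_x, tiles_y;
  std::vector<float> tile_depth;  // tiles_x * tiles_y farthest depths

  // An object is conservatively occluded when the nearest point of its
  // bounding volume is farther than every occluder depth in the tiles
  // its screen-space bounds touch.
  bool occluded(int min_tx, int min_ty, int max_tx, int max_ty,
                float object_nearest_depth) const
  {
    for (int ty = min_ty; ty <= max_ty; ty++) {
      for (int tx = min_tx; tx <= max_tx; tx++) {
        if (object_nearest_depth <= tile_depth[ty * tiles_x + tx]) {
          return false;  // object may be visible through this tile
        }
      }
    }
    return true;
  }
};
```

The Hi-Z variant in the linked article does the same comparison on the GPU against a mip chain of the depth buffer, so the tile lookup becomes a single texture fetch at the right mip level.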
Or alternatively, allow users to manually specify which objects are occluders, so they can create low-poly shapes to represent the silhouette of highly detailed objects, or a low-poly shape of, for example, the wall structure of a large archviz scene, culling most of the objects in the scene that aren't directly visible.
This could also be used to cull things other than objects, such as lights in Eevee.
+1 for occlusion culling
About LOD: 'weld vertex' + rounding vertices to a grid, where the grid size changes based on distance from the camera, seems to be a really fast LOD even on the CPU.
This solution mostly applies to things like terrain.
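A minimal sketch of that grid weld (hypothetical names; keeps the first vertex that lands in each cell and returns a remap table for rewriting index buffers):

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

struct WeldResult {
  std::vector<std::array<float, 3>> positions;  // welded vertices
  std::vector<uint32_t> remap;                  // old index -> welded index
};

// Snap each vertex to a grid of `cell_size` (chosen from camera distance)
// and merge all vertices that land in the same cell.
WeldResult grid_weld(const std::vector<std::array<float, 3>> &verts,
                     float cell_size)
{
  WeldResult out;
  out.remap.resize(verts.size());
  std::map<std::array<int64_t, 3>, uint32_t> cells;
  for (size_t i = 0; i < verts.size(); i++) {
    const std::array<int64_t, 3> cell = {
        int64_t(std::floor(verts[i][0] / cell_size)),
        int64_t(std::floor(verts[i][1] / cell_size)),
        int64_t(std::floor(verts[i][2] / cell_size))};
    auto [it, inserted] = cells.try_emplace(cell, uint32_t(out.positions.size()));
    if (inserted) {
      out.positions.push_back(verts[i]);  // first vertex in this cell wins
    }
    out.remap[i] = it->second;
  }
  return out;
}
```

Picking a larger `cell_size` for objects farther from the camera yields progressively coarser meshes; triangles whose three remapped indices collapse to fewer than three distinct values can then be dropped.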