GPU: Mesh Drawing Performance #87835

Closed
opened 2021-04-26 16:48:14 +02:00 by Jeroen Bakker · 32 comments
Member

Blender performance when working with huge meshes can be improved.
Here are some ideas and reasoning.

Research Topics

  • Selection: how much is wasted when rebuilding selection. Could we rearrange the VBOs to have less overhead?
  • Profile with meshes of different sizes and common tasks. Add performance test cases for the common tasks.

Technical tasks

  • Move the display normals to draw module:
    After reevaluating the modifier stack the display normals are updated (depsgraph update). The drawing code could make better decisions if this were done as part of the draw module. To calculate the display normals a reverse lookup structure is built. This structure isn't kept around. Performance could be improved when geometry doesn't change between recalculations.
    Other buffers can also use this data (adjacency IBO for example). Does Cycles use the display normals? If not, we could eliminate it from the DNA/RNA.

  • Use data structures optimized for data streaming in MeshRenderData (do not look up polys inside a loop). Reduce cache misses by storing data in arrays and only allowing sequential access.

  • Normals are precalculated, but this uses additional memory, which can reduce performance (L2 caches). Check whether calculating them in the inner loop is faster.

  • Split edit mode/object mode cache: currently the edit mode cache and the object mode cache reuse the same memory location. When constructing the VBO/IBO the logic branches off. Splitting them would likely improve the code quality and also the inner loops. Expected tiny speedup from less branching between Mesh and BMesh evaluation.

  • Migrate to CPP and reduce branching by using template functions and classes.

  • Can we use compute shaders to convert the MeshRenderData? This way we don't need to upload all the data from RAM to the GPU.
    We need to research the data transfer before and after such a change. The hair IBO is actually a simple formula. No need to do it on the CPU.
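
As an illustration of the "reverse lookup structure isn't kept around" point, here is a minimal sketch of caching a vertex-to-faces map and rebuilding it only when the topology changes. This is not Blender's implementation: `ReverseLookupCache`, `build_vert_to_faces` and the `topology_version` counter are hypothetical names, and Python is used only to keep the sketch short.

```python
from collections import defaultdict


def build_vert_to_faces(faces):
    """Map each vertex index to the list of face indices that use it."""
    vert_to_faces = defaultdict(list)
    for face_index, face in enumerate(faces):
        for vert in face:
            vert_to_faces[vert].append(face_index)
    return dict(vert_to_faces)


class ReverseLookupCache:
    """Hypothetical cache: keeps the map around between recalculations."""

    def __init__(self):
        self._version = None
        self._map = None
        self.rebuilds = 0  # only here to make the effect observable

    def get(self, faces, topology_version):
        # topology_version is an assumed counter that the caller bumps
        # whenever faces are added, removed or rewired.
        if topology_version != self._version:
            self._map = build_vert_to_faces(faces)
            self._version = topology_version
            self.rebuilds += 1
        return self._map
```

With such a cache, a pure deformation (same `topology_version`) would reuse the map, and other consumers such as an adjacency buffer could read the same structure.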

Jeroen Bakker self-assigned this 2021-04-26 16:48:15 +02:00
Author
Member

Added subscribers: @Jeroen-Bakker, @ideasman42

Added subscriber: @JacobMerrill-1

Added subscriber: @TheRedWaxPolice

Added subscriber: @elias.andersson92

Member

Added subscriber: @JorgeBernalMartinez

Added subscriber: @kioku

Added subscriber: @GeorgiaPacific

Added subscriber: @fclem

From my own tests, uploading data to the GPU is currently the main bottleneck when transforming geometry, see: #88021.


Regarding "Split edit mode/object mode cache":

This seems worth doing early on, as it should make the code more maintainable.


  • Partial updates might be worth exploring:

    • Only update modified data-types: initially this could be limited to deforming vertices, changing selection & UVs.
    • Only update modified data: for example, only send positions of vertices that are being transformed. This could increase code complexity significantly, so it may not be worth doing early on.
  • The data layout could be optimized. I recall @fclem mentioning that we could avoid uploading vertex coordinates multiple times, for example.
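
To make the "only update modified data" idea concrete, here is a rough sketch that merges the indices of transformed vertices into contiguous ranges and copies only those ranges. The plain Python list standing in for a GPU vertex buffer, and the names `dirty_ranges` and `upload_partial`, are assumptions made for illustration; a real implementation would issue one sub-buffer update per range.

```python
def dirty_ranges(modified_indices, max_gap=0):
    """Merge modified vertex indices into (start, end_exclusive) ranges.

    Ranges separated by at most max_gap untouched vertices are merged,
    trading a few redundant vertices for fewer upload calls.
    """
    ranges = []
    for index in sorted(set(modified_indices)):
        if ranges and index - ranges[-1][1] <= max_gap:
            ranges[-1][1] = index + 1
        else:
            ranges.append([index, index + 1])
    return [tuple(r) for r in ranges]


def upload_partial(gpu_buffer, positions, ranges):
    """Copy only the dirty ranges; returns the number of vertices sent."""
    uploaded = 0
    for start, end in ranges:
        gpu_buffer[start:end] = positions[start:end]
        uploaded += end - start
    return uploaded
```

The `max_gap` parameter reflects the complexity trade-off mentioned above: many tiny uploads can cost more than one slightly larger one.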

Added subscriber: @warcanin

Added subscriber: @machieb

Hello, I don't know if this is the right place for my statement, but Blender is massively slow when trying to select an object in the viewport.
The more objects there are in the scene and the more polygons they have, the slower it gets. Our typical scenes have >10000 objects and >20 million polygons. That is not extreme.
We use a very old 3D program that has not been updated since 2012 to select and shade all our objects in the scene, because in Blender it is not possible. In Blender you wait 5-10 sec after clicking on an object in the viewport until it gets selected.
In the other package selection is instant. That program can handle more than 100 million polygons and instant selection with ease.
I hope this behavior can be solved now that you are trying to improve the mesh drawing performance!
Thanks

Author
Member

This comment was removed by @Jeroen-Bakker

Author
Member

Transforming verts.

When going to edit mode and transforming a single vert, the following batches are recalculated:

ibo.tris
ibo.points
vbo.pos_nor
vbo.lnor
vbo.edit_data

ibo.tris sorts the triangles by material by looping twice: once to count and a second time to assign. Hidden faces are not added. The implementation is single threaded.
ibo.points uses a single loop and does not add hidden verts. The implementation is single threaded.
vbo.edit_data updates flags (vert, edge, crease and weight).
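
The count-then-assign approach described for ibo.tris is essentially a counting sort. A minimal single-threaded sketch, with hypothetical input arrays (`tri_materials`, `tri_hidden`) rather than Blender's actual data structures:

```python
def build_tris_ibo(tri_materials, tri_hidden, material_count):
    """Group visible triangle indices by material using two passes."""
    # Pass 1: count the visible triangles of each material.
    counts = [0] * material_count
    for material, hidden in zip(tri_materials, tri_hidden):
        if not hidden:
            counts[material] += 1

    # Prefix sum: where each material's block starts in the output.
    offsets = [0] * material_count
    running = 0
    for material in range(material_count):
        offsets[material] = running
        running += counts[material]

    # Pass 2: assign each visible triangle to its material's block.
    sorted_tris = [None] * running
    cursor = list(offsets)
    for tri_index, (material, hidden) in enumerate(zip(tri_materials, tri_hidden)):
        if not hidden:
            sorted_tris[cursor[material]] = tri_index
            cursor[material] += 1
    return sorted_tris, offsets
```

Note that neither pass depends on vertex positions, which is why the result is identical before and after moving a vert.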

Of these, the following buffers are rebuilt even though their content does not change:

  • ibo.tris
  • ibo.points
  • vbo.edit_data

Assuming a high poly count, num_tris * 3 * sizeof(int) + num_vert * sizeof(int) + num_vert * sizeof(int) bytes are recalculated and resent unnecessarily. vbo.edit_data uses threading.
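
To get a feel for the size of that estimate, the formula can be evaluated for a concrete mesh. The mesh size below (2 million triangles, 1 million verts) and the 4-byte `int` are assumptions chosen for illustration, not measurements from this task.

```python
SIZEOF_INT = 4  # assumption: 32-bit int


def redundant_upload_bytes(num_tris, num_vert):
    """Bytes rebuilt and resent for ibo.tris, ibo.points and vbo.edit_data."""
    ibo_tris = num_tris * 3 * SIZEOF_INT
    ibo_points = num_vert * SIZEOF_INT
    vbo_edit_data = num_vert * SIZEOF_INT
    return ibo_tris + ibo_points + vbo_edit_data


# Hypothetical mesh: 2 million triangles, 1 million verts.
print(redundant_upload_bytes(2_000_000, 1_000_000) / 1e6, "MB")
```

For this hypothetical mesh that comes to 32 MB per update, 24 MB of which is the triangle index buffer.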

Callgraph when run single threaded: image.png (https://archive.blender.org/developer/F10087908/image.png)

Added subscriber: @easythrees

Added subscriber: @ArtemBataev

Hi there, random question. You mention in the description that "To calculate the display normals a reverse lookup structure is build. This structure isn't kept around. Performance could be improved when geometry doesn't change between recalc", where is this exactly?

Added subscriber: @rjg

Member

Added subscriber: @JosephEagar

Member

For selection buffers, we in the sculpt module are planning to make bmesh PBVH available as a general-purpose API (separate from sculpt mode and DynTopo). It could potentially replace gpu selection picking entirely while providing a pretty significant boost to drawing performance (PBVH stores drawing buffers in the leaves which can be selectively updated, so you're not sending an entire mesh to the GPU just because you moved a single vertex).

Btw if you don't want to use PBVH for drawing, pretty much any mesh segmentation method will work. You can assign triangles to drawing buffers randomly or even in indexed order and it'll still be a lot faster. The idea is to associate mesh elements with segments so you only update the segments that have changed.
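
A small sketch of the segmentation idea, using the simplest scheme mentioned above (triangles assigned to segments in index order). `SegmentedMesh` and its methods are hypothetical names, and the "upload" step is reduced to reporting which segments would need to be resent.

```python
class SegmentedMesh:
    """Associates triangles with fixed-size segments, in index order."""

    def __init__(self, triangles, segment_size):
        self.segment_size = segment_size
        self.segment_count = (len(triangles) + segment_size - 1) // segment_size
        # For every vertex, remember which segments contain a triangle using it.
        self.vert_to_segments = {}
        for tri_index, tri in enumerate(triangles):
            segment = tri_index // segment_size
            for vert in tri:
                self.vert_to_segments.setdefault(vert, set()).add(segment)
        self.dirty = set()

    def move_vertex(self, vert):
        """Mark only the segments touching this vertex as needing an update."""
        self.dirty |= self.vert_to_segments.get(vert, set())

    def flush(self):
        """Return the segments to re-upload and clear the dirty set."""
        updated = sorted(self.dirty)
        self.dirty.clear()
        return updated
```

Moving one vertex then dirties only the segments that reference it, rather than the whole mesh.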

Contributor

Added subscriber: @RedMser

Added subscriber: @PetterLundh

Added subscriber: @ckohl_art

Has any thought been given to occlusion culling or LOD/adaptive tessellation?

Uploading geometry to the GPU is expensive, but so is drawing all of it even when it's not contributing much, if anything.

Another idea:

Render the screen smaller / use AMD super resolution now that it's open source?

Author
Member

LODs need to be preprocessed and would not work with GPU selection.
Even if it were supported by OpenGL 3.0, how would adaptive tessellation work with complex objects like the ones Blender supports? Preprocessing this would be a CPU job and would slow things down in the areas where the user is working.

As you already say, uploading geometry is the issue, not rendering the uploaded geometry. FSR would only help with the latter.

I have an idea to experiment with FSR, but that is to increase the detail in icon/button preview rendering.

Added subscriber: @Grady

What about distance and occlusion culling for the viewport?

Sometimes very dense mesh objects are drawn when they don't need to be, such as a 100k polygon diamond ring sitting on the finger of a human character roughly 1km away from the camera, or a finely detailed piano sitting in the next room over behind a wall (as part of a sweeping camera move where it will eventually come into view but hasn't yet).

Distance Culling

A viewport optimisation setting available to users to cull any object if its distance and bounding sphere diameter would result in the object not covering more than Y pixels. A default of 2, for example, could be enough to cull anything that would be virtually invisible anyway. Users with more aggressive culling needs could bump that number as high as they need.
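
The distance culling test could be sketched roughly as follows, assuming a perspective camera with a known vertical field of view and using the common approximation that a sphere's on-screen size scales with diameter over distance. The function names and the example sizes are assumptions for illustration only.

```python
import math


def projected_diameter_px(diameter, distance, fov_y_deg, viewport_height_px):
    """Approximate on-screen diameter (in pixels) of a bounding sphere."""
    # Height of the visible frustum slice at the sphere's distance.
    view_height = 2.0 * distance * math.tan(math.radians(fov_y_deg) / 2.0)
    return diameter / view_height * viewport_height_px


def should_cull(diameter, distance, fov_y_deg, viewport_height_px, min_pixels=2.0):
    """Cull when the sphere would not cover more than min_pixels pixels."""
    if distance <= diameter / 2.0:
        return False  # camera is inside or touching the sphere: never cull
    size = projected_diameter_px(diameter, distance, fov_y_deg, viewport_height_px)
    return size <= min_pixels
```

For instance, a 2 cm ring at 1 km with a 50 degree field of view on a 1080 pixel high viewport projects to a small fraction of a pixel, so it would be culled with the default threshold of 2.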

Occlusion Culling
Occlusion culling could be implemented with the technique "Hierarchical-Z map based occlusion culling".
For reference, description of the approach here: https://rastergrid.com/blog/2010/10/hierarchical-z-map-based-occlusion-culling/
It has the benefit of requiring very little preprocessing or additional work: it's GPU based, and it can be almost automatic with the right approach or manually tweaked for even better performance.

In short, just render a depth only, low resolution pass of only the 'large' objects in the scene, use that to form a tile map of the 'minimum Z value' and use that to determine occlusion of objects before rendering them.
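
A CPU-side sketch of that tile map test, to show the logic only (the technique itself runs on the GPU). It assumes a depth convention where larger values are farther from the camera, so the conservative value to keep per tile is the farthest occluder depth; with the opposite sign convention the same value is the per-tile minimum. All names are hypothetical.

```python
def build_tile_depth(depth, tile):
    """Reduce a low resolution depth buffer to one farthest depth per tile."""
    height, width = len(depth), len(depth[0])
    tiles_y = (height + tile - 1) // tile
    tiles_x = (width + tile - 1) // tile
    tile_map = [[0.0] * tiles_x for _ in range(tiles_y)]
    for y in range(height):
        for x in range(width):
            ty, tx = y // tile, x // tile
            tile_map[ty][tx] = max(tile_map[ty][tx], depth[y][x])
    return tile_map


def is_occluded(tile_map, tile, rect, nearest_depth):
    """rect = (x0, y0, x1, y1), inclusive pixel bounds of the object.

    The object is occluded only if, in every tile it overlaps, even the
    farthest occluder is closer than the object's nearest point.
    """
    x0, y0, x1, y1 = rect
    for ty in range(y0 // tile, y1 // tile + 1):
        for tx in range(x0 // tile, x1 // tile + 1):
            if nearest_depth <= tile_map[ty][tx]:
                return False
    return True
```

The test is conservative: an object that straddles any tile with open space behind the occluders is always drawn.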

Or alternatively, allow users to manually specify which objects are occluders, so users can create low poly shapes to represent the silhouette of some highly detailed objects, or a low poly shape of for example, the wall structure of a large archviz scene, culling out drawing most of the objects in the scene that aren't directly visible.

This could also be used to cull things other than objects, such as lights in Eevee.

+1 for occlusion culling

about the LOD -

'weld vertex' + round vertex to grid, but the grid size changes based on distance from camera - seems to be a really fast LOD even on the CPU

this solution mostly applies to things like terrain
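
A minimal sketch of the "round to grid, then weld" LOD described above. The way the grid size grows with camera distance (`cell_size_for_distance` and its parameters) is a made-up placeholder, since the comment does not specify it, and this is not taken from the attached .blend file.

```python
def cell_size_for_distance(distance, factor=0.01, min_cell=0.001):
    """Hypothetical rule: the grid gets coarser linearly with distance."""
    return max(min_cell, distance * factor)


def snap_and_weld(vertices, triangles, cell_size):
    """Snap vertices to a grid, merge duplicates, drop collapsed triangles."""
    seen = {}    # grid cell -> index of the welded vertex
    welded = []  # snapped vertex positions
    remap = []   # old vertex index -> welded vertex index
    for x, y, z in vertices:
        key = (round(x / cell_size), round(y / cell_size), round(z / cell_size))
        if key not in seen:
            seen[key] = len(welded)
            welded.append((key[0] * cell_size, key[1] * cell_size, key[2] * cell_size))
        remap.append(seen[key])

    new_triangles = []
    for a, b, c in triangles:
        a, b, c = remap[a], remap[b], remap[c]
        # A triangle whose corners were merged has zero area; skip it.
        if a != b and b != c and a != c:
            new_triangles.append((a, b, c))
    return welded, new_triangles
```

The work is a single pass over vertices and triangles, which fits the observation that it is fast even on the CPU; it does nothing to preserve silhouettes, hence the suitability mainly for terrain-like meshes.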

LOD_Test.blend (https://archive.blender.org/developer/F10278095/LOD_Test.blend)
Member

Added subscriber: @EAW

Added subscriber: @SpencerMagnusson

Added subscriber: @Yuro

Philipp Oeser removed the Interest: EEVEE & Viewport label 2023-02-09 15:13:44 +01:00
Jeroen Bakker removed the Status: Needs Triage label 2024-03-18 13:59:08 +01:00
Author
Member

Closing this issue due to the new rewrite of batch extraction. Ideas of this PR are out-dated.

Blender Bot added the Status: Archived label 2024-03-18 14:01:44 +01:00
21 Participants
Reference: blender/blender#87835