OpenSubdiv GPU acceleration #68996
Closed · opened 2019-08-21 16:22:15 +02:00 by Dalai Felinto · 70 comments
48 Participants
Reference: blender/blender#68996
Status: Needs to be formatted as a project once there is someone to tackle this. Including use cases, milestones, task breakdown, etc.
We want to have a per-object subdivision object that operates on top of the entire stack of transformations. The options would be the same as (or very similar to) the existing modifier (when using Catmull-Clark). If the last modifier in the stack is a Subdivision, a heuristic can take care of reconciling both results.
Said subdivision in the viewport is to be performed on the GPU. For rendering it would also use OpenSubdiv but on the CPU.
Added subscriber: @dfelinto
Added subscriber: @xdanic
Added subscriber: @sozap
Added subscriber: @daven
Added subscriber: @item412
Added subscriber: @ucupumar
Added subscriber: @LapisSea
Added subscriber: @SergienkoEugene
Added subscriber: @lemenicier_julien
Added subscriber: @snubilo
Added subscriber: @Le.Citron
Added subscriber: @Fux
Added subscriber: @DanielPaul
Added subscriber: @PolyGreen
Added subscriber: @RORU
Added subscriber: @LucasVeber
Added subscriber: @1seby
Added subscriber: @filibis
Added subscriber: @ogotay
Added subscriber: @Andruxa696
May we please get a status update by a developer on this task? It is my understanding that the two main reasons for implementing OpenSubdiv in Blender were:
Of the two points only the second one is implemented - partially. Modelling of SubD assets using OpenSubdiv creases is well supported but the algorithm needs higher subdivision levels to work properly which makes the actual rendering of scenes that contain more than a few of such assets virtually impossible due to the increased memory consumption (also creases do not work properly with adaptive subdivision and vertex creasing is not implemented at all).
I appreciate the work you developers are doing and understand you cannot fix every problem at once. However, this task has been open since last August while the actual problems with the current implementation have been known since before the 2.8 release. Yet it seems like nothing is happening on that front. Could we please get some feedback by a developer on what the current status on fixing this issue is?
Thanks again for all your ongoing work on this great piece of open source software.
The current SubD modifier slows everything down.
In addition to animation playback (shapes and bones), modeling, cloth simulation and shape creation are slowed down, so you have to do everything without viewing the SubD in the viewport. With a heavy mesh + SubD everything gets jerky. We hope that you can solve this by at least 2.84 or 2.85, thanks!
Why is it lower priority than #68908?
This would allow realtime playback and realtime posing (a mesh cacher can't do that) of rigged characters with subdivision, which is quite essential when animating facial expressions, for example.
Typically, realtime subsurf calculation would avoid mesh caching in many cases, which would be a significant benefit.
Fast SubDiv is a standard that is difficult to give up on.
Added subscriber: @Ace_Dragon
For 2.81 and 2.82, many users acknowledged that you had a lot of loose ends and technical debt to clean up, so we were patient.
Now, and especially with the tracker curfew completing phase 1, the users want action. On BA, you have users threatening to abandon Blender or consign Blender as an app. with no future. I propose that these regressions get tackled for 2.83, and delay 2.83 itself as long as needed to make sure that it at least has subsurf at 2.79 performance or better. Based on what I've read on this site, the core team knows where the bottlenecks are and what could be causing them, so any inaction here will simply be the result of bad priorities and poor management decisions.
There are no mid-range apps under active development, and the only equivalent apps cost over 1K with pricey subscriptions. A lot of people have their very ability to work with CGI tied to Blender, so please don't let them down.
Added subscriber: @brecht
To be clear, there are multiple performance projects for 2020:
These all have equal priority and will be mostly worked on by different developers in parallel. High-poly mesh editing and animation playback both are affected by subdivision surfaces and performance will be looked at in the context of both.
I'm removing the last line from the description since it only adds confusion and is not accurate in general, it depends on the specific use case. For some heavy rigs subdivision surfaces might not be the first concern, for other rigs it may be what is holding back performance.
Added subscriber: @0o00o0oo
Added subscriber: @SamGreen
Added subscriber: @MichaelWeisheim
Added subscriber: @realeyez
Added subscriber: @FrankMartin
Added subscriber: @Sergey
Added subscriber: @lrevardel
Added subscriber: @Miraste
Added subscriber: @2046411367
Added subscriber: @Positivity
Added subscriber: @filiperino
Added subscriber: @KenzieMac130
Added subscriber: @breakingspell
Added subscriber: @stop_using_recaptcha
Added subscriber: @Loxioss
Added subscriber: @MeshVoid
This comment was removed by @LucasVeber
Added subscriber: @moisessalvador
Bump
There is no update. The team is overloaded with a lot of other projects. It is still a planned-to-be-worked-on project, but there is no time allocated for it as far as I know. As soon as there is anything, you'll see it reflected in this task.
P.S. A bump comment an hour after a status update question. Come on ;)
This comment was removed by @LucasVeber
Added subscriber: @JacobMerrill-1
for simple subdivide I have a method that could accelerate things significantly.
this only works for quads / triangles - so 'triangulate' would need to be applied to remove ngons.
basically we take a quad (square) - and use point on line to compare a subdivided patch of quads vs the plane 1 time
then use this data to 'emit' a patch that is skewed on the quad
for triangle I use barycentric transform to compare the points
I use this method currently to emit meshes and join them - (expensive!)
but if one simply created a new mesh doing the same stuff in C it should be really really fast.
parametric_BPR.zip
side note - it can emit shapes in 3d as well - so it could be the basis of a new modifier
https://www.youtube.com/watch?v=E3KGYRnTLak
That is not how we prefer to work though. For the popularity there are tokens.
Such comments introduce an unnecessary noise to the communication, making it harder to follow and (re)read conversation.
Simple subdivisions should be implemented as a BMesh-based modifier, which is to be moved out of the Catmull-Clark modifier.
There are many optimizations possible, but please stay on topic. In this case it is a GPU-side integration.
Added subscriber: @jack.quiver
Hello,
Sorry for the long post, I hope you guys find it as interesting as I found it to actually learn all this...
I wrote a prototype that uses the CUDA implementation of Opensubdiv and I wanted to share my results and experience.
I am sure some of the things I am going to share are obvious to the official Devs, but I feel they could be a good source of discussion.
I also hope they could help the people in this thread to understand why it is not so easy to address this issue and why it will take some time.
A quick disclaimer about me, since this is my first post here.
I am not affiliated in any form with Blender and its development.
I am just a hobbyist who wants to understand a bit more about the internals of this wonderful project.
I consider myself fairly experienced with C/C++, GPU programming and high performance computing in general.
Back to Opensubdiv on GPU:
The bottom line is that there is much more to it than just calling the Opensubdiv CUDA back-end (or any other GPU back-end) to benefit from GPU acceleration.
Ultimately, after several optimizations I have got something in the range of ~1.5x speed-up on my laptop for the entire end-to-end subdivision process with respect to the CPU based version, but I still feel it is not enough and certainly not a game changer as I had hoped.
These are some of the problems I have encountered:
A. Limited use of Opensubdiv:
Opensubdiv is only a fraction of what happens during the entire "BKE_subdiv_to_mesh" (which is the end-to-end subdivision process).
For the test case I was evaluating (more about the test case below) I estimate that less than 40-50% of the entire time is spent in Opensubdiv.
This means that even making the entire Opensubdiv code infinitely fast, we get only 2x speedup end-to-end (Amdahl's law sucks!).
I am not sure if more code can be ported to use Opensubdiv, but I see a lot of interpolations, copies, etc. which are not strictly related to Opensubdiv.
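As a rough illustration of the Amdahl's law estimate above, the bound can be computed directly. This is a minimal sketch: the function name `amdahl_speedup` and its parameters are made up for illustration, and the 40% and 50% fractions are simply the estimates quoted above, not new measurements.

```python
def amdahl_speedup(accelerated_fraction, factor):
    """End-to-end speed-up when only part of the runtime is accelerated.

    accelerated_fraction: share of the original runtime spent in the part
        being accelerated (here: the estimated share spent in Opensubdiv).
    factor: how much faster that part becomes (float("inf") for "free").
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    # Remaining time, as a fraction of the original runtime.
    remaining = (1.0 - accelerated_fraction) + accelerated_fraction / factor
    return float("inf") if remaining == 0.0 else 1.0 / remaining


if __name__ == "__main__":
    for fraction in (0.4, 0.5):
        for factor in (2.0, 10.0, float("inf")):
            speedup = amdahl_speedup(fraction, factor)
            print(f"fraction={fraction:.0%} factor={factor}: {speedup:.2f}x")
```

With a 50% share the ceiling is 2x, and with a 40% share it is about 1.67x, no matter how fast the accelerated part becomes.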
B. Batching:
All the calls to Opensubdiv are currently performed at the granularity of a single vertex, both for the "evaluation" part as well as for the "updates" parts.
The "evaluations" translate to CUDA kernel launches, and "updates" translate to cudaMemcpys in the CUDA back-end (similarly in the other Opensubdiv GPU back-ends).
So, just naively calling the Opensubdiv CUDA back-end with the rest of the code as it is makes the whole thing cripplingly slow, with thousands of minuscule kernel calls and device cudaMemcpys (100x slow-down).
In order to work around this, there is some relatively large refactoring involved, which consists of transforming all calls in the subdiv_foreach.c and subdiv_mesh.c files to operate in batches before calling the "updates" and "evaluations" of the Opensubdiv library. Even forgetting to batch a few of the calls makes the entire approach worthless.
This refactoring involves changing the routines to iterate over the original vertex/edge/loop structures to populate large temporary buffers (as well as creating some output buffers for the results of the Opensubdiv calculations).
After that, the Opensubdiv routines can be called in a few shots.
However, this still requires copying the input data to the GPU, calling the evaluation kernel, copying the output data back from the GPU and then iterating over the output buffers to copy the results back into the original structures.
With these changes, I managed to get only 4 Opensubdiv calls for the single-threaded part, "subdiv_foreach_single_thread_tasks", and 2 extra Opensubdiv calls for each parallel CPU thread operating on "subdiv_foreach_task".
On my system with 4 cores (8 CPU threads are started) this results in 4 + 8 threads * 2 = 20 Opensubdiv calls (for an end-to-end subdivision).
I can force only one CPU thread to work on "subdiv_foreach_task", which results in only 4 + 1 thread * 2 = 6 Opensubdiv calls.
However, while this makes better use of the Opensubdiv library, it makes the end-to-end solution slower because, as explained in point A above, there is a lot of work that needs to be done which is not strictly in Opensubdiv (and more CPU threads are better).
Anyway, with the changes described I got some hope but still I was not able to beat the default CPU implementation.
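The difference between per-vertex calls and batched calls described in this section can be sketched with a toy model. Everything here is hypothetical: `CountingEvaluator` is only a stand-in that counts calls (it is not the Opensubdiv API and does no real subdivision), and `expected_calls` just reproduces the 4 + threads * 2 arithmetic from the text.

```python
class CountingEvaluator:
    """Hypothetical stand-in for a GPU evaluator.

    Each call to evaluate() represents one kernel launch plus its memory
    transfers, regardless of how many points are passed in.
    """

    def __init__(self):
        self.calls = 0

    def evaluate(self, coords):
        self.calls += 1
        # Placeholder math standing in for the real limit-surface evaluation.
        return [(2.0 * u, 2.0 * v) for u, v in coords]


def evaluate_per_vertex(evaluator, coords):
    # One evaluator call per point: many tiny launches and copies.
    return [evaluator.evaluate([c])[0] for c in coords]


def evaluate_batched(evaluator, coords):
    # Gather all inputs into one staging buffer, evaluate once, then the
    # caller scatters the results back into its own structures.
    staging = list(coords)
    return evaluator.evaluate(staging)


def expected_calls(num_threads, single_threaded_calls=4, calls_per_thread=2):
    # Call count from the text: 4 calls in the single-threaded part plus
    # 2 calls for each parallel CPU thread.
    return single_threaded_calls + num_threads * calls_per_thread


if __name__ == "__main__":
    points = [(i * 0.001, i * 0.002) for i in range(1000)]
    naive, batched = CountingEvaluator(), CountingEvaluator()
    assert evaluate_per_vertex(naive, points) == evaluate_batched(batched, points)
    print("per-vertex calls:", naive.calls, "batched calls:", batched.calls)
    print("calls with 8 threads:", expected_calls(8))
    print("calls with 1 thread:", expected_calls(1))
```

Both paths produce the same results; only the number of evaluator calls changes, which is what matters when each call carries a fixed launch and transfer cost.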
C. GPU allocator
I then realized there was an issue with the cudaMalloc allocator.
For each call to Opensubdiv on the GPU side, temporary GPU buffers need to be created (essentially mirrors of the buffers described above), which means a cudaMalloc before the evaluation and a cudaFree after it is done.
This is ultimately very slow, so I had to design a custom allocator in which CUDA buffers are reused for the entire life of the CUDA evaluator.
There is also a possible optimization (which I have not done) that involves creating the temporary CPU buffers in pinned memory with cudaMallocHost(), which should speed up some of the CPU<->GPU transfers.
I will try this later if I have some time.
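The buffer reuse idea can be sketched as follows. This is not the allocator from the prototype, only a minimal illustration of the pattern under stated assumptions: `ReusableBufferPool` is a made-up name, and a Python `bytearray` stands in for device memory that would really come from cudaMalloc.

```python
class ReusableBufferPool:
    """Keeps named buffers alive and only reallocates when one is too small."""

    def __init__(self):
        self._buffers = {}
        self.allocations = 0  # how many times a "cudaMalloc" would happen

    def acquire(self, name, size):
        buf = self._buffers.get(name)
        if buf is None or len(buf) < size:
            # Stand-in for cudaFree of the old buffer + cudaMalloc of a new one.
            buf = bytearray(size)
            self._buffers[name] = buf
            self.allocations += 1
        # Hand out a view of exactly the requested size.
        return memoryview(buf)[:size]

    def release_all(self):
        # Stand-in for freeing everything when the evaluator is destroyed.
        self._buffers.clear()


if __name__ == "__main__":
    pool = ReusableBufferPool()
    for size in (1024, 512, 2048, 2048, 100):
        view = pool.acquire("input", size)
        assert len(view) == size
    print("allocations:", pool.allocations)
    pool.release_all()
```

In this run only the first request and the one that outgrows the existing buffer trigger an allocation; the other three requests reuse memory that is already there.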
Conclusions:
Ultimately on my laptop (4 core i7-6820HQ with a M1000M GPU) a viewport subdivision level 3 of a torus with 48 major segments and 12 minor segments goes from ~90ms CPU based Opensubdiv to ~60ms CUDA based Opensubdiv.
Again, I don't call this a game changer, and given the added complexity with batching it is totally questionable if it is the right approach (unless more work can be pushed to Opensubdiv).
Also, I wanted to share a profile (collected with Nsight Systems) of what is happening in the 60ms of end-to-end subdiv on my laptop.
At the bottom of the plot you can see the GPU activity.
"Memory" are the copy-in (green bars) and copy-out (red bars) to/from the GPU.
"Kernels" are the actual invocations of the main CUDA kernel of the Opensubdiv library (20 blue blocks corresponding to the 20 calls I described above).
You can see that there is a lot of CPU activity (shown in black) which is not part of Opensubdiv, and even if we made Opensubdiv infinitely fast it would still take 40-50ms to execute the end-to-end subdivision.
I hope you found this interesting, let me know if you have some questions.
I can also push a branch somewhere if someone is interested, I would need to do some cleanup before.
Also, I did not implement all the possible cases (especially when it comes to ngons), so it may not work as expected on all scenes.
Regards,
JQuiver
@jack.quiver Thanks a lot for sharing your research on it. I'm pretty sure many users would be interested in testing it, even if it's a work in progress. Patches can be sent here:
https://developer.blender.org/differential/diff/create/
According to the description, the initial plan of this task was to make surface subdivision a mesh setting for GPU evaluation, instead of a modifier, probably to overcome the bottlenecks that you mention. Maybe Blender developers could be interested in collaborating with you on it.
The plan here indeed is to push the data to the GPU and keep it there for drawing. As you found the CPU-GPU transfer has a high latency, which makes transferring data back and forth between CPU and GPU not great for realtime.
I'm not sure we would add an approach that relies on such transfers.
Thanks for the feedback.
Keeping the data entirely on the GPU for subdivision and subsequent rendering would be ideal.
However, I am not sure it can all be achieved only with opensubdiv, it may require some custom extra CUDA kernels to work on the mesh before display.
Also, something to keep in mind is that the Opensubdiv CUDA kernel (or OpenCL) is very poorly optimized (see 9be5a29827/opensubdiv/osd/cudaKernel.cu (L259)). I would not use that kernel as it is in any performance-critical project. Also, the last update to that code was 2 years ago. I wonder what it means for Blender to rely on the Opensubdiv library as it is.
I would like to hear some opinions.
Blender 2.79 had GPU subdivision implemented in a way that data would stay on the GPU.
I think that most likely there is no point in using CUDA or OpenCL backends, we might as well use an OpenGL or future Vulkan backend.
Added subscriber: @Nominous
Added subscriber: @mysticfall
Added subscriber: @jackiejake
Added subscriber: @DarkKnight
Added subscriber: @chadking
So, armature skinning on the CPU should be done in the stack if you need the feedback from the shader.
For playblasts, a "GPU skinning algorithm" should happen "after the stack" with no feedback.
https://github.com/tcoppex/aer-engine/blob/master/demos/aura/data/shaders/Skinning.glsl
maybe we can setup a way users can do this in py if it's not going to be supported in master?
I was thinking that each material could save its shader sources
material.vert_shader
material_frag_shader
and we can have a comment in there to find / replace
at the top near the vertex uniforms
//Begin_Vert_unforms
just before 'final output' we place our offsets here (skinning etc)
//Begin_Vert_offsets
then the user can find / replace these comments with working code / compile the shader and replace the material with it.
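The find / replace step proposed above could look roughly like this in plain Python. It is only a sketch under stated assumptions: `inject_after_marker` is a made-up helper, the GLSL template is a toy example, and no existing Blender Python API for reading or replacing material shader sources is implied.

```python
def inject_after_marker(source, marker, snippet):
    """Return `source` with `snippet` inserted on the line after `marker`.

    Raises ValueError if the marker comment is missing, so a shader that
    does not follow the convention fails loudly instead of silently.
    """
    if marker not in source:
        raise ValueError(f"marker {marker!r} not found in shader source")
    # Replace only the first occurrence and keep the marker in place so the
    # same source can be inspected (or processed) again later.
    return source.replace(marker, marker + "\n" + snippet, 1)


# Toy vertex shader using the offsets marker suggested in the post.
TEMPLATE = """\
in vec3 position;
void main() {
    vec3 final_position = position;
    //Begin_Vert_offsets
    gl_Position = vec4(final_position, 1.0);
}
"""

if __name__ == "__main__":
    patched = inject_after_marker(
        TEMPLATE,
        "//Begin_Vert_offsets",
        "    final_position += vec3(0.0, 0.0, 0.1);  // hypothetical offset",
    )
    print(patched)
```

The uniforms marker would be handled the same way, by passing a different marker string and a snippet containing the extra uniform declarations.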
Added subscriber: @matthewg.3d
So, it's 2021 and Blender still doesn't have a proper OpenSubdiv implementation. I understand that things take time and that it's more fun to work on new features than on fixing existing bugs but this is getting a bit ridiculous. Could we PLEASE get one of the 20+ full time Blender developers to finally look into this issue? Preferably before the heat death of the universe engulfs us all and grinds our souls into interstellar dust?
Changed status from 'Confirmed' to: 'Resolved'
Actually it does in Blender 3.1, see D12406: OpenSubDiv: add support for an OpenGL evaluator.
Further improvements are tracked in #94644 (GPU subdivision improvements).
Thank you for the quick response, Brecht. The summary is a bit technical so I'm not sure if I understood everything correctly, but I'm glad something is finally happening and I look forward to testing it. Thank you!