OpenSubdiv GPU acceleration #68996

New Issue

Dalai Felinto · 2019-08-21T16:22:15+02:00

Dalai Felinto commented

2019-08-21 16:22:15 +02:00

Status: Needs to be formatted as a project once there is someone to tackle this. Including use cases, milestones, task breakdown, etc.

We want to have a per-object subdivision object that operates on top of the entire stack of transformations. The options would be the same (or very similar) to the existing modifier (when using Catmull-Clark), If the last modifier in the stack is a Subdivision, an heuristic can take care of conciliating both results.

Said subdivision in the viewport is to be performed on the GPU. For rendering it would also use OpenSubdiv but on the CPU.

Note that this needs a more indepth design and discussion between the GPU/Viewport team and @Sergey to make a design

**Status:** Needs to be formatted as a [project ](https://wiki.blender.org/wiki/Process/Projects) once there is someone to tackle this. Including use cases, milestones, task breakdown, etc. --- We want to have a per-object subdivision object that operates on top of the entire stack of transformations. The options would be the same (or very similar) to the existing modifier (when using Catmull-Clark), If the last modifier in the stack is a Subdivision, an heuristic can take care of conciliating both results. Said subdivision in the viewport is to be performed on the GPU. For rendering it would also use OpenSubdiv but on the CPU. - Note that this needs a more indepth design and discussion between the GPU/Viewport team and @Sergey to make a design

Dalai Felinto commented

2019-08-21 16:22:15 +02:00

Added subscriber: @dfelinto

Daniel Calderón commented

2019-08-25 16:51:04 +02:00

Added subscriber: @xdanic

Paul Szajner commented

2019-08-25 17:10:32 +02:00

Added subscriber: @sozap

Daven Hughes commented

2019-08-26 14:28:02 +02:00

Added subscriber: @daven

item412 commented

2019-08-27 13:12:06 +02:00

Added subscriber: @item412

Yusuf Umar commented

2019-08-31 06:40:56 +02:00

Added subscriber: @ucupumar

LapisSea commented

2019-09-08 23:38:04 +02:00

Added subscriber: @LapisSea

Eugene commented

2019-09-22 17:50:40 +02:00

Added subscriber: @SergienkoEugene

Julien Leménicier commented

2019-11-04 21:39:12 +01:00

Added subscriber: @lemenicier_julien

Matias commented

2019-11-13 11:27:12 +01:00

Added subscriber: @snubilo

Kristian Emil commented

2019-11-20 14:07:53 +01:00

Added subscriber: @Le.Citron

Tobias Fuchsberger commented

2019-11-20 14:55:23 +01:00

Added subscriber: @Fux

Daniel Paul commented

2019-11-21 12:40:42 +01:00

Added subscriber: @DanielPaul

Remo commented

2019-11-21 17:43:55 +01:00

Added subscriber: @PolyGreen

Riccardo Russi commented

2019-11-23 11:40:27 +01:00

Added subscriber: @RORU

lucas veber commented

2019-11-23 16:28:56 +01:00

Added subscriber: @LucasVeber

Seby commented

2019-11-26 10:20:39 +01:00

Added subscriber: @1seby

Emir Sinan Gürlek commented

2019-11-27 18:53:05 +01:00

Added subscriber: @filibis

Marcus Holtgräwe commented

2019-12-18 18:30:28 +01:00

Added subscriber: @ogotay

Andrey commented

2020-01-19 22:20:31 +01:00

Added subscriber: @Andruxa696

Marcus Holtgräwe commented

2020-02-15 14:53:58 +01:00

May we please get a status update by a developer on this task? It is my understanding that the two main reasons for implementing OpenSubdiv in Blender were:

to enable rendering of SubD models at much higher speed and subdivision levels without the increased memory footprint
to add support for proper creasing on SubD models

Of the two points only the second one is implemented - partially. Modelling of SubD assets using OpenSubdiv creases is well supported but the algorithm needs higher subdivision levels to work properly which makes the actual rendering of scenes that contain more than a few of such assets virtually impossible due to the increased memory consumption (also creases do not work properly with adaptive subdivision and vertex creasing is not implemented at all).

I appreciate the work you developers are doing and understand you cannot fix every problem at once. However, this task has been open since last August while the actual problems with the current implementation have been known since before the 2.8 release. Yet it seems like nothing is happening on that front. Could we please get some feedback by a developer on what the current status on fixing this issue is?

Thanks again for all your ongoing work on this great piece of open source software.

May we please get a status update by a developer on this task? It is my understanding that the two main reasons for implementing OpenSubdiv in Blender were: - to enable rendering of SubD models at much higher speed and subdivision levels without the increased memory footprint - to add support for proper creasing on SubD models Of the two points only the second one is implemented - partially. Modelling of SubD assets using OpenSubdiv creases is well supported but the algorithm needs higher subdivision levels to work properly which makes the actual rendering of scenes that contain more than a few of such assets virtually impossible due to the increased memory consumption (also creases do not work properly with adaptive subdivision and vertex creasing is not implemented at all). I appreciate the work you developers are doing and understand you cannot fix every problem at once. However, this task has been open since last August while the actual problems with the current implementation have been known since before the 2.8 release. Yet it seems like nothing is happening on that front. Could we please get some feedback by a developer on what the current status on fixing this issue is? Thanks again for all your ongoing work on this great piece of open source software.

Riccardo Russi commented

2020-02-19 16:45:43 +01:00

The current SubD modifier slows everything down.
In addition to the animation playback (shape and bones), modeling, cloth simulation and shape creation are slowed down, you have to do everything without viewing the SubD in the viewport. With heavy mesh + SubD everything gets jerky. We hope that you can solve at least 2.84 or 2.85, thanks!

The current SubD modifier slows everything down. In addition to the animation playback (shape and bones), modeling, cloth simulation and shape creation are slowed down, you have to do everything without viewing the SubD in the viewport. With heavy mesh + SubD everything gets jerky. We hope that you can solve at least 2.84 or 2.85, thanks!

lucas veber commented

2020-02-19 16:49:26 +01:00

Why is it lower priority than #68908?
This would allow realtime playbackand realtime posing (a mesh cacher can't do that) of rigged characters with subdivision, which is quite essential when animating facial expressions for example.
Typically, realtime subsurf calculation would avoid mesh caching in many cases, which would be a significant benefit.

Why is it lower priority than #68908? This would allow realtime playback**and realtime posing** (a mesh cacher can't do that) of rigged characters with subdivision, which is quite essential when animating facial expressions for example. Typically, realtime subsurf calculation would avoid mesh caching in many cases, which would be a significant benefit.

Riccardo Russi commented

2020-02-19 16:56:47 +01:00

Fast SubDiv is a standard that is difficult to give up on.

Adam Friesen commented

2020-02-19 17:04:24 +01:00

Added subscriber: @Ace_Dragon

Adam Friesen commented

2020-02-19 17:04:24 +01:00

For 2.81 and 2.82, many users acknowledged that you had a lot of loose ends and technical debt to clean up, so we were patient.

Now, and especially with the tracker curfew completing phase 1, the users want action. On BA, you have users threatening to abandon Blender or consign Blender as an app. with no future. I propose that these regressions get tackled for 2.83, and delay 2.83 itself as long as needed to make sure that it at least has subsurf at 2.79 performance or better. Based on what I've read on this site, the core team knows where the bottlenecks are and what could be causing them, so any inaction here will simply be the result of bad priorities and poor management decisions.

There are no mid-range apps, under active development and the only equivalent apps. cost over 1K with pricey subscriptions, a lot of people have their very ability to work with CGI tied to Blender, please don't let them down.

For 2.81 and 2.82, many users acknowledged that you had a lot of loose ends and technical debt to clean up, so we were patient. Now, and especially with the tracker curfew completing phase 1, the users want action. On BA, you have users threatening to abandon Blender or consign Blender as an app. with no future. I propose that these regressions get tackled for 2.83, and delay 2.83 itself as long as needed to make sure that it at least has subsurf at 2.79 performance or better. Based on what I've read on this site, the core team knows where the bottlenecks are and what could be causing them, so any inaction here will simply be the result of bad priorities and poor management decisions. There are no mid-range apps, under active development and the only equivalent apps. cost over 1K with pricey subscriptions, a lot of people have their very ability to work with CGI tied to Blender, please don't let them down.

Brecht Van Lommel commented

2020-02-19 17:23:45 +01:00

Added subscriber: @brecht

Brecht Van Lommel commented

2020-02-19 17:23:45 +01:00

To be clear, there are multiple performance projects for 2020:

Faster high-poly mesh editing
Faster animation playback
Faster object mode performance

These all have equal priority and will be mostly worked on by different developers in parallel. High-poly mesh editing and animation playback both are affected by subdivision surfaces and performance will be looked at in the context of both.

To be clear, there are multiple [performance projects for 2020](https://code.blender.org/2020/01/2020-blender-big-projects/): * Faster high-poly mesh editing * Faster animation playback * Faster object mode performance These all have equal priority and will be mostly worked on by different developers in parallel. High-poly mesh editing and animation playback both are affected by subdivision surfaces and performance will be looked at in the context of both.

Brecht Van Lommel commented

2020-02-19 17:25:46 +01:00

I'm removing the last line from the description since it only adds confusion and is not accurate in general, it depends on the specific use case. For some heavy rigs subdivision surfaces might not be the first concern, for other rigs it may be what is holding back performance.

KiJeon commented

2020-02-20 03:33:37 +01:00

Added subscriber: @0o00o0oo

Sam Green commented

2020-02-25 10:47:13 +01:00

Added subscriber: @SamGreen

Michael Weisheim commented

2020-02-26 22:09:57 +01:00

Added subscriber: @MichaelWeisheim

Connor McCampbell commented

2020-03-14 08:42:26 +01:00

Added subscriber: @realeyez

Frank Martin commented

2020-03-22 21:47:35 +01:00

Added subscriber: @FrankMartin

Jeroen Bakker commented

2020-04-14 10:43:49 +02:00

Added subscriber: @Sergey

Luc Revardel commented

2020-04-20 13:11:32 +02:00

Added subscriber: @lrevardel

Henry Harrison commented

2020-05-08 00:48:49 +02:00

Added subscriber: @Miraste

hujin commented

2020-05-23 06:46:55 +02:00

Added subscriber: @2046411367

Aivis Ozols commented

2020-06-07 14:31:58 +02:00

Added subscriber: @Positivity

filiperino commented

2020-06-16 09:08:05 +02:00

Added subscriber: @filiperino

Kenzie commented

2020-07-18 20:56:38 +02:00

Added subscriber: @KenzieMac130

Blazer commented

2020-07-30 09:44:53 +02:00

Added subscriber: @breakingspell

no way commented

2020-08-19 12:37:58 +02:00

Added subscriber: @stop_using_recaptcha

Thomas Battistetti commented

2020-09-04 01:16:12 +02:00

Added subscriber: @Loxioss

Chingiz Jumagulov commented

2020-09-15 18:22:12 +02:00

Added subscriber: @MeshVoid

lucas veber commented

2020-10-16 16:19:12 +02:00

This comment was removed by @LucasVeber

*This comment was removed by @LucasVeber*

msalv commented

2020-10-16 17:31:19 +02:00

Added subscriber: @moisessalvador

msalv commented

2020-10-16 17:31:19 +02:00

Bump

Sergey Sharybin commented

2020-10-16 17:52:09 +02:00

There is no update. Team is overloaded with a lot of other projects. It is still a planned-to-be-worked-on project, but there are no time allocated for it a far as I know. As soon as there is anything, you'll see it reflected in this task.

P.S. Bump comment an hour after state update question. Comon ;)

There is no update. Team is overloaded with a lot of other projects. It is still a planned-to-be-worked-on project, but there are no time allocated for it a far as I know. As soon as there is anything, you'll see it reflected in this task. P.S. Bump comment an hour after state update question. Comon ;)

lucas veber commented

2020-10-16 18:20:13 +02:00

This comment was removed by @LucasVeber

*This comment was removed by @LucasVeber*

Jacob Merrill commented

2020-10-16 18:53:43 +02:00

Added subscriber: @JacobMerrill-1

Jacob Merrill commented

2020-10-16 18:53:43 +02:00

for simple subdivide I have a method that could accelerate things significantly for simple subdivide.

this only works for quads / triangles - so 'triangulate' would need to be applied to remove ngons.

basically we take a quad (square) - and use point on line to compare a subdivided patch if quads vs the plane 1 time
then use this data to 'emit' a patch that is skewed on the quad

for triangle I use barycentric transform to compare the points

I use this method currently to emit meshes and join them - (expensive!)

but if one simply created a new mesh doing the same stuff in C it should be really really fast.

parametric_BPR.zip

side note - it can emit shapes in 3d as well - so it could be the basis of a new modifier
https://www.youtube.com/watch?v=E3KGYRnTLak

for simple subdivide I have a method that could accelerate things significantly for simple subdivide. this only works for quads / triangles - so 'triangulate' would need to be applied to remove ngons. basically we take a quad (square) - and use point on line to compare a subdivided patch if quads vs the plane 1 time then use this data to 'emit' a patch that is skewed on the quad for triangle I use barycentric transform to compare the points I use this method currently to emit meshes and join them - (expensive!) but if one simply created a new mesh doing the same stuff in C it should be really really fast. [parametric_BPR.zip](https://archive.blender.org/developer/F9000259/parametric_BPR.zip) side note - it can emit shapes in 3d as well - so it could be the basis of a new modifier https://www.youtube.com/watch?v=E3KGYRnTLak

Sergey Sharybin commented

2020-10-19 09:23:13 +02:00

The "bump" comment witnesses the popularity of this request

That is not how we prefer to work though. For the popularity there are tokens.
Such comments introduce an unnecessary noise to the communication, making it harder to follow and (re)read conversation.

for simple subdivide I have a method that could accelerate things significantly for simple subdivide.

Simple subdivisions should be implemented as a BMesh-based modifier, which is to be moved out of the Catmull-Clark modifier.

There are many various optimizations possible, but please stay on topic. In this case it is a GPU side integration.

> The "bump" comment witnesses the popularity of this request That is not how we prefer to work though. For the popularity there are tokens. Such comments introduce an unnecessary noise to the communication, making it harder to follow and (re)read conversation. > for simple subdivide I have a method that could accelerate things significantly for simple subdivide. Simple subdivisions should be implemented as a BMesh-based modifier, which is to be moved out of the Catmull-Clark modifier. There are many various optimizations possible, but please stay on topic. In this case it is a GPU side integration.

Jack Quiver commented

2020-11-25 01:15:10 +01:00

Added subscriber: @jack.quiver

Jack Quiver commented

2020-11-25 01:15:10 +01:00

Hello,

Sorry for the long post, I hope you guys find it as interesting as I found to actual learn all this...

I wrote a prototype that uses the CUDA implementation of Opensubdiv and I wanted to share my results and experience.
I am sure some of the things I am going to share are obvious to the official Devs, but I feel they could be a good source of discussion.
I also hope they could help the people in this thread to understand why it is not so easy to address this issue and why it will take some time.

A quick disclaimer about me, since this is my first post here.
I am not affiliated in any form with Blender and its development.
I am just an hobbits that wants to understand a bit more about the internals of this wonderful project.
I consider myself fairly experienced with C/C++, GPU programming and high performance computing in general.

Back to Opensubdiv on GPU:
The bottom line is that there is much more to it than just calling Opensubdiv CUDA back-end (or any other GPU back-end) to benefit for GPU acceleration.
Ultimately, after several optimizations I have got something in the range of ~1.5x speed-up on my laptop for the entire end-to-end subdivision process with respect to the CPU based version, but I still feel it is not enough and certainly not a game changer as I had hoped.

These are some of the problems I have encountered:

A. Limited use of Opensubdiv:
Opensubdiv is a only a fraction of what happens during the entire "BKE_subdiv_to_mesh" (which is the end-to-end subdivision process).
For the test case I was evaluating (more about the test case below) I estimate that less than 40-50% of the entire time is spent in Opensubdiv.
This means that even making the entire Opensubdiv code infinitely fast, we get only 2x speedup end-to-end (Amdahl's law sucks!).
I am not sure if more code can be ported to use Opensubdiv, but I see a lot of interpolations, copies, etc. which are not strictly related to Opensubdiv.

B. Batching:
All the calls to Opensubdiv are currently performed at the granularity of a single vertex, both for the "evaluation" part as well as for the "updates" parts.
The "evaluations" translate to CUDA kernel launches, and "updates" translate to cudaMemcpys in the CUDA back-end (similarly in the other Opensubdiv GPU back-ends).

So, just by naively calling the Opensubdiv CUDA with the rest of the code as it is, makes the all thing crippling slow with thousands of minuscule kernel calls and device cudaMemcpys (100x slow-down).
In order to go around this, there is some relatively large refactoring involved, which consists in transforming all calls in the subdiv_foreach.c and subdiv_mesh.c files to operate in batches before calling "updates" and "evaluations" of the Opensubdiv library. Even forgetting to batch few of the calls makes the entire approach worthless.

This refactoring involves changing the routines to iterate on the original vertexes/edges/loops structures to populate large temporary buffers (as well as creating some output buffers for the result of the Opensubdiv calculations).
After that, the Opensubdiv routines can be called in few shots.
However, this still requires copy-in the input data to the GPU, calling the evaluation kernel, copying the output data from the GPU and then iterate on the output buffers to copy back the results in the original structures.

With these changes, I managed to get only 4 Opensubdiv calls for single threaded part of the "subdiv_foreach_single_thread_tasks" and 2 extra Opensubdiv calls for each parallel CPU thread operating on "subdiv_foreach_task".
In my system with 4 cores (8 CPU threads are started) it results in 4 + 8 threads * 2 = 20 Opensubdiv calls (for an end-to-end subdivision).
I can force only one CPU thread to work on "subdiv_foreach_task", which results in only 4 + 1 thread * 2 = 6 Opensubdiv calls.
However, while this makes better use of the Opensubdiv library, it makes the end-to-end solution slower because, as explained in of point A above, there is a lot of work that needs to be done which not strictly in Opensubdiv (and more CPU threads are better).

Anyway, with the changes described I got some hope but still I was not able to beat the default CPU implementation.

C. GPU allocator
I then realized there was an issue was the cudaMalloc allocator.
For each call to Opensubdiv GPU side, temporary GPU buffers need to be created (essentially the mirror buffer of what described above) and a cudaMalloc needs to be performed (and a cudaFree after the evaluation is done).
This ultimately is very slow so I had to design a custom allocator in which CUDA buffers are reused for the entire life of the CUDA evaluator.
There is also a possible optimization (which I have not done) that involves creating the temporary CPU buffers using pinned memory cudaMallocHost() which should speedup some of the CPU<->GPU transfers.
I will try this later if I have some time.

Conclusions:
Ultimately on my laptop (4 core i7-6820HQ with a M1000M GPU) a viewport subdivision level 3 of a torus with 48 major segments and 12 minor segments goes from ~90ms CPU based Opensubdiv to ~60ms CUDA based Opensubdiv.
Again, I don't call this a game changer, and given the added complexity with batching it is totally questionable if it is the right approach (unless more work can be pushed to Opensubdiv).

Also I wanted to share a profiling (collected with CUDA-nsight-system) of what is happening in the 60ms of end-to-end subdiv on my laptop.
At the bottom of the plot you can see the GPU activity.
"Memory" are the copy-in (green bars) and copy-out (red bars) to/from the GPU.
"Kernels" are the actual invocation to the main CUDA kernel of the Opensubdiv library (20 blue blocks corresponding to the the 20 calls I described above).
You can see that there is a lot of CPU activity "black color" which is not part of Opensubdiv and even if we made Opensubdiv infinitely fast we would still get 40-50ms to execute the end-to-end subdivision.

I hope you found this interesting, let me know if you have some questions.
I can also push a branch somewhere if someone is interested, I would need to do some cleanup before.
Also, I did not implement all the possible cases (specially when it comes to ngons) so it may not work as expected on all the scenes.

Regards,

JQuiver

Hello, Sorry for the long post, I hope you guys find it as interesting as I found to actual learn all this... I wrote a prototype that uses the CUDA implementation of Opensubdiv and I wanted to share my results and experience. I am sure some of the things I am going to share are obvious to the official Devs, but I feel they could be a good source of discussion. I also hope they could help the people in this thread to understand why it is not so easy to address this issue and why it will take some time. A quick disclaimer about me, since this is my first post here. I am not affiliated in any form with Blender and its development. I am just an hobbits that wants to understand a bit more about the internals of this wonderful project. I consider myself fairly experienced with C/C++, GPU programming and high performance computing in general. Back to Opensubdiv on GPU: The bottom line is that there is much more to it than just calling Opensubdiv CUDA back-end (or any other GPU back-end) to benefit for GPU acceleration. Ultimately, after several optimizations I have got something in the range of ~1.5x speed-up on my laptop for the entire end-to-end subdivision process with respect to the CPU based version, but I still feel it is not enough and certainly not a game changer as I had hoped. These are some of the problems I have encountered: **A. Limited use of Opensubdiv:** Opensubdiv is a only a fraction of what happens during the entire "BKE_subdiv_to_mesh" (which is the end-to-end subdivision process). For the test case I was evaluating (more about the test case below) I estimate that less than 40-50% of the entire time is spent in Opensubdiv. This means that even making the entire Opensubdiv code infinitely fast, we get only 2x speedup end-to-end (Amdahl's law sucks!). I am not sure if more code can be ported to use Opensubdiv, but I see a lot of interpolations, copies, etc. which are not strictly related to Opensubdiv. **B. Batching:** All the calls to Opensubdiv are currently performed at the granularity of a single vertex, both for the "evaluation" part as well as for the "updates" parts. The "evaluations" translate to CUDA kernel launches, and "updates" translate to cudaMemcpys in the CUDA back-end (similarly in the other Opensubdiv GPU back-ends). So, just by naively calling the Opensubdiv CUDA with the rest of the code as it is, makes the all thing crippling slow with thousands of minuscule kernel calls and device cudaMemcpys (100x slow-down). In order to go around this, there is some relatively large refactoring involved, which consists in transforming all calls in the subdiv_foreach.c and subdiv_mesh.c files to operate in batches before calling "updates" and "evaluations" of the Opensubdiv library. Even forgetting to batch few of the calls makes the entire approach worthless. This refactoring involves changing the routines to iterate on the original vertexes/edges/loops structures to populate large temporary buffers (as well as creating some output buffers for the result of the Opensubdiv calculations). After that, the Opensubdiv routines can be called in few shots. However, this still requires copy-in the input data to the GPU, calling the evaluation kernel, copying the output data from the GPU and then iterate on the output buffers to copy back the results in the original structures. With these changes, I managed to get only 4 Opensubdiv calls for single threaded part of the "subdiv_foreach_single_thread_tasks" and 2 extra Opensubdiv calls for each parallel CPU thread operating on "subdiv_foreach_task". In my system with 4 cores (8 CPU threads are started) it results in 4 + 8 threads * 2 = 20 Opensubdiv calls (for an end-to-end subdivision). I can force only one CPU thread to work on "subdiv_foreach_task", which results in only 4 + 1 thread * 2 = 6 Opensubdiv calls. However, while this makes better use of the Opensubdiv library, it makes the end-to-end solution slower because, as explained in of point A above, there is a lot of work that needs to be done which not strictly in Opensubdiv (and more CPU threads are better). Anyway, with the changes described I got some hope but still I was not able to beat the default CPU implementation. **C. GPU allocator** I then realized there was an issue was the cudaMalloc allocator. For each call to Opensubdiv GPU side, temporary GPU buffers need to be created (essentially the mirror buffer of what described above) and a cudaMalloc needs to be performed (and a cudaFree after the evaluation is done). This ultimately is very slow so I had to design a custom allocator in which CUDA buffers are reused for the entire life of the CUDA evaluator. There is also a possible optimization (which I have not done) that involves creating the temporary CPU buffers using pinned memory cudaMallocHost() which should speedup some of the CPU<->GPU transfers. I will try this later if I have some time. **Conclusions:** Ultimately on my laptop (4 core i7-6820HQ with a M1000M GPU) a viewport subdivision level 3 of a torus with 48 major segments and 12 minor segments goes from ~90ms CPU based Opensubdiv to ~60ms CUDA based Opensubdiv. Again, I don't call this a game changer, and given the added complexity with batching it is totally questionable if it is the right approach (unless more work can be pushed to Opensubdiv). ![timeline-8threads - Copy.jpg](https://archive.blender.org/developer/F9368451/timeline-8threads_-_Copy.jpg) Also I wanted to share a profiling (collected with CUDA-nsight-system) of what is happening in the 60ms of end-to-end subdiv on my laptop. At the bottom of the plot you can see the GPU activity. "Memory" are the copy-in (green bars) and copy-out (red bars) to/from the GPU. "Kernels" are the actual invocation to the main CUDA kernel of the Opensubdiv library (20 blue blocks corresponding to the the 20 calls I described above). You can see that there is a lot of CPU activity "black color" which is not part of Opensubdiv and even if we made Opensubdiv infinitely fast we would still get 40-50ms to execute the end-to-end subdivision. I hope you found this interesting, let me know if you have some questions. I can also push a branch somewhere if someone is interested, I would need to do some cleanup before. Also, I did not implement all the possible cases (specially when it comes to ngons) so it may not work as expected on all the scenes. Regards, JQuiver

lucas veber commented

2020-11-25 11:04:27 +01:00

@jack.quiver Thanks a lot for sharing your researches on it, I'm pretty sure many users would be interested in testing it, even if it's work in progress. Patches can be sent here:
https://developer.blender.org/differential/diff/create/
According to the description, the initial plan of this task was to make surface subdivision a mesh setting for GPU evaluation, instead of a modifier, probably to overcome the bottlenecks that you mention. Maybe Blender developers could be interesting in collaborating with you on it.

@jack.quiver Thanks a lot for sharing your researches on it, I'm pretty sure many users would be interested in testing it, even if it's work in progress. Patches can be sent here: https://developer.blender.org/differential/diff/create/ According to the description, the initial plan of this task was to make surface subdivision a mesh setting for GPU evaluation, instead of a modifier, probably to overcome the bottlenecks that you mention. Maybe Blender developers could be interesting in collaborating with you on it.

Brecht Van Lommel commented

2020-11-25 16:39:56 +01:00

The plan here indeed is to push the data to the GPU and keep it there for drawing. As you found the CPU-GPU transfer has a high latency, which makes transferring data back and forth between CPU and GPU not great for realtime.

I'm not sure we would add an approach that relies on such transfers.

The plan here indeed is to push the data to the GPU and keep it there for drawing. As you found the CPU-GPU transfer has a high latency, which makes transferring data back and forth between CPU and GPU not great for realtime. I'm not sure we would add an approach that relies on such transfers.

Jack Quiver commented

2020-11-25 18:12:42 +01:00

Thanks for the feedbacks.

Keeping the data entirely on GPU for subdivision and sub-sequent rendering would be ideal.
However, I am not sure it can all be achieved only with opensubdiv, it may require some custom extra CUDA kernels to work on the mesh before display.

Also, something to keep in mind is that the Opensubdiv CUDA kernel (or OpenCL) is very poorly optimized (see here 9be5a29827/opensubdiv/osd/cudaKernel.cu (L259)). I would not use that kernel as it is in any performance critical project. Also, the last update on that code was 2 years ago.

I wonder what it means to Blender in relying on the Opensubdiv library as it is.
I would like to hear some opinions.

Thanks for the feedbacks. Keeping the data entirely on GPU for subdivision and sub-sequent rendering would be ideal. However, I am not sure it can all be achieved only with opensubdiv, it may require some custom extra CUDA kernels to work on the mesh before display. Also, something to keep in mind is that the Opensubdiv CUDA kernel (or OpenCL) is very poorly optimized (see here https://github.com/PixarAnimationStudios/OpenSubdiv/blob/9be5a298270a42d89d7925c8cdd9e335869ec6ce/opensubdiv/osd/cudaKernel.cu#L259). I would not use that kernel as it is in any performance critical project. Also, the last update on that code was 2 years ago. I wonder what it means to Blender in relying on the Opensubdiv library as it is. I would like to hear some opinions.

Brecht Van Lommel commented

2020-11-26 11:52:11 +01:00

Blender 2.79 had GPU subdivision implemented in a way that data would stay on the GPU.

I think that most likely there is no point in using CUDA or OpenCL backends, we might as well use an OpenGL or future Vulkan backend.

Blender 2.79 had GPU subdivision implemented in a way that data would stay on the GPU. I think that most likely there is no point in using CUDA or OpenCL backends, we might as well use an OpenGL or future Vulkan backend.

Min commented

2020-12-03 20:34:31 +01:00

Added subscriber: @Nominous

Xavier Cho commented

2021-01-26 01:42:09 +01:00

Added subscriber: @mysticfall

jackie commented

2021-02-02 05:28:56 +01:00

Added subscriber: @jackiejake

Silas Opel commented

2021-06-03 16:50:54 +02:00

Added subscriber: @DarkKnight

MMaker commented

2021-06-13 10:27:22 +02:00

Added subscriber: @chadking

Jacob Merrill commented

2021-09-07 02:18:10 +02:00

So, Armature skinning on the CPU should be done in the stack - if you need the feedback from the shader

for playblasts a 'GPU skinning algorithm" should happen "After the stack" with no feedback.

https://github.com/tcoppex/aer-engine/blob/master/demos/aura/data/shaders/Skinning.glsl

maybe we can setup a way users can do this in py if it's not going to be supported in master?

I was thinking that each material could save it's shader sources

material.vert_shader
material_frag_shader

and we can have a comment in there to find / replace

at the top near the vertex uniforms
//Begin_Vert_unforms

just before 'final output' we place our offsets here (skinning etc)

//Begin_Vert_offsets

then the user can find / replace these comments with working code / compile the shader and replace the material with it.

So, Armature skinning on the CPU should be done in the stack - if you need the feedback from the shader for playblasts a 'GPU skinning algorithm" should happen "After the stack" with no feedback. https://github.com/tcoppex/aer-engine/blob/master/demos/aura/data/shaders/Skinning.glsl maybe we can setup a way users can do this in py if it's not going to be supported in master? I was thinking that each material could save it's shader sources material.vert_shader material_frag_shader and we can have a comment in there to find / replace at the top near the vertex uniforms //Begin_Vert_unforms # just before 'final output' we place our offsets here (skinning etc) //Begin_Vert_offsets then the user can find / replace these comments with working code / compile the shader and replace the material with it.

Matthew commented

2021-12-24 07:46:16 +01:00

Added subscriber: @matthewg.3d

Marcus Holtgräwe commented

2022-01-10 14:48:16 +01:00

So, it's 2021 and Blender still doesn't have a proper OpenSubdiv implementation. I understand that things take time and that it's more fun to work on new features than on fixing existing bugs but this is getting a bit ridiculous. Could we PLEASE get one of the 20+ full time Blender developers to finally look into this issue? Preferably before the heat death of the universe engulfs us all and grinds our souls into interstellar dust?

So, it's 2021 and Blender *still* doesn't have a proper OpenSubdiv implementation. I understand that things take time and that it's more fun to work on new features than on fixing existing bugs but this is getting a bit ridiculous. Could we *PLEASE* get one of the 20+ full time Blender developers to finally look into this issue? Preferably before the heat death of the universe engulfs us all and grinds our souls into interstellar dust?

Brecht Van Lommel commented

2022-01-10 14:50:55 +01:00

Changed status from 'Confirmed' to: 'Resolved'

Brecht Van Lommel closed this issue

2022-01-10 14:50:55 +01:00

Brecht Van Lommel self-assigned this 2022-01-10 14:50:55 +01:00

Brecht Van Lommel commented

2022-01-10 14:50:55 +01:00

Actually it does in Blender 3.1, see D12406: OpenSubDiv: add support for an OpenGL evaluator.

Further improvements are tracked in #94644 (GPU subdivision improvements).

Actually it does in Blender 3.1, see [D12406: OpenSubDiv: add support for an OpenGL evaluator](https://archive.blender.org/developer/D12406). Further improvements are tracked in #94644 (GPU subdivision improvements).

Marcus Holtgräwe commented

2022-01-10 15:00:35 +01:00

Thank you for the quick response, Brecht. The summary is a bit technical so I'm not sure if I understood everything correctly, but I'm glad something is finally happening and I look forward to testing it. Thank you!

Sign in to join this conversation.

No Label

Download

What's New

Blender Studio

Manual

Developers Blog

Documentation

Benchmark

Blender Conference

Development Fund

One-time Donations

OpenSubdiv GPU acceleration #68996

just before 'final output' we place our offsets here (skinning etc)