OpenSubdiv GPU acceleration #68996

Closed
opened 2019-08-21 16:22:15 +02:00 by Dalai Felinto · 70 comments

**Status:** Needs to be formatted as a [project](https://wiki.blender.org/wiki/Process/Projects) once there is someone to tackle this, including use cases, milestones, task breakdown, etc.


We want to have a per-object subdivision option that operates on top of the entire stack of transformations. The options would be the same as (or very similar to) the existing modifier's (when using Catmull-Clark). If the last modifier in the stack is a Subdivision, a heuristic can take care of reconciling both results.

Said subdivision in the viewport is to be performed on the GPU. For rendering it would also use OpenSubdiv but on the CPU.

  • Note that this needs a more in-depth design and discussion between the GPU/Viewport team and @Sergey.
Author
Owner

Added subscriber: @dfelinto

Added subscriber: @xdanic

Added subscriber: @sozap

Added subscriber: @daven

Added subscriber: @item412

Added subscriber: @ucupumar

Added subscriber: @LapisSea

Added subscriber: @SergienkoEugene

Added subscriber: @lemenicier_julien

Added subscriber: @snubilo

Added subscriber: @Le.Citron

Added subscriber: @Fux

Added subscriber: @DanielPaul

Added subscriber: @PolyGreen

Added subscriber: @RORU

Added subscriber: @LucasVeber

Added subscriber: @1seby

Added subscriber: @filibis

Added subscriber: @ogotay

Added subscriber: @Andruxa696

May we please get a status update by a developer on this task? It is my understanding that the two main reasons for implementing OpenSubdiv in Blender were:

  • to enable rendering of SubD models at much higher speed and subdivision levels without the increased memory footprint
  • to add support for proper creasing on SubD models

Of the two points only the second one is implemented - partially. Modelling of SubD assets using OpenSubdiv creases is well supported, but the algorithm needs higher subdivision levels to work properly, which makes the actual rendering of scenes that contain more than a few such assets virtually impossible due to the increased memory consumption (also, creases do not work properly with adaptive subdivision, and vertex creasing is not implemented at all).

I appreciate the work you developers are doing and understand you cannot fix every problem at once. However, this task has been open since last August while the actual problems with the current implementation have been known since before the 2.8 release. Yet it seems like nothing is happening on that front. Could we please get some feedback by a developer on what the current status on fixing this issue is?

Thanks again for all your ongoing work on this great piece of open source software.

The current SubD modifier slows everything down.
In addition to animation playback (shapes and bones), modeling, cloth simulation and shape creation are slowed down; you have to do everything without viewing the SubD result in the viewport. With a heavy mesh + SubD everything gets jerky. We hope this can be solved by 2.84 or 2.85, thanks!

Why is it lower priority than #68908?
This would allow realtime playback **and realtime posing** (a mesh cacher can't do that) of rigged characters with subdivision, which is quite essential when animating facial expressions, for example.
Typically, realtime subsurf calculation would avoid mesh caching in many cases, which would be a significant benefit.

Fast SubDiv is a standard that is difficult to give up on.

Added subscriber: @Ace_Dragon

For 2.81 and 2.82, many users acknowledged that you had a lot of loose ends and technical debt to clean up, so we were patient.

Now, especially with the tracker curfew completing phase 1, the users want action. On BA, you have users threatening to abandon Blender or write Blender off as an app with no future. I propose that these regressions get tackled for 2.83, and that 2.83 itself be delayed as long as needed to make sure it at least has subsurf at 2.79 performance or better. Based on what I've read on this site, the core team knows where the bottlenecks are and what could be causing them, so any inaction here will simply be the result of bad priorities and poor management decisions.

There are no mid-range apps under active development, and the only equivalent apps cost over $1K with pricey subscriptions; a lot of people have their very ability to work with CGI tied to Blender, please don't let them down.

Added subscriber: @brecht

To be clear, there are multiple [performance projects for 2020](https://code.blender.org/2020/01/2020-blender-big-projects/):

  • Faster high-poly mesh editing
  • Faster animation playback
  • Faster object mode performance

These all have equal priority and will be mostly worked on by different developers in parallel. High-poly mesh editing and animation playback both are affected by subdivision surfaces and performance will be looked at in the context of both.

I'm removing the last line from the description since it only adds confusion and is not accurate in general, it depends on the specific use case. For some heavy rigs subdivision surfaces might not be the first concern, for other rigs it may be what is holding back performance.

Added subscriber: @0o00o0oo

Added subscriber: @SamGreen

Added subscriber: @MichaelWeisheim

Added subscriber: @realeyez

Added subscriber: @FrankMartin
Member

Added subscriber: @Sergey

Added subscriber: @lrevardel

Added subscriber: @Miraste

Added subscriber: @2046411367

Added subscriber: @Positivity

Added subscriber: @filiperino
Contributor

Added subscriber: @KenzieMac130

Added subscriber: @breakingspell

Added subscriber: @stop_using_recaptcha

Added subscriber: @Loxioss

Added subscriber: @MeshVoid

*This comment was removed by @LucasVeber*

Added subscriber: @moisessalvador

Bump

There is no update. The team is overloaded with a lot of other projects. It is still a planned-to-be-worked-on project, but there is no time allocated for it as far as I know. As soon as there is anything, you'll see it reflected in this task.

P.S. A bump comment an hour after a status update question. C'mon ;)

*This comment was removed by @LucasVeber*

Added subscriber: @JacobMerrill-1

For simple subdivide I have a method that could accelerate things significantly.

This only works for quads/triangles, so 'triangulate' would need to be applied to remove ngons.

Basically we take a quad (square) and use point-on-line to compare a subdivided patch of quads vs the plane one time,
then use this data to 'emit' a patch that is skewed onto the quad.

For triangles I use a barycentric transform to compare the points.

I use this method currently to emit meshes and join them (expensive!), but if one simply created a new mesh doing the same stuff in C it should be really, really fast.

[parametric_BPR.zip](https://archive.blender.org/developer/F9000259/parametric_BPR.zip)

Side note: it can emit shapes in 3D as well, so it could be the basis of a new modifier:
https://www.youtube.com/watch?v=E3KGYRnTLak

The "bump" comment witnesses the popularity of this request

That is not how we prefer to work though. For the popularity there are tokens.
Such comments introduce an unnecessary noise to the communication, making it harder to follow and (re)read conversation.

for simple subdivide I have a method that could accelerate things significantly for simple subdivide.

Simple subdivisions should be implemented as a BMesh-based modifier, which is to be moved out of the Catmull-Clark modifier.

There are many various optimizations possible, but please stay on topic. In this case it is a GPU side integration.

> The "bump" comment witnesses the popularity of this request That is not how we prefer to work though. For the popularity there are tokens. Such comments introduce an unnecessary noise to the communication, making it harder to follow and (re)read conversation. > for simple subdivide I have a method that could accelerate things significantly for simple subdivide. Simple subdivisions should be implemented as a BMesh-based modifier, which is to be moved out of the Catmull-Clark modifier. There are many various optimizations possible, but please stay on topic. In this case it is a GPU side integration.

Added subscriber: @jack.quiver

Hello,

Sorry for the long post, I hope you find it as interesting as I found actually learning all this...

I wrote a prototype that uses the CUDA implementation of OpenSubdiv and I wanted to share my results and experience.
I am sure some of the things I am going to share are obvious to the official devs, but I feel they could be a good source of discussion.
I also hope they help the people in this thread understand why it is not so easy to address this issue and why it will take some time.

A quick disclaimer about me, since this is my first post here.
I am not affiliated in any form with Blender and its development.
I am just a hobbyist who wants to understand a bit more about the internals of this wonderful project.
I consider myself fairly experienced with C/C++, GPU programming and high-performance computing in general.

Back to OpenSubdiv on GPU:
The bottom line is that there is much more to it than just calling the OpenSubdiv CUDA back-end (or any other GPU back-end) to benefit from GPU acceleration.
Ultimately, after several optimizations, I got something in the range of a ~1.5x speed-up on my laptop for the entire end-to-end subdivision process with respect to the CPU-based version, but I still feel it is not enough and certainly not the game changer I had hoped for.

These are some of the problems I have encountered:

**A. Limited use of OpenSubdiv:**
OpenSubdiv is only a fraction of what happens during the entire "BKE_subdiv_to_mesh" (which is the end-to-end subdivision process).
For the test case I was evaluating (more about the test case below), I estimate that less than 40-50% of the entire time is spent in OpenSubdiv.
This means that even making the entire OpenSubdiv code infinitely fast, we get only a 2x speedup end-to-end (Amdahl's law sucks!).
I am not sure if more code can be ported to use OpenSubdiv, but I see a lot of interpolations, copies, etc. which are not strictly related to OpenSubdiv.

**B. Batching:**
All the calls to OpenSubdiv are currently performed at the granularity of a single vertex, both for the "evaluation" part and for the "update" part.
The "evaluations" translate to CUDA kernel launches, and the "updates" translate to cudaMemcpys in the CUDA back-end (similarly in the other OpenSubdiv GPU back-ends).

So, naively calling the OpenSubdiv CUDA back-end with the rest of the code as it is makes the whole thing cripplingly slow, with thousands of minuscule kernel calls and device cudaMemcpys (a 100x slow-down).
To work around this, some relatively large refactoring is involved, which consists in transforming all the calls in the subdiv_foreach.c and subdiv_mesh.c files to operate in batches before calling the "updates" and "evaluations" of the OpenSubdiv library. Even forgetting to batch a few of the calls makes the entire approach worthless.

This refactoring involves changing the routines to iterate over the original vertex/edge/loop structures and populate large temporary buffers (as well as creating output buffers for the results of the OpenSubdiv calculations).
After that, the OpenSubdiv routines can be called in a few shots.
However, this still requires copying the input data to the GPU, calling the evaluation kernel, copying the output data back from the GPU and then iterating over the output buffers to copy the results back into the original structures.

With these changes, I managed to get down to only 4 OpenSubdiv calls for the single-threaded part ("subdiv_foreach_single_thread_tasks") plus 2 extra OpenSubdiv calls for each parallel CPU thread operating on "subdiv_foreach_task".
On my system with 4 cores (8 CPU threads are started) this results in 4 + 8 threads * 2 = 20 OpenSubdiv calls (for an end-to-end subdivision).
I can force only one CPU thread to work on "subdiv_foreach_task", which results in only 4 + 1 thread * 2 = 6 OpenSubdiv calls.
However, while this makes better use of the OpenSubdiv library, it makes the end-to-end solution slower because, as explained in point A above, there is a lot of work that is not strictly in OpenSubdiv (and more CPU threads are better).

Anyway, with the changes described I got some hope, but I was still not able to beat the default CPU implementation.

**C. GPU allocator:**
I then realized there was an issue with the cudaMalloc allocator.
For each call to the OpenSubdiv GPU side, temporary GPU buffers need to be created (essentially the mirrors of the buffers described above), and a cudaMalloc needs to be performed (plus a cudaFree after the evaluation is done).
This ultimately is very slow, so I had to design a custom allocator in which CUDA buffers are reused for the entire life of the CUDA evaluator.
There is also a possible optimization (which I have not done) that involves creating the temporary CPU buffers using pinned memory (cudaMallocHost()), which should speed up some of the CPU<->GPU transfers.
I will try this later if I have some time.

**Conclusions:**
Ultimately, on my laptop (4-core i7-6820HQ with an M1000M GPU), a viewport subdivision at level 3 of a torus with 48 major segments and 12 minor segments goes from ~90ms with CPU-based OpenSubdiv to ~60ms with CUDA-based OpenSubdiv.
Again, I don't call this a game changer, and given the added complexity of batching it is totally questionable whether it is the right approach (unless more work can be pushed to OpenSubdiv).

![timeline-8threads - Copy.jpg](https://archive.blender.org/developer/F9368451/timeline-8threads_-_Copy.jpg)

Also, I wanted to share a profile (collected with CUDA Nsight Systems) of what is happening in the 60ms of end-to-end subdiv on my laptop.
At the bottom of the plot you can see the GPU activity.
"Memory" is the copy-in (green bars) and copy-out (red bars) to/from the GPU.
"Kernels" are the actual invocations of the main CUDA kernel of the OpenSubdiv library (20 blue blocks corresponding to the 20 calls I described above).
You can see that there is a lot of CPU activity (black) which is not part of OpenSubdiv, and even if we made OpenSubdiv infinitely fast we would still need 40-50ms to execute the end-to-end subdivision.

I hope you found this interesting; let me know if you have questions.
I can also push a branch somewhere if someone is interested; I would need to do some cleanup first.
Also, I did not implement all the possible cases (especially when it comes to ngons), so it may not work as expected on all scenes.

Regards,

JQuiver

@jack.quiver Thanks a lot for sharing your research on this, I'm pretty sure many users would be interested in testing it, even if it's a work in progress. Patches can be sent here:
https://developer.blender.org/differential/diff/create/
According to the description, the initial plan of this task was to make surface subdivision a mesh setting for GPU evaluation, instead of a modifier, probably to overcome the bottlenecks that you mention. Maybe Blender developers could be interested in collaborating with you on it.

The plan here indeed is to push the data to the GPU and keep it there for drawing. As you found, CPU-GPU transfers have high latency, which makes moving data back and forth between CPU and GPU not great for realtime use.

I'm not sure we would add an approach that relies on such transfers.

Thanks for the feedback.

Keeping the data entirely on the GPU for subdivision and subsequent rendering would be ideal.
However, I am not sure it can all be achieved with OpenSubdiv alone; it may require some custom extra CUDA kernels to work on the mesh before display.

Also, something to keep in mind is that the OpenSubdiv CUDA kernel (or the OpenCL one) is very poorly optimized (see https://github.com/PixarAnimationStudios/OpenSubdiv/blob/9be5a298270a42d89d7925c8cdd9e335869ec6ce/opensubdiv/osd/cudaKernel.cu#L259). I would not use that kernel as-is in any performance-critical project. Also, the last update to that code was 2 years ago.

I wonder what it means for Blender to rely on the OpenSubdiv library as it is.
I would like to hear some opinions.

Blender 2.79 had GPU subdivision implemented in a way that data would stay on the GPU.

I think that most likely there is no point in using CUDA or OpenCL backends, we might as well use an OpenGL or future Vulkan backend.

Added subscriber: @Nominous

Added subscriber: @mysticfall

Added subscriber: @jackiejake

Added subscriber: @DarkKnight

Added subscriber: @chadking

So, armature skinning on the CPU should be done in the stack - if you need the feedback from the shader.

For playblasts, a 'GPU skinning algorithm' should happen 'after the stack' with no feedback.

https://github.com/tcoppex/aer-engine/blob/master/demos/aura/data/shaders/Skinning.glsl

Maybe we can set up a way users can do this in Python if it's not going to be supported in master?

I was thinking that each material could save its shader sources:

material.vert_shader
material.frag_shader

and we could have a comment in there to find/replace.

At the top, near the vertex uniforms:

//Begin_Vert_uniforms

Just before the final output, we place our offsets here (skinning etc.):

//Begin_Vert_offsets

Then the user can find/replace these comments with working code, compile the shader, and replace the material with it.
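The linked Skinning.glsl does standard linear blend skinning, where each vertex is a weighted blend of its bone transforms. The same computation, sketched in Python as a CPU stand-in for what the vertex (or compute) shader would do per vertex:

```python
# Linear blend skinning: v' = sum_i w_i * (M_i @ v), with sum(w_i) == 1.
# On the GPU this runs per vertex in the shader, which is why it can
# happen "after the stack" with no feedback to the CPU-side mesh.

def skin_vertex(v, bone_matrices, bone_ids, weights):
    """v: (x, y, z); bone_matrices: list of 3x4 row-major affine matrices."""
    out = [0.0, 0.0, 0.0]
    for bid, w in zip(bone_ids, weights):
        m = bone_matrices[bid]
        for r in range(3):
            out[r] += w * (m[r][0]*v[0] + m[r][1]*v[1] + m[r][2]*v[2] + m[r][3])
    return tuple(out)

identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
shift_x  = [[1, 0, 0, 2], [0, 1, 0, 0], [0, 0, 1, 0]]  # translate +2 on X
# Vertex weighted half by a static bone, half by one moved +2 in X:
print(skin_vertex((1.0, 0.0, 0.0), [identity, shift_x], [0, 1], [0.5, 0.5]))
# (2.0, 0.0, 0.0)
```

This is also why GPU skinning gives no feedback: the deformed positions exist only in shader outputs unless they are explicitly read back.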

Added subscriber: @matthewg.3d

Added subscriber: @matthewg.3d

So, it's 2021 and Blender still doesn't have a proper OpenSubdiv implementation. I understand that things take time and that it's more fun to work on new features than on fixing existing bugs but this is getting a bit ridiculous. Could we PLEASE get one of the 20+ full time Blender developers to finally look into this issue? Preferably before the heat death of the universe engulfs us all and grinds our souls into interstellar dust?

Changed status from 'Confirmed' to: 'Resolved'
Brecht Van Lommel self-assigned this 2022-01-10 14:50:55 +01:00

Actually it does in Blender 3.1, see D12406: OpenSubDiv: add support for an OpenGL evaluator.

Further improvements are tracked in #94644 (GPU subdivision improvements).

Thank you for the quick response, Brecht. The summary is a bit technical so I'm not sure if I understood everything correctly, but I'm glad something is finally happening and I look forward to testing it. Thank you!
Reference: blender/blender#68996