GPU: API Redesign (high level) #120174

opened 2024-04-02 15:03:39 +02:00 by Jeroen Bakker · 3 comments

# GPU API Redesign

## Management info (TLDR)

Feedback I received on the Vulkan render graph design, my own analysis, and external expertise all point in the same direction: in the future we will require a render graph at the GPU level that replaces/integrates parts of the draw module. At this time it is unclear what that API will look like, but at a high level its requirements can be described.

This design task elaborates on how we arrived at these requirements and how they could potentially influence the GPU module and its users as a whole.

The outcome of this task is to continue with a Vulkan-specific render graph, while taking into account that within a year this could lead to a GPU-level API. That new API would exist beside the other APIs (GPUBatch, GPUImmediate), because fully replacing them has too large an impact. Realistically, the draw manager and the Python GPU module would eventually work fully on the GPU render graph API. Prototyping is needed to design the detailed API.

## References

- #118330 - Vulkan render graph design to solve synchronization issues.
- #118963 - Prototype of a Vulkan render graph implementation.

## Current state

Currently we have several APIs related to GPU drawing.

```mermaid
classDiagram
  namespace UI {
    class Editors
  }
  namespace Python {
    class PyGPU
  }
  namespace GPU {
    class GPUBatch
    class GPUImmediate
  }
  namespace Draw {
    class DrawManager
  }
  <<Interface>> GPUBatch
  <<Interface>> GPUImmediate
  <<Interface>> DrawManager

  Editors --> GPUBatch
  Editors --> GPUImmediate
  Editors --> DrawManager
  PyGPU --> GPUBatch
  DrawManager --> GPUBatch
```

### GPUImmediate

`GPUImmediate` is a compatibility layer that provides the pre-core OpenGL programming model. It was introduced during Blender 2.8 when we switched to the OpenGL core profile.

The API should be deprecated and its usages replaced by `GPUBatch`. However, because this API is easy to understand for non-GPU developers and we never promoted it as deprecated, it is actually the most commonly used API for UI/editor drawing.

A drawback of this API is that geometry/data is not kept on the GPU and must be resent every time.

### GPUBatch

`GPUBatch` uses a shader-based approach: geometry batches are created and uploaded once, and as long as the geometry doesn't change, a batch can be reused by other shaders or in the next frame.

Because geometry batches can be prepared ahead of time, it is faster than GPUImmediate mode.
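The performance difference can be illustrated with a small counting model. This is a hypothetical sketch, not the real Blender API: immediate mode re-sends the vertex data on every draw, while a batch uploads once and every subsequent draw reuses the GPU-side copy.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical cost model: every upload transfers the vertex data again.
struct CountingDevice {
  std::size_t bytes_uploaded = 0;
  void upload(const std::vector<float> &verts) {
    bytes_uploaded += verts.size() * sizeof(float);
  }
};

// Immediate mode: data is sent on every draw call.
void draw_immediate(CountingDevice &dev, const std::vector<float> &verts) {
  dev.upload(verts);
}

// Batch: data is uploaded once at creation; draws reuse the GPU copy.
struct Batch {
  Batch(CountingDevice &dev, const std::vector<float> &verts) {
    dev.upload(verts);
  }
  void draw() const { /* no re-upload needed */ }
};
```

Drawing the same geometry ten times costs ten uploads in the immediate model but only one in the batch model, which is the whole argument for migrating immediate-mode callers.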

### DrawManager

`DrawManager` is an API on top of `GPUBatch` that adds a programming model for high-performance rendering. It is typically used to draw the 3D viewport, where performance matters.

The API applies best practices for ordering draw commands to reduce context switches.

## Limitations

The current APIs have a disconnect with how modern GL-APIs are structured, which leads to suboptimal performance and complex translation code in our backends.

### Pipelines

Modern GL-APIs (Vulkan/Metal) provide access to pipelines. A pipeline is a configuration of the GPU for performing draw or compute commands.

Most changes to the pipeline configuration can trigger a recompilation of the pipeline, including a recompilation of the shader. This can already happen by changing the blend mode using `GPU_blend`, or by using geometry with a different layout.
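Why a blend-mode change can be expensive is easier to see with a sketch of how a backend typically caches pipelines: the full state vector is the cache key, so any new combination of shader, blend mode, or vertex layout forces a fresh (slow) pipeline compilation. All names here are illustrative, not the real backend code.

```cpp
#include <map>
#include <string>
#include <tuple>

enum class Blend { None, Alpha, AlphaPremul };

// The key contains everything baked into a pipeline object: a change in any
// field means the cached pipeline cannot be reused.
using PipelineKey =
    std::tuple<std::string /*shader*/, Blend, std::string /*vertex_layout*/>;

struct PipelineCache {
  std::map<PipelineKey, int> cache;  // value stands in for a compiled pipeline
  int compilations = 0;

  int get_or_compile(const PipelineKey &key) {
    auto it = cache.find(key);
    if (it != cache.end()) return it->second;  // cheap: reuse
    compilations++;                            // expensive: full compile
    return cache[key] = compilations;
  }
};
```

Drawing the same shader with a different blend mode or a different vertex layout misses the cache, which is exactly the hidden cost that state-change calls like `GPU_blend` can incur on Vulkan/Metal.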

### Texture layouts

Pixels of textures that are used in a pipeline must be in a specific layout. The layout depends on how the texture is used inside the pipeline.

There are different layouts for:

- `TRANSFER_READ/WRITE`
- `SHADER_READ/WRITE`
- `ATTACHMENT`
- `PRESENT`

When transitioning a texture to a new layout, the previous layout needs to be provided as well. Layout changes are performed via so-called pipeline barriers. A barrier can alter the layout of a whole texture, but also of a single layer or LOD level; in that case a texture can have a mixed layout.

More information about pipeline barriers will be provided in the section about resource versions.
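A minimal sketch of the bookkeeping this implies (the layout names mirror the list above; the types are hypothetical): the tracker must remember the current layout of every mip level so it can fill in the "old layout" side of a barrier, and a texture whose mips were transitioned individually ends up in a mixed layout.

```cpp
#include <vector>

enum class Layout { Undefined, TransferWrite, ShaderRead, Attachment, Present };

struct Barrier {
  Layout old_layout;
  Layout new_layout;
  int mip;
};

// Tracks the layout of each mip level of one texture.
struct TextureLayoutTracker {
  std::vector<Layout> mips;
  explicit TextureLayoutTracker(int mip_count) : mips(mip_count, Layout::Undefined) {}

  // A transition must know the previous layout of that mip to build the barrier.
  Barrier transition(int mip, Layout new_layout) {
    Barrier b{mips[mip], new_layout, mip};
    mips[mip] = new_layout;
    return b;
  }

  // True when not all mips share the same layout.
  bool has_mixed_layout() const {
    for (const Layout l : mips)
      if (l != mips[0]) return true;
    return false;
  }
};
```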

### Command reordering

Commands sent to the GPU stack can be executed in a different order on the GPU. This is done to reduce pipeline recompilations and resource layout transitions.

Depending on the backend, the responsibility lies in a different place. In OpenGL this is a driver responsibility: the application isn't aware of the reordering and cannot influence it directly. In Metal it is also a driver responsibility, but the application can provide hints to influence it. In Vulkan it is solely the application's responsibility.

### Resource versions

With command reordering in mind it is important to track versions of resources: you don't want to reorder commands in a way that makes them use different contents of a resource.

Every time a pipeline (or CPU code) alters a resource, a new version is tracked. It is still the same resource; only the commands issued before the change are scoped and must not use the contents of the new version.

A pipeline barrier can be added before the commands to guard a resource between read and write actions. It is also used to transition the layout of a texture. These pipeline barriers need to know how resources will be used up to the point where the next barrier is added, so the GPU is only stalled when necessary.

### Backend implementation

Currently our APIs are limited to a single batch, and extra logic is required to fulfill the requirements of modern GL-APIs.

Before the commands are sent to the GL-API, they are recorded in an intermediate buffer. When the intermediate buffer is sent to the GPU (via a flush/finish or another event), it is analyzed to reorder commands and to generate the correct pipeline barriers.

The reordering and pipeline barriers may well turn out the same from frame to frame, but due to the granularity of the API the GPU backend doesn't know what it is actually drawing/computing, so the analysis has to be redone every time.

## Other software

How do game engines and other GPU frameworks solve this?

### WebGPU/wgpu

WebGPU is a standard that provides low-level access to GPU devices on the web; wgpu is a widely used implementation of this standard. The API is designed so that the developer creates a flow between pipelines and points out how resources are used between them. These pipeline flows are called RenderPipelines and ComputePipelines. I use the term "pipeline flow" to avoid confusing Vulkan and Metal developers with the terms graphics pipeline and compute pipeline.

The implementation can extract and cache pipeline barriers with each pipeline flow. The next time a pipeline flow is used, the resource handles are updated and the already-extracted commands are submitted to the GL-API.

> NOTE: The generated pipeline barriers are not optimized for performance, but rather for clarity. They stall the GPU more than actually needed. The reason is that wgpu only optimizes resource usage within a flow, so it can reuse the prerecorded commands.

### Godot

Godot has an implementation similar to ours. They have their own GPU API, which is also accessible to game developers. The API is shader-based. After reaching out to them about what their target API would be, they responded that they are also inspired by WebGPU; if WebGPU had been defined before they developed their API, they might have used it.

### AAA game engines.

Since 2017 a lot of presentations about optimization have been given at GDC and other conferences. Most APIs I have seen are based on an approach similar to WebGPU, but with resource tracking across pipeline flows.

This approach goes by multiple names, but most often it is called a render graph. The render graph contains nodes; each node has a list of relations with its resources. Depending on the implementation, a single node can contain a single compute pipeline or a flow of compute pipelines, and similarly for render (graphics) nodes.

Some frameworks (nicebyte) try to do something smart so they don't need a render graph API: they use an approach similar to what the Vulkan synchronization validation layer does. However, there are limitations to when this can be used; these include multi-threaded drawing.

One difference between Blender and these frameworks is how drawing and threading are organized. Games typically use one main drawing thread and a small number of helper threads; the helper threads are often used for data transfers and for compute passes that update textures or the scene (physics). Blender, however, can have multiple drawing threads, for example when performing background rendering, baking, or GPU compositing. Each thread has its own context, but they eventually submit to the same GPU queue.

References:

- [Render Graph 101](https://blog.traverseresearch.nl/render-graph-101-f42646255636)
- [Godot](https://youtu.be/j1SH1gL7E6A?si=YR8MOltlmsCNI9Uq) / [slides](https://vulkan.org/user/pages/09.events/vulkanised-2024/vulkanised-2024-clay-john-godot.pdf)
- [Vulkan Synchronization Made Easy](https://www.youtube.com/watch?v=d15RXWp1Rqo) / [slides](https://vulkan.org/user/pages/09.events/vulkanised-2024/vulkanised-2024-grigory-dzhavadyan.pdf)
- [WebGPU](https://youtu.be/SH0N4QmioUw?si=bMw4djgWzuJ9ukH-) / [slides](https://vulkan.org/user/pages/09.events/vulkanised-2024/vulkanised-2024-albin-bernhardsson-arm.pdf)
- Add GDC presentations as well.

## Target API

Our target should be to make the render graph (currently part of the draw manager) the only API. Usages of the other APIs should be migrated to the render graph API.

This API should give the API user more clarity about what is actually needed. The GPU backend also gets more context about what the API user is doing, can make better decisions, and can cache those decisions for the next time to reduce CPU cycles.

> NOTE: My recommendation is to keep the impact of this API in mind when continuing the Vulkan backend. I don't think it is realistic to come up with an API at this moment; the impact on the current code base is too large (the API should be able to support all these changes). Another recommendation is that after removing OpenGL we should prototype, which would lead to a clearer understanding of the needs of this API.

```mermaid
classDiagram
  namespace UI {
    class Editors
  }
  namespace Python {
    class PyGPU
  }
  namespace GPU {
    class GPURenderGraph
  }
  namespace Draw {
    class EEVEE
  }
  namespace GL {
    class Vulkan
    class Metal
  }
  <<Interface>> GPURenderGraph

  Editors --> GPURenderGraph
  PyGPU --> GPURenderGraph
  EEVEE --> GPURenderGraph
  GPURenderGraph --> Vulkan
  GPURenderGraph --> Metal
```

Although the details of the API aren't clear, it is clear that there are several stages when using it. The code examples here don't represent the final API and should only be read as a guide; for now I kept them as close as possible to the current draw manager API.

All details are open for discussion.

### Defining a render graph node.

```cpp
void OutlinePass::init()
{
  if (render_node_info_.is_initialized()) {
    return;
  }

  Pass &pass = render_node_info_.new_pass();
  pass.state_set(DRW_STATE_WRITE_COLOR | DRW_STATE_BLEND_ALPHA_PREMUL);
  pass.shader_set(ShaderCache::get().outline.get());
  pass.draw_procedural(GPU_PRIM_TRIS, 1, 3);
}
```

This defines a template for a render graph node. A render graph node can have multiple passes and multiple draw commands per pass.

> NOTE: Some passes, for example materials, are too complex to cache, as the draw commands, resource bindings, and even parameters change too often. I assume these will be reset every time.

### Syncing a render graph node.

When syncing the resources, `render_node_info_` can be used to initialize an instance where the `render_node_` and resources are linked.

```cpp
void OutlinePass::sync(SceneResources &resources)
{
  render_node_.init(render_node_info_);
  render_node_.bind_ubo("world_data", resources.world_buf);
  render_node_.bind_texture("objectIdBuffer", &resources.object_id_tx);
}
```

When first used, the `render_node_info_` can be analyzed by the GPU backend, creating a list of commands to send to the GPU. These commands wouldn't contain any references to the actual resources; names or IDs could be used inside the list of commands.

After the `render_node_` is initialized it contains a prepared `render_node_info_`. Resources can then be added; they are stored beside the `render_node_info_`. The resources and the `render_node_info_` are merged later in the drawing process.
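A sketch of what "commands without references to the actual resources" could look like (all types and names hypothetical): the prepared node stores bind slots by name, the instance carries a name-to-handle table filled during sync, and the two are only merged at submission.

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Prepared once: commands reference resources only by name.
struct NodeTemplate {
  std::vector<std::string> bind_slots;  // e.g. "world_data", "objectIdBuffer"
};

// Per-instance: the resources bound during sync(), stored beside the template.
struct NodeInstance {
  const NodeTemplate *tmpl = nullptr;
  std::map<std::string, uint64_t> bound;  // name -> GPU handle

  void bind(const std::string &name, uint64_t handle) { bound[name] = handle; }
};

// At submission the template and the bound resources are merged into
// concrete bind commands, in the order the template expects.
std::vector<uint64_t> resolve_bindings(const NodeInstance &node)
{
  std::vector<uint64_t> handles;
  for (const std::string &slot : node.tmpl->bind_slots) {
    auto it = node.bound.find(slot);
    if (it == node.bound.end()) {
      throw std::runtime_error("unbound slot: " + slot);
    }
    handles.push_back(it->second);
  }
  return handles;
}
```

Because the template never stores handles, the backend can cache its analysis of the command list and only patch in the current handles each frame.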

### Submitting a render graph node

When it is decided to draw a node, the node is sent to the GPU backend.

```cpp
void OutlinePass::draw(Manager &manager)
{
  manager.add_node(render_node_);
}
```

The GPU backend adds the node to the render graph of the current context. No GPU commands are sent during this phase, hence the name `add_node`.

### Context render graph submission.

Eventually the GPU backend sends the commands to the GPU. Just before this happens, the nodes are reordered to reduce pipeline recompilations. Once the order is known, the resources can be merged with the commands, and the pipeline barriers (already part of the command list) can be updated to use the actual state of each resource.

Because submission happens later in the process, we have more information about how a specific resource version is used, which can lead to better pipeline barriers.

The commands can then be recorded into a GL-API-specific command buffer and submitted to the device queue.
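The reordering step before submission could be as simple as a stable sort on a pipeline key, sketched below with hypothetical types. Nodes using the same pipeline end up adjacent, so the pipeline is bound once per group instead of once per node; a real implementation must additionally respect resource dependencies between nodes, which this sketch ignores.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct Node {
  std::string pipeline;  // stand-in for the full pipeline key
  int id;
};

// Group nodes by pipeline so each pipeline is bound once per group.
// Assumes the nodes are independent; dependent nodes must keep their order.
void reorder_for_fewer_switches(std::vector<Node> &nodes)
{
  std::stable_sort(nodes.begin(), nodes.end(), [](const Node &a, const Node &b) {
    return a.pipeline < b.pipeline;
  });
}

// Count how many times the bound pipeline changes while walking the list.
int pipeline_switches(const std::vector<Node> &nodes)
{
  int switches = 0;
  for (std::size_t i = 1; i < nodes.size(); i++) {
    if (nodes[i].pipeline != nodes[i - 1].pipeline) switches++;
  }
  return switches;
}
```

Using `std::stable_sort` keeps the relative order of nodes that share a pipeline, which matters when those nodes read and write the same resource versions.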

## Project phasing

How do we get from the current state to the target state?

## Step 1: Vulkan application responsibility

In OpenGL and Metal, a full render graph (or part of one) is a driver-level responsibility. In Vulkan, the application is fully responsible for providing the correct calls.

Our Vulkan backend doesn't have a render graph yet, and it lacks performance and correct resource synchronization. In early 2024, research and experiments were performed on how to solve this. Due to our threading model, a low-level render graph would be a good solution.

This render graph would not be a replacement for the draw manager, but it would be able to translate the GPUBatch and GPUImmediate mode APIs into a render graph in order to create the correct list of commands to send to the GPU.

```mermaid
classDiagram
  namespace GPUVulkan {
    class VKRenderGraph
    class VKBatch
    class VKImmediate
  }
  namespace GL {
    class Vulkan
  }

  VKBatch --> VKRenderGraph
  VKImmediate --> VKRenderGraph
  VKRenderGraph --> Vulkan
```

A prototype is available in #118963.

Instead of having only a ComputeNode and a GraphicsNode, it also contains many nodes that are specific to the GPUBatch/GPUImmediate APIs. Implementation and design details are in the tasks mentioned above.

The reasons to prioritize the Vulkan-specific render graph over the GPU render graph:

- The detailed API for the GPU render graph can be extracted better once a backend exists that has a render graph at its core. The current draw manager API was designed around best practices for improving performance on the OpenGL backend, and some parts of it don't map nicely to Vulkan (or Metal).
- Vulkan is needed for platform support: some platforms don't work with OpenGL but will with Vulkan.
- Vulkan is needed to reduce shader compilation times in EEVEE-Next.
- Vulkan is needed because some GPU features are not available in OpenGL at all.

## Step 2: GPU RenderGraph API

After phasing out OpenGL, and based on the VKRenderGraph, we can design a GPURenderGraph. Phasing out OpenGL is not a hard requirement, but it would reduce the amount of work.

Using test cases we can validate that the Vulkan and Metal render graph implementations work correctly.

```mermaid
classDiagram
  namespace UI {
    class SpaceDraw
  }
  namespace Python {
    class PyGPU
  }
  namespace Draw {
    class DrawManager
  }
  namespace GPU {
    class GPUBatch
    class GPUImmediate
    class GPURenderGraph
  }
  namespace GPUMetal {
    class MTLRenderGraph
  }
  namespace GPUVulkan {
    class VKRenderGraph
  }

  SpaceDraw --> GPUBatch
  SpaceDraw --> GPUImmediate
  PyGPU --> GPUBatch
  DrawManager --> GPUBatch
  GPURenderGraph <|-- VKRenderGraph
  GPURenderGraph <|-- MTLRenderGraph
```

## Step 3: API migration

Step 3 is a migration process where the usages of the GPUBatch and GPUImmediate mode APIs are migrated to the GPURenderGraph. The order of this process is open for discussion and depends on the needs at the moment we start the migration.

The idea is to keep all APIs working until the whole code base is migrated to the render graph approach.

The order described here is to first migrate GPUBatch calls inside the editors, as these are not as numerous as GPUImmediate calls or as complex as the DrawManager. After gaining some experience, we can plan the other migrations better.

### Step 3a: API migration

Migrate `GPUBatch` usage in editor code to `GPURenderGraph`.

```mermaid
classDiagram
  namespace UI {
    class SpaceDraw
  }
  namespace Python {
    class PyGPU
  }
  namespace Draw {
    class DrawManager
  }
  namespace GPU {
    class GPUBatch
    class GPUImmediate
    class GPURenderGraph
  }
  namespace GPUMetal {
    class MTLRenderGraph
  }
  namespace GPUVulkan {
    class VKRenderGraph
  }

  SpaceDraw --> GPURenderGraph
  SpaceDraw --> GPUImmediate
  PyGPU --> GPUBatch
  DrawManager --> GPUBatch
  GPURenderGraph <|-- VKRenderGraph
  GPURenderGraph <|-- MTLRenderGraph
```

### Step 3b: API migration

Migrate the `DrawManager` to `GPURenderGraph`. This most likely brings the biggest benefits, as the draw manager also stores a list of commands. That list would then be joined with the render graph, and the GPU backend would no longer need to reverse-engineer all the information it needs.

```mermaid
classDiagram
  namespace UI {
    class SpaceDraw
  }
  namespace Python {
    class PyGPU
  }
  namespace Draw {
    class DrawManager
  }
  namespace GPU {
    class GPUBatch
    class GPUImmediate
    class GPURenderGraph
  }
  namespace GPUMetal {
    class MTLRenderGraph
  }
  namespace GPUVulkan {
    class VKRenderGraph
  }

  SpaceDraw --> GPURenderGraph
  SpaceDraw --> GPUImmediate
  PyGPU --> GPUBatch
  DrawManager --> GPURenderGraph
  GPURenderGraph <|-- VKRenderGraph
  GPURenderGraph <|-- MTLRenderGraph
```

### Step 3c: API migration

Python API migration. The current Python API design has issues, as it is extracted from the internal API. Users are requesting access to features that don't fit it well, so there is value in implementing this.

```mermaid
classDiagram
  namespace UI {
    class SpaceDraw
  }
  namespace Python {
    class PyGPU
  }
  namespace Draw {
    class DrawManager
  }
  namespace GPU {
    class GPUBatch
    class GPUImmediate
    class GPURenderGraph
  }
  namespace GPUMetal {
    class MTLRenderGraph
  }
  namespace GPUVulkan {
    class VKRenderGraph
  }

  SpaceDraw --> GPURenderGraph
  SpaceDraw --> GPUImmediate
  PyGPU --> GPURenderGraph
  DrawManager --> GPURenderGraph
  GPURenderGraph <|-- VKRenderGraph
  GPURenderGraph <|-- MTLRenderGraph
```

### Step 3d: API migration

I doubt that the benefit versus development effort of this phase is acceptable, as there is a lot of code that needs to be refactored. Users would eventually get a more fluent UI, and developers would have a smaller code base to maintain.

```mermaid
classDiagram
  namespace UI {
    class SpaceDraw
  }
  namespace Python {
    class PyGPU
  }
  namespace Draw {
    class DrawManager
  }
  namespace GPU {
    class GPURenderGraph
  }
  namespace GPUMetal {
    class MTLRenderGraph
  }
  namespace GPUVulkan {
    class VKRenderGraph
  }

  SpaceDraw --> GPURenderGraph
  PyGPU --> GPURenderGraph
  DrawManager --> GPURenderGraph
  GPURenderGraph <|-- VKRenderGraph
  GPURenderGraph <|-- MTLRenderGraph
```

## Risks

- The development will span multiple years. Priorities will change, which could leave us in an unfinished state; we are currently in such a state with immediate mode drawing.
- The impact is huge, as it touches all existing drawing code in the editors, the draw manager, and Python add-ons.
- Most likely only step 1, step 2, and step 3b will be executed, resulting in parts of the draw manager moving to the GPURenderGraph while the GPUBatch/GPUImmediate APIs still exist.
- The API at the function level isn't clear. I recommend some prototyping in order to make it clearer. Designing the API purely from feedback from the Vulkan project isn't realistic, as it needs to work on other backends as well. We should keep an API like this in mind while developing the Vulkan backend.
# GPU API Redesign ## Management info (TLDR) A feedback I received from the Vulkan Render graph design, my own analysis and external expertise is that in a future we require a render graph on GPU level which replaces/integrates parts of the draw module. At this time it is unclear how that API will look like, but on high level the requirements of such API can be described. This design task will elaborate on how we got to these requirements and how these requirements could potential influence the GPU and its users in its whole. The outcome of this task is to continue with a vulkan specific render graph, but take into account that in 1 year this could lead to a GPU API. This new API will exist besides the other API (GPUBatch, GPUImmediate) due to large impact to fully replace them. In a realistic situation the draw manager and Python GPU module would be fully work on the GPU rendergraph API. Prototyping is needed to design the detailed API. ## References - #118330 - Vulkan Rendergraph design to solve synchronization issues. - #118963 - Prototype of a Vulkan render graph implementation. ## Current state Currently we have several APIs related to GPU drawing. ```mermaid classDiagram namespace UI { class Editors } namespace Python { class PyGPU } namespace GPU { class GPUBatch class GPUImmediate } namespace Draw { class DrawManager } <<Interface>> GPUBatch <<Interface>> GPUImmediate <<Interface>> DrawManager Editors --> GPUBatch Editors --> GPUImmediate Editors --> DrawManager PyGPU --> GPUBatch DrawManager --> GPUBatch ``` ### GPUImmediate `GPUImmediate` is a compatibility layer to use OpenGL pre-core programming model. It was introduced during Blender 2.8 when we switched to the OpenGL core profile. The API should be deprecated and it usages should be replaced by GPUBatch. However due to the understandability of this API to non GPU developers and not promoting that this API is deprecated it is actually the most commonly used API concerning UI/Editor drawing. 
Draw backs of this API is that geometry/data aren't kept on the GPU and data needs to be resent each time. ### GPUBatch `GPUBatch` uses a shader based approach. Where geometry batches are created and uploaded and when the geometry doesn't change it can be reused by other shaders or for the next frame. Due to the ability to prepare geometry batches it is faster then GPUImmediate mode. ### DrawManager `DrawManager` is an API on top of `GPUBatch` that adds a programming model for high performance rendering. It is typically used to draw the 3D viewport where performance matters. The API uses some best practices on how to order draw commands to reduce context switches. ## Limitations Current APIs have a disconnect with how modern GL APIs are structured that leads to non optimal performance and complex translation code in our backends. ### Pipelines Modern GL-APIs (Vulkan/Metal) provides accessibility to pipelines. A pipeline is a configuration of the GPU when performing drawing or compute commands. Most change to the pipeline configuration can trigger a recompilation of the pipeline, including a recompilation of the shader. This can already happen by changing the blend mode using `GPU_blend` of using geometry with just a different layout. ### Texture layouts Pixels of textures that are used in a pipeline must be in a specific layout. The layout depends on how the texture is used inside the pipeline. There are different layouts for - `TRANSFER_READ/WRITE` - `SHADER_READ/WRITE` - `ATTACHMENT` - `PRESENT` When transferring a texture to a new layout, the previous layout needs to be provided as well. Changing a layout can be done by providing so called pipeline barriers. A pipeline can alter the layout of a whole texture, but also a single layer or LOD level. In this case a texture can have a mixed layout. More information about pipeline barriers will be provided in the section about resource versions. 
### Command reordering Commands that are send to the GPU stack can be executed in a different order on the GPU. This is done to reduce pipeline recompilation and resource layout transitions. Depending on the backend the responsibility can be somewhere different. In OpenGL this is a driver responsibility and the application isn't aware and cannot influence the reordering directly. In Metal this is also a driver responsibility, but the application can provide hints to influence the reordering. In Vulkan this is the sole responsibility of the Application. ### Resource versions With the command reordering in mind it is important to track versions of resources. You don't want to reorder the commands in a way that it uses a different content of the resource. Every time a pipeline (or CPU code) alters a resource a new version will be tracked. It is the same resource, only the commands before the change are scoped and must not use the content of new resource. A pipeline barrier can be added before the commands to guard a resource between read and write actions. It also is used to transform the layout of a texture. These pipeline barriers also need to know how resources are going to be used until the next pipeline will be added so it only locks the GPU when it is needed. ### Backend implementation Currently our APIs are limited to a single batch and logic is required to fulfill the requirements of modern GL-APIs. Before sending the commands to the GL-API the actual commands are recorded in an intermediate buffer. When the intermediate buffer is send to the GPU (via a flush/finish, or other event) the intermediate buffer is analyzed to reorder commands and to generate the correct pipeline barriers. It could be that the reordering and pipeline barriers will be the same when looking over frames, but due to the API granularity level the GPU backend doesn't know what it is actually drawing/computing. ## Other software How do game engines and other GPU frameworks solve this? 
### WebGPU/wgpu WebGPU is a standard to provide low level access to GPU devices on the Web. wgpu is an widely used implementation of this standard. The API is designed in such way that the developer can create a flow between pipelines and point out how resources are used between them. These pipeline flows are called RenderPipelines and ComputePipelines. I used the term pipeline flow as to not confuse vulkan and metal developers with with graphics pipeline and compute pipeline. The implementation can extract and cache pipeline barriers with each pipeline flow. The next time a pipeline flow is used the resource handles are updated and the already extracted commands are submitted to the GL-API > NOTE: the pipeline barriers created are not optimized for performance, but rather for clarity. It generates barriers that are stalling the GPU more than actually needed. Reason is that it only tries to optimize resource usages within a flow o it can reuse the prerecorded commands. ### Godot Godot has a similar implementation as we do. They have their own GPU API which is also accessible to game developers. The API is shader based. After reaching out to them about what they think would be their target API they responded that they are also inspired by WebGPU. If that WebGPU was defined before they developed their API they might have used it. ### AAA game engines. Since 2017 there is a lot of presentations done at GDC and other conferences about optimization. Most APIs I have seen are based on a similar approach as WebGPU, but with resource tracking across pipeline flows. This approach has multiple names, but more often it is called a render graph. The render graph contains nodes. Each node has a list of relations with its resources. Depending on the implementation a single node can contain a single compute pipeline or a flow of compute pipelines. Similar to render nodes (graphics). Some framework (nicebyte) try to do something smart so they don't need to add a render graph API. 
They use something similar as vulkan synchronization validation layer does. However there are some limitations to when this can be used. These limitations include multi threaded drawing. Some differences between Blender and these framework is how drawing and threading is organized. Games typically use one main drawing thread and can have a small number of helper threads. The helper threads are often used for data transfers and compute passes to update textures or the scene (physics). Blender however can have multiple drawing threads for example when performing background rendering, baking or GPU compositing. Each thread has its own context, but eventually submit to the same GPU queue. References: - [Render Graph 101](https://blog.traverseresearch.nl/render-graph-101-f42646255636) - [Godot](https://youtu.be/j1SH1gL7E6A?si=YR8MOltlmsCNI9Uq) / [slides](https://vulkan.org/user/pages/09.events/vulkanised-2024/vulkanised-2024-clay-john-godot.pdf) - [Vulkan Synchronization Made Easy](https://www.youtube.com/watch?v=d15RXWp1Rqo) / [slides](https://vulkan.org/user/pages/09.events/vulkanised-2024/vulkanised-2024-grigory-dzhavadyan.pdf) - [WebGPU](https://youtu.be/SH0N4QmioUw?si=bMw4djgWzuJ9ukH-) / [slides](https://vulkan.org/user/pages/09.events/vulkanised-2024/vulkanised-2024-albin-bernhardsson-arm.pdf) - Add GDC presentations as well. ## Target API Our target should be to move the render graph (currently part of the draw manager) as its only API. Usages of other APIs should be migrated to the render graph API. This API should give the API-user more clarity of what is actually needed. The GPU Backend also gets more context of what the API-user is doing and make better decisions. Also being able to cache decisions for the next time to reduce CPU cycles. > NOTE: My recommendation is to keep the impact of this API in mind when continuing the Vulkan backend. I don't think it is realistic to come up with an API at this moment. 
The impact to the current code base is to large (the API should be able to support all these changes). Another recommendation is after removing OpenGL we should prototype which would lead to a clearer understanding about the needs of this API. ```mermaid classDiagram namespace UI { class Editors } namespace Python { class PyGPU } namespace GPU { class GPURenderGraph } namespace Draw { class EEVEE } namespace GL { class Vulkan class Metal } <<Interface>> GPURenderGraph Editors --> GPURenderGraph PyGPU --> GPURenderGraph EEVEE --> GPURenderGraph GPURenderGraph --> Vulkan GPURenderGraph --> Metal ``` Although the details of the API isn't clear, it is clear that there are several stages when using the API. All code examples here don't represent the final API and should only be read as guide. For now I kept as close to the current Draw manager API. All details are open for discussion. ### Defining a render graph node. ```cpp void OutlinePass::init() { if (render_node_info_.is_initialized()) { return; } Pass &pass = render_node_info_.new_pass(); pass.state_set(DRW_STATE_WRITE_COLOR | DRW_STATE_BLEND_ALPHA_PREMUL); pass.shader_set(ShaderCache::get().outline.get()); pass.draw_procedural(GPU_PRIM_TRIS, 1, 3); } ``` Defines a template for a render graph node. A render graph nodes can have multiple passes and multiple draw commands per pass. > NOTE: Some passes for example materials are too complex to cache as the drawing commands and resource bindings and even parameter change to often. I would assume these will be reset every time. ### Syncing a render graph node. When syncing the resources the render_node_info_ can be used to initialize an instance where the render_node_ and resources are linked. 
```cpp void OutlinePass::sync(SceneResources &resources) { render_node_.init(render_node_info_); render_node_.bind_ubo("world_data", resources.world_buf); render_node_.bind_texture("objectIdBuffer", &resources.object_id_tx); } ``` When first used the `render_node_info_` can be analyzed by the GPU backend creating a list of commands that are needed to send to the GPU. These commands wouldn't contain any references to the actual resources. Names or ids could be used inside the list of commands. After the `render_node_` is initialized it contains a prepared `render_node_info_`. Resources can be added; the added resources will be stored beside the `render_node_info_`. The resources and `render_node_info_` will be merged later on in the drawing process. ### Submitting a render graph node When it is decided to draw a node the node is sent to the GPU backend. ```cpp void OutlinePass::draw(Manager &manager) { manager.add_node(render_node_); } ``` The GPU backend adds the node to the render graph of the current context. No GPU commands are sent during this phase hence the name `add_node`. ### Context render graph submission. Eventually the GPU Backend will send the commands to the GPU. Just before this happens the nodes are reordered to reduce pipeline recompilations. After the order is known the resources can be merged with the commands and the pipeline barriers (already part of the command list) can be updated to use the actual state of the resource. As the submission happens later in the process we have more information how a specific resource version is used and that can lead to better pipeline barriers. The commands can be recorded into a GL-API specific command buffer and submitted to the device queue. ## Project phasing How do we get from the current state to the target state? ## Step 1: Vulkan application responsibility OpenGL and Metal both have a full render graph or part of it as driver level responsibility. 
For Vulkan, the application is fully responsible for providing the correct calls. Our Vulkan backend doesn't have a render graph, and lacks performance and correct resource synchronization. At the beginning of 2024, research and experiments were performed on how to solve this. Due to our threading model, a low-level render graph would be a good solution. This render graph would not be a replacement for the draw manager, but would be able to translate the GPUBatch and GPUImmediate mode APIs into a render graph to create the correct list of commands to send to the GPU.

```mermaid
classDiagram
  namespace GPUVulkan {
    class VKRenderGraph
    class VKBatch
    class VKImmediate
  }
  namespace GL {
    class Vulkan
  }

  VKBatch --> VKRenderGraph
  VKImmediate --> VKRenderGraph
  VKRenderGraph --> Vulkan
```

A prototype is available in https://projects.blender.org/blender/blender/pulls/118963. Instead of having only a ComputeNode and a GraphicsNode, it also contains many nodes that are specific to the GPUBatch/GPUImmediate APIs. Implementation and design details are inside the mentioned tasks.

The reasons to prioritize the Vulkan-specific render graph before the GPU render graph:

- A detailed API for the GPU render graph can be extracted better when a backend is implemented that has a render graph at its core. The current draw manager API was added based on best practices and improving performance on the OpenGL backend. It currently has some parts that don't map nicely to Vulkan (or Metal).
- Vulkan is needed for platform support. Some platforms don't work on OpenGL, but will work when using Vulkan.
- Vulkan is needed to reduce shader compilation times in EEVEE-Next.
- Vulkan is needed as some GPU features are not available in OpenGL at all.

## Step 2: GPU RenderGraph API

After phasing out OpenGL, and based on the VKRenderGraph, we can design a GPURenderGraph. Phasing out OpenGL is not a hard requirement, but it would reduce the amount of work. Using test cases we can validate the correct working of the Vulkan and Metal render graph implementations.
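The backend split of Step 2 could be sketched as an abstract interface with backend-specific implementations. This is a minimal sketch only; none of these class or method names exist in the code base yet, and the real API would come out of the prototyping phase.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

/* Hypothetical sketch of the Step 2 layering: a backend-agnostic
 * GPURenderGraph base class with a Vulkan implementation below it
 * (a Metal one would mirror it). Illustrative names only. */

struct RenderNode {
  std::string name;
};

class GPURenderGraph {
 public:
  virtual ~GPURenderGraph() = default;

  /* Backend-independent part: collect nodes for the current context.
   * No GPU commands are emitted here. */
  void add_node(RenderNode node)
  {
    nodes_.push_back(std::move(node));
  }

  /* Backend-specific part: record and submit the collected nodes. */
  virtual void submit() = 0;

 protected:
  std::vector<RenderNode> nodes_;
};

class VKRenderGraph : public GPURenderGraph {
 public:
  void submit() override
  {
    /* A real implementation would record into a VkCommandBuffer and
     * submit it to the device queue; here we only track the count. */
    submitted_count = nodes_.size();
    nodes_.clear();
  }
  size_t submitted_count = 0;
};
```

Callers would only ever see the `GPURenderGraph` interface; which backend is instantiated stays an internal decision of the GPU module.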
```mermaid
classDiagram
  namespace UI {
    class SpaceDraw
  }
  namespace Python {
    class PyGPU
  }
  namespace Draw {
    class DrawManager
  }
  namespace GPU {
    class GPUBatch
    class GPUImmediate
    class GPURenderGraph
  }
  namespace GPUMetal {
    class MTLRenderGraph
  }
  namespace GPUVulkan {
    class VKRenderGraph
  }

  SpaceDraw --> GPUBatch
  SpaceDraw --> GPUImmediate
  PyGPU --> GPUBatch
  DrawManager --> GPUBatch
  GPURenderGraph <|-- VKRenderGraph
  GPURenderGraph <|-- MTLRenderGraph
```

## Step 3: API migration

Step 3 is a migration process where the usages of the GPUBatch and GPUImmediate mode APIs are migrated to the GPURenderGraph. The order of this process can be discussed and depends on the needs at the moment we start the migration. The idea is to keep all APIs working until the whole code base is migrated to the render graph approach. The order described here is to first migrate the GPUBatch calls inside the editors, as these are not as numerous as GPUImmediate calls or as complex as the DrawManager. After gaining some experience we can plan the other migrations better.

### Step 3a: API migration

Replace `GPUBatch` usage in editor code with `GPURenderGraph`.

```mermaid
classDiagram
  namespace UI {
    class SpaceDraw
  }
  namespace Python {
    class PyGPU
  }
  namespace Draw {
    class DrawManager
  }
  namespace GPU {
    class GPUBatch
    class GPUImmediate
    class GPURenderGraph
  }
  namespace GPUMetal {
    class MTLRenderGraph
  }
  namespace GPUVulkan {
    class VKRenderGraph
  }

  SpaceDraw --> GPURenderGraph
  SpaceDraw --> GPUImmediate
  PyGPU --> GPURenderGraph
  DrawManager --> GPUBatch
  GPURenderGraph <|-- VKRenderGraph
  GPURenderGraph <|-- MTLRenderGraph
```

### Step 3b: API migration

Replace the `DrawManager` internals with `GPURenderGraph`. Most likely the largest benefits are here, as the draw manager also stores a list of commands. This list would then be joined with the render graph, so the GPU backend doesn't need to reverse-engineer all the information it needs.
```mermaid
classDiagram
  namespace UI {
    class SpaceDraw
  }
  namespace Python {
    class PyGPU
  }
  namespace Draw {
    class DrawManager
  }
  namespace GPU {
    class GPUBatch
    class GPUImmediate
    class GPURenderGraph
  }
  namespace GPUMetal {
    class MTLRenderGraph
  }
  namespace GPUVulkan {
    class VKRenderGraph
  }

  SpaceDraw --> GPURenderGraph
  SpaceDraw --> GPUImmediate
  PyGPU --> GPUBatch
  DrawManager --> GPURenderGraph
  GPURenderGraph <|-- VKRenderGraph
  GPURenderGraph <|-- MTLRenderGraph
```

### Step 3c: API migration

Python API migration. The current Python API design has issues, as it is extracted from the internal API, and users are requesting access to features that don't fit well. So there is value in implementing this.

```mermaid
classDiagram
  namespace UI {
    class SpaceDraw
  }
  namespace Python {
    class PyGPU
  }
  namespace Draw {
    class DrawManager
  }
  namespace GPU {
    class GPUBatch
    class GPUImmediate
    class GPURenderGraph
  }
  namespace GPUMetal {
    class MTLRenderGraph
  }
  namespace GPUVulkan {
    class VKRenderGraph
  }

  SpaceDraw --> GPURenderGraph
  SpaceDraw --> GPUImmediate
  PyGPU --> GPURenderGraph
  DrawManager --> GPURenderGraph
  GPURenderGraph <|-- VKRenderGraph
  GPURenderGraph <|-- MTLRenderGraph
```

### Step 3d: API migration

Migrate the remaining `GPUImmediate` usages to `GPURenderGraph`. I doubt the benefit versus development effort for this phase is acceptable: there is a lot of code that needs to be refactored. Users would eventually get a more fluent UI, and developers would need to maintain a smaller code base.

```mermaid
classDiagram
  namespace UI {
    class SpaceDraw
  }
  namespace Python {
    class PyGPU
  }
  namespace Draw {
    class DrawManager
  }
  namespace GPU {
    class GPURenderGraph
  }
  namespace GPUMetal {
    class MTLRenderGraph
  }
  namespace GPUVulkan {
    class VKRenderGraph
  }

  SpaceDraw --> GPURenderGraph
  PyGPU --> GPURenderGraph
  DrawManager --> GPURenderGraph
  GPURenderGraph <|-- VKRenderGraph
  GPURenderGraph <|-- MTLRenderGraph
```

## Risks

- The development will span multiple years. Priorities will change, which could leave us in an unfinished state.
  We are currently in such a state due to immediate mode drawing.
- The impact is huge, as it touches all existing drawing code in editors, the draw manager, and Python add-ons.
- Most likely only step 1, step 2, and step 3b will be executed, resulting in moving parts of the draw manager to the GPURenderGraph, but the GPUBatch/GPUImmediate APIs would still exist.
- The API at function level isn't clear. I would recommend performing some prototyping in order to make it clearer. Designing the API purely from the feedback of the Vulkan project isn't realistic, as the API needs to work on other backends as well. We should keep an API like this in mind when developing the Vulkan backend.
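As an illustration of the kind of reordering the context render graph submission step would perform, here is a minimal sketch. It assumes cross-pipeline reordering is safe (a real implementation would have to check resource dependencies first), and all names (`GraphNode`, `reorder_for_pipelines`, ...) are hypothetical:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

/* Hypothetical node as seen by the submission step. */
struct GraphNode {
  uint64_t pipeline_id; /* Pipeline this node's commands use. */
  int submission_order; /* Original order, kept for stable tie-breaking. */
};

/* Group nodes that share a pipeline while keeping the relative order of
 * nodes inside each group. The stable sort preserves correctness only
 * when reordering across pipelines is actually safe, which a real
 * implementation would verify via resource dependencies. */
static void reorder_for_pipelines(std::vector<GraphNode> &nodes)
{
  std::stable_sort(nodes.begin(), nodes.end(),
                   [](const GraphNode &a, const GraphNode &b) {
                     return a.pipeline_id < b.pipeline_id;
                   });
}

/* Count how often consecutive nodes require a pipeline switch. */
static int count_pipeline_switches(const std::vector<GraphNode> &nodes)
{
  int switches = 0;
  for (size_t i = 1; i < nodes.size(); i++) {
    if (nodes[i].pipeline_id != nodes[i - 1].pipeline_id) {
      switches++;
    }
  }
  return switches;
}
```

For an interleaved node list such as pipelines `1, 2, 1, 2`, reordering reduces the switch count from three to one while keeping each pipeline's nodes in their original relative order.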
Jeroen Bakker added the Type: Design label 2024-04-02 15:03:39 +02:00
Jeroen Bakker added this to the EEVEE & Viewport project 2024-04-02 15:03:44 +02:00
Jeroen Bakker self-assigned this 2024-04-02 15:22:36 +02:00
Jeroen Bakker changed title from GPU: API Redesign to GPU: API Redesign (high level) 2024-04-03 08:56:09 +02:00

Possibly naive comment from the outside (I might be talking complete nonsense, in which case just tell me to shut up :)):

  • My impression is that the reason why "big engines" are doing Render Graph types of designs, is for two reasons: 1) save VRAM by "kinda automatically" allowing various resources (mostly render targets) to alias each other in the same physical VRAM place, 2) semi-automatically allow things to be kept in sub-passes (i.e. "local tile memory") for complex postprocessing operations, but this one is only really relevant for mobile GPU architectures.

As in, "render graphs" are not meant for improving CPU performance, resource tracking, etc. etc. Their primary reason for existence (and all the "API user" complexity they bring!) is to have some system for figuring out, which parts of the frame can reuse the same memory that would be used by another part of the frame.

I'm not sure how much (or at all?) is that relevant for Blender's use case.

My impression would be along the lines of:

  • Blender probably does not need a render graph system,
  • It is not worth trying to do "command reordering" of any kind. It's a messy, complex thing and does not get you much, if anything.
  • Resource transitions are generally considered to be "a mistake" to expose in graphics APIs. Like, three people in the world can use them properly and get some benefits; everyone else gets them wrong or at best achieves the same result as if they were completely hidden inside the driver. As such, it is best not to expose them in any sort of API (be that C++ or Python). Internally, for APIs that need to handle them (i.e. Vulkan), do not try to be clever, just do the simplest thing that works (e.g. "set texture as render target -> transition to render target, transition previous render target texture out of render target" etc.).
  • Resource versioning does not need a render graph to work. If anything, without the "reordering" bits, resource versioning is way easier. Resource gets changed: if it is still being used by in-flight or not-submitted command buffers, it is copied and the caller gets a fresh copy. Resource deletion: put it into "pending deletions" list, at "end of frame" actually delete things in the list.
  • "Immediate mode" graphics is useful for building UIs, graphs, widgets and anything else that is not "the 3D scene". It is not a problem that the geometry gets created on the CPU and sent to the GPU every frame (you can easily send off several million polygons this way, every frame, without problems). The current problem with most of immediate mode drawing is the fact that it does one draw call for one quad, not the fact that the quad is created every frame.
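The versioning scheme suggested above (copy-on-write for in-flight resources, a pending-deletions list drained at end of frame) could be sketched as follows. All names (`Buffer`, `ResourcePool`, the frame counters) are hypothetical and only illustrate the idea:

```cpp
#include <algorithm>
#include <cstdint>
#include <memory>
#include <vector>

/* Hypothetical resource: payload plus the last frame whose submitted
 * command buffers reference it. */
struct Buffer {
  std::vector<uint8_t> data;
  int last_submitted_frame = -1;
};

class ResourcePool {
 public:
  int gpu_completed_frame = -1; /* Highest frame the GPU has finished. */

  /* Return a buffer that is safe to write: mutate in place if the GPU
   * is done with it, otherwise hand the caller a fresh copy and keep
   * the old version alive until its frame completes. */
  std::shared_ptr<Buffer> writable_copy(std::shared_ptr<Buffer> buf)
  {
    if (buf->last_submitted_frame <= gpu_completed_frame) {
      return buf; /* Not in flight; safe to mutate in place. */
    }
    auto copy = std::make_shared<Buffer>(*buf);
    pending_deletion_.push_back(std::move(buf));
    return copy;
  }

  /* End of frame: actually free the versions the GPU has finished with. */
  void end_of_frame()
  {
    pending_deletion_.erase(
        std::remove_if(pending_deletion_.begin(), pending_deletion_.end(),
                       [this](const std::shared_ptr<Buffer> &b) {
                         return b->last_submitted_frame <= gpu_completed_frame;
                       }),
        pending_deletion_.end());
  }

  size_t pending_count() const { return pending_deletion_.size(); }

 private:
  std::vector<std::shared_ptr<Buffer>> pending_deletion_;
};
```

Note that, as the comment points out, none of this needs reordering: the only bookkeeping is a frame counter per resource and a deferred-free list.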
Author / Member

Your insights are always helpful. Most of the time we can only fall back on papers and presentations to get these insights. When talking to game engine developers some months ago, their reason (from a Vulkan PoV) was synchronization.

In Blender's case we can have multiple CPU threads that use the same resources. These resources are device-specific and shared. (For this the render graph isn't needed, as the resources need to be guarded by a lock.)

When transitions happen you need to keep track of where the resource was used and how it will be used in the near future. Here it becomes a bit trickier. Versioning is used to generate 'optimal' barriers, and to validate that we didn't make a mistake.

Reordering is mostly needed where the Blender API isn't sufficient: reducing pipeline switches when data transfer/compute commands are done during drawing, and improving clear operations on render pass binding.

So I generally agree that a render graph as implemented by game engines isn't needed. I do believe that having a graph to track resources would lead to generating better barriers. The 'nodes' themselves can still be evaluated by a back-to-front iteration to populate destination usages and a front-to-back iteration to populate source usages. In the future we are planning to track resource usage per pipeline stage and reduce GPU resets when reading back buffers to the CPU. So yeah, we call it a render graph, but perhaps the implementation is just a list.

The complexity of our render graph is far less than the render graph you're mentioning. Nodes are evaluated in sequence. Selection and barrier extraction are done using the info in the graph. The draw manager already does most of the ordering; making sure that the draw manager API fits better on the GPU backend will reduce CPU cycles, which is the main benefit.
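The two-pass evaluation described above could be sketched like this: a front-to-back walk records each access's previous usage, and a back-to-front walk records its next usage, which together give the (source, destination) pair a pipeline barrier needs. All names (`Access`, `resolve_usages`, the `Usage` values) are illustrative, not the actual implementation:

```cpp
#include <cstddef>
#include <vector>

/* Hypothetical usage states a resource can be in. */
enum class Usage { None, TransferWrite, ShaderRead, ColorAttachment };

/* One access of a resource by a node. prev/next are filled in by the
 * two passes below. */
struct Access {
  int resource_id;
  Usage usage;
  Usage prev_usage = Usage::None;
  Usage next_usage = Usage::None;
};

struct Node {
  std::vector<Access> accesses;
};

static void resolve_usages(std::vector<Node> &nodes, int resource_count)
{
  /* Front-to-back pass: propagate the most recent (source) usage. */
  std::vector<Usage> last(resource_count, Usage::None);
  for (Node &node : nodes) {
    for (Access &a : node.accesses) {
      a.prev_usage = last[a.resource_id];
      last[a.resource_id] = a.usage;
    }
  }
  /* Back-to-front pass: propagate the upcoming (destination) usage. */
  std::vector<Usage> next(resource_count, Usage::None);
  for (size_t i = nodes.size(); i-- > 0;) {
    for (size_t j = nodes[i].accesses.size(); j-- > 0;) {
      Access &a = nodes[i].accesses[j];
      a.next_usage = next[a.resource_id];
      next[a.resource_id] = a.usage;
    }
  }
}
```

As noted, the "graph" here is just a list walked twice; no topological sorting or memory aliasing machinery is involved.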


Yeah I think I probably misunderstood most of this since "render graph" term is most commonly used to describe "a system that would allow me to save several hundred MB of video memory in a complex frame pipeline".

It is very likely that the "common wisdom" used in game engines does not apply (or applies very little) to Blender's use case. For example, most/all games do not have a setup where several "windows" can be rendered from different threads, all trying to access the GPU. And not many engines do multi-threaded draw call submission at all — especially now that many engines are moving towards a GPU-driven rendering pipeline, the CPU is not doing much work anymore, so there's little need for multi-threaded draw submission complexity.

Anyway, the threaded draw call submission in game engines (again, if they bother doing it at all) from what I've seen is much simpler than what you allude that Blender would need. So likely "some sort of other way" of achieving that within Blender would be needed. Maybe long high level locking is indeed the only sensible approach, who knows.
