Vulkan: Synchronization/Render Graphs #118330


Jeroen Bakker · 2024-02-15T16:12:23+01:00


State: Approved in April 2024

Vulkan Synchronization

Synchronization is at the core of Vulkan. Sadly, it is also very hard to get right. Even expert
Vulkan developers have never seen applications that do it correctly. Because these are mostly timing
issues, the errors do not result in visible artifacts and are never fixed (see no evil, hear no evil).

What is this synchronization issue?

The actual work on the GPU happens at a later time than when the CPU issues the command. The work
that happens on the GPU requires buffers and images to be in a certain state or layout. To give some
context, let's go over how the splash screen in Blender draws its picture.

The term synchronization refers to the application being responsible for the resource usage,
dependencies and layout transformations of the work that is executed on the GPU. GPUs are "extremely"
parallel and we need to feed the GPU enough parallelizable work so it can work efficiently.

In OpenGL most of this was transparent and dealt with by the OpenGL driver. In Metal this is partly a
driver responsibility (resource usage dependency) and partly an application responsibility (layouts &
usage hints). In Vulkan this responsibility shifted fully from the driver to the application.

Why this paradigm shift? An OpenGL driver has to deal with all possible use cases and is always
balancing safety against performance. Making synchronization an application responsibility allows
applications to fine-tune what is actually needed and reduce overhead compared to a driver implementation.

What are layout transformations?

When using images, GPUs can improve performance by reordering the pixels inside an image to make
better use of caches. When sampling an image inside a shader, a regular sequential storage of pixels
requires more cache lines than when the pixels are stored in tiles/blocks. Needing fewer cache lines
leaves room in the cache for other relevant data, which improves performance.
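
The effect can be illustrated with a little arithmetic. The sketch below is not how any particular
GPU stores images (the real layouts are vendor specific); it assumes a hypothetical 64 byte cache
line, 4 bytes per pixel and row-major tiles of 4x4 pixels, and counts how many cache lines a small
square read touches in each storage scheme. All names in it are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical parameters, chosen only to make the arithmetic easy to follow.
constexpr uint32_t kBytesPerPixel = 4;    // e.g. RGBA8
constexpr uint32_t kCacheLineBytes = 64;  // assumed cache line size
constexpr uint32_t kTileSize = 4;         // 4x4 pixels * 4 bytes = one cache line

// Byte offset of pixel (x, y) when rows are stored one after another.
inline uint64_t linear_offset(uint32_t x, uint32_t y, uint32_t width)
{
  return (uint64_t(y) * width + x) * kBytesPerPixel;
}

// Byte offset of pixel (x, y) when the image is stored as row-major 4x4 tiles.
inline uint64_t tiled_offset(uint32_t x, uint32_t y, uint32_t width)
{
  assert(width % kTileSize == 0);
  const uint64_t tiles_per_row = width / kTileSize;
  const uint64_t tile_index = (y / kTileSize) * tiles_per_row + (x / kTileSize);
  const uint64_t pixel_in_tile = (y % kTileSize) * kTileSize + (x % kTileSize);
  return (tile_index * kTileSize * kTileSize + pixel_in_tile) * kBytesPerPixel;
}

// Number of distinct cache lines touched when reading a size x size block of pixels.
template<typename OffsetFn>
size_t cache_lines_touched(
    OffsetFn offset, uint32_t x0, uint32_t y0, uint32_t size, uint32_t width)
{
  std::set<uint64_t> lines;
  for (uint32_t y = y0; y < y0 + size; y++) {
    for (uint32_t x = x0; x < x0 + size; x++) {
      lines.insert(offset(x, y, width) / kCacheLineBytes);
    }
  }
  return lines.size();
}
```

With these assumed numbers, a tile-aligned 4x4 read in a 256 pixel wide image touches four cache
lines in the linear scheme (one per row) and a single cache line in the tiled scheme.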

The actual pixel order can differ based on the GPU and on how the image is used. Different layouts
exist for copying data (data transfer), sampling, storage or when the texture is used as a framebuffer.
The order of the pixels is hidden from the user for IP reasons.

Example of synchronization

On the CPU side a temporary buffer is created and filled with pixels. This is sent to the GPU module
to construct a texture from it. This texture is drawn using a shader to an offscreen texture.
The layout of the texture while the pixels are uploaded differs from the layout needed by the shader.
Note that the pixels on the CPU side are already freed before the actual drawing happens. Even
the GPUTexture is freed before the GPU has started drawing.

* Uploading pixels requires the image to be in the transfer destination layout.
* The shader requires the image to be in the shader read layout.
* The shader can only start executing when the image is in the shader read layout.
* Framebuffer textures need to be in the shader write state.
* The image can only transition to the shader read layout when all pixels have been uploaded to the GPU.
* Uploading pixels can only happen when the image is in the transfer destination layout and the
  staging buffer is in the transfer source layout.
* Staging buffer can only

And this is just a simple example; other elements like parameters and framebuffer attachments have
been ignored.


stateDiagram
    ChangeStagingLayoutToTransferDestination --> CopyPixelsFromCPUToStaging
    CopyPixelsFromCPUToStaging --> ChangeStagingLayoutToTransferSource
    ChangeStagingLayoutToTransferSource --> CopyPixelsFromStagingToImage
    ChangeImageLayoutToTransferDestination --> CopyPixelsFromStagingToImage
    CopyPixelsFromStagingToImage --> ChangeImageLayoutToShaderRead
    ChangeImageLayoutToShaderRead --> Shader
    Shader--> [*]

Each action can only happen when its previous actions have been completed, and resources can only be
freed after the drawing has finished.
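
To make the ordering constraint concrete, the dependencies of the state diagram above can be written
down as pairs and sorted topologically. This is only an illustration of the constraint, not code from
Blender; the function names are made up, and Kahn's algorithm is used because it is the simplest
way to get a valid order.

```cpp
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// {step that must finish first, step that may start afterwards}
using Edge = std::pair<std::string, std::string>;

// The dependencies from the state diagram above.
inline std::vector<Edge> upload_example_edges()
{
  return {
      {"ChangeStagingLayoutToTransferDestination", "CopyPixelsFromCPUToStaging"},
      {"CopyPixelsFromCPUToStaging", "ChangeStagingLayoutToTransferSource"},
      {"ChangeStagingLayoutToTransferSource", "CopyPixelsFromStagingToImage"},
      {"ChangeImageLayoutToTransferDestination", "CopyPixelsFromStagingToImage"},
      {"CopyPixelsFromStagingToImage", "ChangeImageLayoutToShaderRead"},
      {"ChangeImageLayoutToShaderRead", "Shader"},
  };
}

// Kahn's algorithm: every step is emitted only after all of its prerequisites.
// Returns an empty vector when the dependencies contain a cycle.
inline std::vector<std::string> execution_order(const std::vector<Edge> &edges)
{
  std::map<std::string, std::vector<std::string>> successors;
  std::map<std::string, int> unfinished_prerequisites;
  for (const Edge &edge : edges) {
    successors[edge.first].push_back(edge.second);
    unfinished_prerequisites[edge.second]++;
    unfinished_prerequisites.emplace(edge.first, 0);  // No-op when already present.
  }

  std::queue<std::string> ready;
  for (const auto &item : unfinished_prerequisites) {
    if (item.second == 0) {
      ready.push(item.first);
    }
  }

  std::vector<std::string> order;
  while (!ready.empty()) {
    const std::string step = ready.front();
    ready.pop();
    order.push_back(step);
    for (const std::string &next : successors[step]) {
      if (--unfinished_prerequisites[next] == 0) {
        ready.push(next);
      }
    }
  }
  if (order.size() != unfinished_prerequisites.size()) {
    order.clear();  // Some steps never became ready: there is a cycle.
  }
  return order;
}
```

Note that the two layout changes at the start have no dependency on each other, so either may come
first; only the copy from staging to the image has to wait for both branches.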

How does the industry solve this?

There are three approaches that are commonly used to solve this issue. The first records barriers in
the command buffer where they are needed. The second tracks the state of each resource, and the third
uses a render graph. #120174 provides more information on how other software frameworks and game
engines solve this issue.

What requirements do we have that need to be supported by the chosen solution?

* A device can be used by multiple contexts (on different CPU threads). This is needed to improve
  final image rendering and the GPU compositor. Both run in a background thread with their own
  context, but share resources like input/output images, textures etc.
* Resources are used by multiple contexts.
* An image resource can only be in a single layout at a time, but can be used by multiple contexts.
* Resources can be freed by the application before the GPU has finished drawing with them.

Alternative 1: Manually add barriers

Add layout transitions and barriers where needed. This is difficult and can lead to unhandled
situations. It would also not be possible to support threaded rendering, as the state is tracked
globally and can be altered by other threads before the drawing is finished.

Most Vulkan tutorials and examples online use this approach.

* (+) Allows the best performance for fixed drawing, as barriers can be fine-tuned to the actual need.
* (-) Lower performance for dynamic drawing, as too many barriers need to be added.
* (-) Very hard to do efficient threading.
* (-) Hard to merge barriers, as they might be scattered around the codebase.

Alternative 2: State tracking

Automatically add layout transitions and barriers by tracking the state of each resource. When a
resource requires a different layout, perform the layout transition; when read-write hazards can
occur, add the appropriate barrier.

Examples of this approach can be found in VK_LAYER_synchronization_valid, webgpu and nice.graphics.
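
A minimal sketch of the state tracking idea, for images only. The enums are simplified, hypothetical
stand-ins for VkImageLayout and the Vulkan access flags, and the tracker is not taken from any of the
projects mentioned above. It remembers the last layout and access per image and reports when a
barrier has to be recorded before the next use.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Hypothetical, simplified stand-ins for VkImageLayout and access flags.
enum class Layout { Undefined, TransferDst, ShaderRead, ColorAttachment };
enum class Access { None, Read, Write };

struct Barrier {
  uint32_t image;
  Layout old_layout;
  Layout new_layout;
};

class ImageStateTracker {
  struct State {
    Layout layout = Layout::Undefined;
    Access last_access = Access::None;
  };
  std::unordered_map<uint32_t, State> states_;

 public:
  // Call before recording a command that uses `image`. Returns the barrier that has to be
  // recorded first, or nothing when the image can be used as is.
  std::optional<Barrier> use(uint32_t image, Layout layout, Access access)
  {
    State &state = states_[image];
    const bool needs_transition = state.layout != layout;
    // Read-after-write, write-after-write and write-after-read need a barrier;
    // read-after-read does not.
    const bool hazard = state.last_access == Access::Write ||
                        (state.last_access == Access::Read && access == Access::Write);
    std::optional<Barrier> barrier;
    if (needs_transition || hazard) {
      barrier = Barrier{image, state.layout, layout};
    }
    state.layout = layout;
    state.last_access = access;
    return barrier;
  }
};
```

The tracker itself is one shared piece of mutable state, which is why this approach still needs a
lock at a higher level once several threads record commands.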

* (+) Works without actually diving deep into the synchronization issue.
* (+) Multiple transitions and barriers can be merged into a single command, improving performance.
* (-) Performance really depends on the implementation. For example, webgpu adds too many barriers.
* (-) Threading still needs to lock at a higher level, as state tracking across multiple threads is hard.
* (-) VK_LAYER_synchronization_valid shows that this can become extra complicated depending on the
  features you want to support. It doesn't support timeline semaphores and reports false positives
  when the application is using them.

Alternative 3: Render Graphs

Render graphs record commands per thread. When the commands are flushed, the resource layout
transitions and barriers are added based on the order of execution. Commands can be reordered to
improve performance. The construction and submission of commands are guarded to ensure thread safety.

This approach is often used by game engines like Frostbite/Unreal/Unity as it gives better performance.
Reordering of commands can lead to fewer and leaner barriers. The implementation in Granite is described
at https://themaister.net/blog/2017/08/15/render-graphs-and-vulkan-a-deep-dive/ . The implementation
might not be the cleanest, but the post describes the steps and features that were added to the render graph.

* (+) Proven solution to track resources from multiple threads.
* (-) Adds another level of indirection and a lot of code, especially in the draw manager.
* (=) In the longer run we can integrate the level of indirection with the draw manager API.
* (+) Better performance due to reordering of commands, so that more similar commands can be executed
  in sequence. Data transfer commands that are issued inside a render pass can be moved before
  the render pass starts. Framebuffer layout transitions can be merged with render pass begin/end.
* (+) Might reduce the number of command buffers, as commands are recorded and sent to the GPU when
  GPU_flush is called.
* (=) Freeing temporary resources can be done as part of the render graph. The render graph knows
  when each resource can be safely removed, resulting in fewer unused memory allocations.
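
One of the reorderings mentioned above, moving data transfer commands out of a render pass, could look
roughly like the sketch below. The `Command` type and `hoist_copies` are hypothetical names, and the
sketch assumes that the copies do not depend on draws recorded earlier in the same pass; a real render
graph would check the read and write dependencies of each node before moving anything.

```cpp
#include <string>
#include <vector>

enum class Kind { BeginRenderPass, EndRenderPass, Draw, Copy };

struct Command {
  Kind kind;
  std::string label;
};

// Moves copy commands that were recorded inside a render pass in front of that pass.
// The relative order of the copies and of all other commands is preserved.
inline std::vector<Command> hoist_copies(const std::vector<Command> &commands)
{
  std::vector<Command> result;
  std::vector<Command> open_pass;  // Buffered commands of the render pass being recorded.
  bool in_pass = false;
  for (const Command &command : commands) {
    if (command.kind == Kind::BeginRenderPass) {
      in_pass = true;
    }
    if (!in_pass || command.kind == Kind::Copy) {
      // Outside a pass, or a copy that ends up in front of the buffered pass.
      result.push_back(command);
      continue;
    }
    open_pass.push_back(command);
    if (command.kind == Kind::EndRenderPass) {
      result.insert(result.end(), open_pass.begin(), open_pass.end());
      open_pass.clear();
      in_pass = false;
    }
  }
  // A pass without an end marker is emitted as recorded, minus its copies.
  result.insert(result.end(), open_pass.begin(), open_pass.end());
  return result;
}
```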

NOTE: #120174 describes an API change that will remove one level of indirection by performing the
indirection directly on the render graph of the GPUBackend. Due to its impact and because the API is
still unclear at this moment, I would not add it to the scope of the Vulkan project.

High level goal design

classDiagram
  class VKDevice
  class VKContext
  class VKRenderGraph {
    add_node()
    submit_buffer_for_read_back()
    submit_for_present()
  }
  class VKRenderGraphNode {
    operation
  }
  class VKResources
  class VKResource {
    buffer_handle: VkBuffer
    image_handle: VkImage
    current_layout: VkImageLayout
  }
  class VKCommandBuffer {
    draw()
    dispatch()
    copy()
    begin_render_pass()
    end_render_pass()
    pipeline_barrier()
  }
  
  VKDevice *--> VKResources: resources
  VKResources *--> VKResource: resources
  VKContext *--> VKRenderGraph: render_graph
  VKRenderGraph *--> VKRenderGraphNode: nodes
  VKRenderGraph *--> VKCommandBuffer: command_buffer
  VKResource o--> VKRenderGraphNode: producer
  VKRenderGraphNode o--> VKResource: reads_resources
  VKRenderGraphNode o--> VKResource: write_resources
  VKCommandBuffer ..> VKDevice: submit_to_device_queue

Any operation that leads to work on the GPU (draw, copy, dispatch) would create a node.
The node tracks the resources it reads and the resources it writes to (the producer/consumer pattern).
When writing to a resource, a new 'version' of that resource is created. Any action that happens
later will always use the last created version. The resource also keeps track of which node created
that version.
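
The versioning scheme described above can be sketched as follows. This is a simplified illustration
under my own assumptions, not the proposed VKRenderGraph: the class and member names are
hypothetical, resources carry no Vulkan handles, and nodes only store the versions they read and write.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <utility>
#include <vector>

using NodeIndex = size_t;
using ResourceIndex = size_t;
constexpr NodeIndex kNoProducer = std::numeric_limits<NodeIndex>::max();

struct ResourceVersion {
  ResourceIndex resource;
  uint32_t version;
};

struct Node {
  std::string operation;
  std::vector<ResourceVersion> reads;
  std::vector<ResourceVersion> writes;
};

class RenderGraphSketch {
  struct ResourceState {
    uint32_t version = 0;              // Latest version of the resource.
    NodeIndex producer = kNoProducer;  // Node that created the latest version.
  };
  std::vector<ResourceState> resources_;
  std::vector<Node> nodes_;

 public:
  ResourceIndex add_resource()
  {
    resources_.emplace_back();
    return resources_.size() - 1;
  }

  NodeIndex add_node(std::string operation,
                     const std::vector<ResourceIndex> &reads,
                     const std::vector<ResourceIndex> &writes)
  {
    const NodeIndex index = nodes_.size();
    Node node;
    node.operation = std::move(operation);
    // Reads always refer to the last version created before this node.
    for (const ResourceIndex resource : reads) {
      node.reads.push_back({resource, resources_[resource].version});
    }
    // Every write creates a new version and records this node as its producer.
    for (const ResourceIndex resource : writes) {
      ResourceState &state = resources_[resource];
      state.version++;
      state.producer = index;
      node.writes.push_back({resource, state.version});
    }
    nodes_.push_back(std::move(node));
    return index;
  }

  const Node &node(NodeIndex index) const { return nodes_[index]; }
  NodeIndex producer(ResourceIndex resource) const { return resources_[resource].producer; }
};
```

Because a reader stores the version it saw and the resource stores the producer of its latest version,
the dependency between a consumer and its producer can be looked up without searching the node list.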

Thread-locking

Locking concerning resources, adding nodes, converting nodes to commands and command submission happens
with a device-level mutex. The first implementation will also lock when adding a node to a context when
that node uses resources. Building of nodes, however, is done outside the lock to keep locking to a minimum.

flowchart
    CreateRenderGraphNode["Create render graph node"]
    subgraph "DeviceScopeLock"
    subgraph "Add resource"
        AddImage["Add image resource"]
        AddBuffer["Add buffer resource"]
    end 
    subgraph "Building render graph"
        AddRenderGraphNodeToGraph["Add render graph node to render graph"]
    end

    subgraph "Submitting rendergraph"
        BuildCommandBuffer["Convert Nodes To Commands"]
        SubmitCommandBufferToQueue["Submitting commands to device queue"]
    end
    end
    WaitForFinishedSubmission["Wait for commands to finish"]

    CreateRenderGraphNode --> AddRenderGraphNodeToGraph
    SubmitForPresent --> BuildCommandBuffer --> SubmitCommandBufferToQueue --> WaitForFinishedSubmission
    SubmitForReadback --> BuildCommandBuffer
    CreateImage --> AddImage
    CreateBuffer --> AddBuffer
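
A minimal sketch of this locking scheme, assuming one mutex owned by the device and shared by all
contexts. `DeviceSketch` and `ContextSketch` are hypothetical names, and commands are plain strings so
the example stays self-contained.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

struct Node {
  std::string operation;
};

// Hypothetical stand-in for the device: owns the lock and the queue.
struct DeviceSketch {
  std::mutex mutex;                // Device-level lock shared by all contexts.
  std::vector<std::string> queue;  // Submitted commands, guarded by `mutex`.
};

class ContextSketch {
  DeviceSketch &device_;
  std::vector<Node> nodes_;  // Guarded by the device lock, as nodes refer to shared resources.

 public:
  explicit ContextSketch(DeviceSketch &device) : device_(device) {}

  // The caller builds `node` before this call, so no lock is held while building.
  void add_node(Node node)
  {
    std::lock_guard<std::mutex> lock(device_.mutex);
    nodes_.push_back(std::move(node));
  }

  // Converts the nodes to commands and submits them, all under the device lock.
  // Returns the number of submitted commands.
  size_t flush()
  {
    std::lock_guard<std::mutex> lock(device_.mutex);
    for (const Node &node : nodes_) {
      device_.queue.push_back("cmd:" + node.operation);
    }
    const size_t submitted = nodes_.size();
    nodes_.clear();
    return submitted;
  }
};
```

In this sketch each context would live on its own thread. Waiting for the submitted commands to finish
is left out; following the flowchart above, it would happen after the lock has been released.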

Index based access

We should ensure that the graph only uses indices to refer to other nodes/resources so they can
be stored in a vector. The render graph adds and removes nodes very often, and storing them this
way limits the number of memory allocation operations.
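
A sketch of what index based storage could look like. The free list that recycles removed slots is
my own assumption about how allocations would be limited; the text above only asks for indices and
vector storage. `IndexedPool` is a hypothetical name.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Stores items in a vector and hands out indices instead of pointers. Indices stay valid when
// the vector grows, and removed slots are reused before any new memory is allocated.
template<typename T> class IndexedPool {
  std::vector<T> items_;
  std::vector<size_t> free_slots_;

 public:
  size_t add(T item)
  {
    if (!free_slots_.empty()) {
      const size_t index = free_slots_.back();
      free_slots_.pop_back();
      items_[index] = std::move(item);
      return index;
    }
    items_.push_back(std::move(item));
    return items_.size() - 1;
  }

  // Marks the slot as reusable. The caller must not use `index` again until `add` returns it.
  void remove(size_t index) { free_slots_.push_back(index); }

  T &operator[](size_t index) { return items_[index]; }

  size_t slot_count() const { return items_.size(); }
};
```

A pointer or reference into the vector would dangle as soon as the vector reallocates, which is the
main reason to refer to nodes and resources by index.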

Implementation approach

The development should be split into multiple smaller steps as I consider it a high risk project.
The checkboxes refer to steps that are tested inside the prototype.
Some intermediate steps could be:

* [x] Add the render graph data structure, but don't hook it up to the GPU API. Perhaps via a compile
  option.
* [x] Add test cases to test the render graph structure and some of its features.
* [x] Add a test to validate uploading/downloading data from the GPU (downloading data requires a flush/finish).
* [x] Add a test to validate uploading/downloading image data from the GPU (would require a layout transition).
* [x] Dispatch of an empty compute shader.
* [x] Dispatch of an empty compute shader with parameters (descriptor sets). Be sure that multiple
  dispatches can be scheduled and receive the correct parameters. Nodes are run sequentially.
* [x] Ensure that the compute dispatch test cases work with uniform buffers. Add more buffer types
  along the way.
* [x] Images and just-in-time layout transitions.
* [ ] Framebuffers and graphics pipelines.
* [ ] Try to run Blender with render graphs.
* [ ] Combine multiple barriers in the same command.
* [ ] Reorder commands that share the same pipeline and input resources.
* [ ] Add swap chain commands (present). I think it is possible to add GPU synchronization here as
  well due to the changes we made some months ago in the Metal/Vulkan backend.

I reckon this is 3 months of work, including stabilization and making sure it works on all platforms.
The goal would be to have better performance than OpenGL. This is feasible as OpenGL has to deal with
too many other situations that we don't need to consider.

The lead time includes status reporting twice a month using a demo/GPU trace and getting feedback from
Vulkan experts. It doesn't include the additional time I need to spend on platform support and general
project support.

Future developments

Blender currently doesn't support resource tracking between shader stages. E.g. the vertex shader could
already run while the fragment shader needs to wait until a certain resource becomes available. We could
add support for this in GPUShaderCreateInfo, where resource usages are tagged with the stages in which
they are needed.
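
As a rough idea of what such tagging could look like: the sketch below is entirely hypothetical and is
not part of GPUShaderCreateInfo. Each resource usage carries a bitmask of the stages that need it,
and a stage only has to wait for a pending resource when its bit is set.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stage bits, for illustration only.
enum ShaderStage : uint32_t {
  STAGE_VERTEX = 1u << 0,
  STAGE_FRAGMENT = 1u << 1,
  STAGE_COMPUTE = 1u << 2,
};

struct ResourceUsage {
  std::string name;
  uint32_t stages;  // Bitmask of ShaderStage values that use the resource.
};

// Returns true when `stage` uses `resource` and therefore has to wait until it is available.
inline bool stage_must_wait_for(const std::vector<ResourceUsage> &usages,
                                const std::string &resource,
                                ShaderStage stage)
{
  for (const ResourceUsage &usage : usages) {
    if (usage.name == resource && (usage.stages & stage) != 0) {
      return true;
    }
  }
  return false;
}
```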

Promote the render graph API to a GPU API (#120174) to remove one level of indirection.

How to involve the community?

Most of the work does not yet scale well to multiple developers. This makes it harder to involve
the community during development. We should do reporting and demoing twice a month (this
is included in the lead time). See
https://projects.blender.org/Jeroen-Bakker/documentation/src/commit/232fd2d00b22ef3267e317517c6b49fabf2f98a2/src/gpu-vulkan/2024q2-planning.md
for the analysis on this topic.

The Vulkan community will be informed about this design, and feedback will be requested and addressed.

