Vulkan: Synchronization/Render Graphs #118330

Open
opened 2024-02-15 16:12:23 +01:00 by Jeroen Bakker · 11 comments
Member

State: Approved in April 2024

Vulkan Synchronization

Synchronization is at the core of Vulkan. Sadly, it is also very hard to do right. Even expert
Vulkan developers have never seen applications that do it correctly. Because it is a timing issue,
the errors mostly do not result in visible artifacts and are never fixed (see no evil, hear no evil).

What is this synchronization issue?

Actual work on the GPU happens at a later time than when the CPU issues the command. The work that
is happening on the GPU requires buffers and images to be in a certain state or layout. To give some
context around the situation, let's go over how the splash screen in Blender draws its picture.

The term synchronization means that the application is responsible for the resource usage,
dependencies and layout transformations of the commands that are executed on the GPU. GPUs are "extremely"
parallel and we need to feed the GPU with enough parallelizable work so it can work efficiently.

In OpenGL most of this was transparent and dealt with by the OpenGL driver. In Metal this is partly a
driver responsibility (resource usage dependency) and partly an application responsibility (layouts &
usage hints). In Vulkan this responsibility shifted fully from the driver to the application.

Why this paradigm shift? An OpenGL driver has to deal with all possible use cases and is always
balancing between safety and performance. Making synchronization an application responsibility allows
applications to fine-tune what is actually needed and reduce overhead compared to a driver implementation.

What are layout transformations?

GPUs can improve performance by reordering the pixels inside an image to make better use of
caches. When sampling an image inside a shader, a regular sequential storage of pixels requires
more cache lines than when the pixels are stored in tiles/blocks. Needing fewer cache lines leaves
room in the cache for other relevant data and improves performance.

The actual pixel order can differ based on the GPU and on how the image is used. Different layouts
exist for copying data (data transfer), sampling, storage or when the texture is used as a framebuffer.
The order of the pixels is hidden from the user for IP reasons.
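
To make the cache argument concrete, here is a small sketch that counts the cache lines touched when a shader reads a 4x4 block of texels, once with row-major (linear) storage and once with tiled storage. Everything numeric here is an assumption for illustration only: 64-byte cache lines, 4 bytes per texel and 4x4 texel tiles. Real GPU layouts are vendor specific and not disclosed.

```python
CACHE_LINE_BYTES = 64   # assumed cache line size
BYTES_PER_TEXEL = 4     # assumed RGBA8 texels
TILE = 4                # assumed 4x4 texel tiles (4 * 4 * 4 bytes = one cache line)


def linear_offset(x, y, width):
    """Byte offset of texel (x, y) when rows are stored one after another."""
    return (y * width + x) * BYTES_PER_TEXEL


def tiled_offset(x, y, width):
    """Byte offset of texel (x, y) when the image is stored tile by tile."""
    tiles_per_row = width // TILE
    tile_index = (y // TILE) * tiles_per_row + (x // TILE)
    texel_in_tile = (y % TILE) * TILE + (x % TILE)
    return (tile_index * TILE * TILE + texel_in_tile) * BYTES_PER_TEXEL


def cache_lines_touched(offset_fn, x0, y0, size, width):
    """Number of distinct cache lines needed to read a size x size block."""
    lines = set()
    for y in range(y0, y0 + size):
        for x in range(x0, x0 + size):
            lines.add(offset_fn(x, y, width) // CACHE_LINE_BYTES)
    return len(lines)


WIDTH = 256
linear_lines = cache_lines_touched(linear_offset, 8, 8, 4, WIDTH)
tiled_lines = cache_lines_touched(tiled_offset, 8, 8, 4, WIDTH)
```

Under these assumptions the tile-aligned 4x4 block needs four cache lines in linear storage (one per row) and a single cache line in tiled storage.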

Example of synchronization

On the CPU side a temporary buffer is created and filled with pixels. This is sent to the GPU module to
construct a texture from it. This texture is drawn using a shader to an offscreen texture.
The layout of the texture is different when uploading the pixels than when it is used by the shader.
Note that the pixels on the CPU side are already freed before the actual drawing happens. Even
the GPUTexture is freed before the GPU has started drawing.

  • Uploading pixels requires the image to be in transfer destination layout.
  • Shader requires the image to be in shader read layout.
  • Shader can only start executing when the image is in shader read layout.
  • Framebuffer textures need to be in shader write state.
  • Image can only transform to shader read layout when all pixels have been uploaded to the GPU.
  • Uploading pixels can only happen when the image is in transfer destination layout and the staging
    buffer in transfer source layout.
  • Staging buffer can only

And this is just a simple example; other elements like parameters and framebuffer attachments have
been ignored.


stateDiagram
    ChangeStagingLayoutToTransferDestination --> CopyPixelsFromCPUToStaging
    CopyPixelsFromCPUToStaging --> ChangeStagingLayoutToTransferSource
    ChangeStagingLayoutToTransferSource --> CopyPixelsFromStagingToImage
    ChangeImageLayoutToTransferDestination --> CopyPixelsFromStagingToImage
    CopyPixelsFromStagingToImage --> ChangeImageLayoutToShaderRead
    ChangeImageLayoutToShaderRead --> Shader
    Shader--> [*]

Each action can only happen when its previous action has completed, and resources can only be
freed after the drawing has finished.
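
The ordering constraints in the diagram above can be sketched as a dependency table that is sorted topologically. This is only an illustration of the constraint structure, not Blender code; the action names are taken from the diagram and the helper functions are hypothetical.

```python
from graphlib import TopologicalSorter

# action -> actions that must have completed before it may start
DEPENDENCIES = {
    "CopyPixelsFromCPUToStaging": {"ChangeStagingLayoutToTransferDestination"},
    "ChangeStagingLayoutToTransferSource": {"CopyPixelsFromCPUToStaging"},
    "CopyPixelsFromStagingToImage": {
        "ChangeStagingLayoutToTransferSource",
        "ChangeImageLayoutToTransferDestination",
    },
    "ChangeImageLayoutToShaderRead": {"CopyPixelsFromStagingToImage"},
    "Shader": {"ChangeImageLayoutToShaderRead"},
}


def valid_order(dependencies):
    """Return one execution order in which every action follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def respects_dependencies(order, dependencies):
    """Check that each action appears after everything it depends on."""
    position = {action: index for index, action in enumerate(order)}
    return all(
        position[dependency] < position[action]
        for action, needed in dependencies.items()
        for dependency in needed
    )


order = valid_order(DEPENDENCIES)
```

Note that the image layout change to transfer destination has no ordering relation with the staging buffer actions, so more than one valid order exists. This freedom is what a scheduler can exploit.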

How does the industry solve this?

There are three approaches that are commonly used to solve this issue. The first is recording barriers in the
command buffer where they are needed. The second tracks the state of each resource, and the third
uses a render graph. #120174 provides more information on how other software frameworks and game engines
solve this issue.

What requirements do we have that need to be supported by the chosen solution?

  • A device can be used by multiple contexts (different CPU threads). This is needed to improve final
    image rendering and the GPU compositor. Both run in a background thread with their own context,
    but share resources like input/output images, textures etc.
  • Resources are used by multiple contexts.
  • An image resource can only be in a single layout at a time, but can be used by multiple contexts.
  • Resources can be freed before the drawing that uses them has finished.

Alternative 1: Manually add barriers

Add layout transitions and barriers where needed. This is difficult and can also lead to unhandled
situations. It would also not be possible to support threaded rendering as the state is tracked
globally and can be altered by other threads, before the drawing is finished.

Most Vulkan tutorials and examples online use this approach.

  • (+) Allows the best performance for fixed drawing as barriers can be fine-tuned to the actual need.
  • (-) Less performance for dynamic drawing as too many barriers need to be added.
  • (-) Very hard to do efficient threading.
  • (-) Hard to merge barriers, as they might be scattered around the codebase.

Alternative 2: State tracking

Automatically add layout transitions and barriers by tracking the state of each resource. When a
resource requires a different layout, perform the layout transition; when read-write hazards can occur, add
the appropriate barrier.

Examples of this approach are VK_LAYER_synchronization_valid, webgpu and nice.graphics.

  • (+) Works without actually diving deep into the synchronization issue.
  • (+) Multiple transitions and barriers can be merged into a single command, improving the performance.
  • (-) Performance really depends on the implementation. For example webgpu adds too many barriers.
  • (-) Threading still needs to lock on a higher level as state tracking on multiple threads is hard.
  • (-) VK_LAYER_synchronization_valid shows that this can get extra complicated depending on the features you
    want to support. It doesn't support timeline semaphores and reports false positives when the
    application is using them.
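
As a minimal sketch of the state tracking idea: keep the current layout per image and record a transition only when a command needs a different one. The `StateTracker` class and its command tuples are hypothetical; the layout strings are the names of real `VkImageLayout` values, used here as plain labels. A real tracker would also need execution and memory barriers for read-write hazards that involve no layout change, which is left out here.

```python
from dataclasses import dataclass, field


@dataclass
class StateTracker:
    """Tracks the current layout per image and records transitions on demand."""
    layouts: dict = field(default_factory=dict)    # image name -> current layout
    commands: list = field(default_factory=list)   # recorded command stream

    def use_image(self, image, required_layout, command):
        current = self.layouts.get(image, "VK_IMAGE_LAYOUT_UNDEFINED")
        if current != required_layout:
            # Layout differs: record a transition before the command itself.
            self.commands.append(("transition", image, current, required_layout))
            self.layouts[image] = required_layout
        self.commands.append((command, image))


tracker = StateTracker()
tracker.use_image("splash", "VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL", "copy")
tracker.use_image("splash", "VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL", "draw")
tracker.use_image("splash", "VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL", "draw")
transitions = [c for c in tracker.commands if c[0] == "transition"]
```

The second draw finds the image already in the required layout, so only two transitions are recorded for three commands. Because the tracker holds one shared state per image, using it from several threads requires locking around every `use_image` call, which is the threading drawback listed above.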

Alternative 3: Render Graphs

Render graphs record commands per thread. When the commands are flushed, the resource
layout transitions and barriers are added based on the order of execution. Commands can be
reordered to improve performance. The construction and submission of commands
are guarded to ensure thread safety.

This approach is often used by game engines like Frostbite/Unreal/Unity as it gives better performance.
Reordering of commands can lead to fewer and leaner barriers. A description of the implementation in Granite can be found
at https://themaister.net/blog/2017/08/15/render-graphs-and-vulkan-a-deep-dive/ . The implementation
might not be the cleanest, but the post describes the steps and features that were added to the render graph.

  • (+) Proven solution to track resources from multiple threads.
  • (-) Adds another level of indirection and a lot of code. Especially in the draw manager.
  • (=) In the longer run we can integrate the level of indirection with the draw manager API.
  • (+) Better performance due to reordering of commands, where more similar commands can be executed
    in sequence. Data transfer commands that are done inside a render pass, can be moved before
    the render pass starts. Framebuffer layout transitions can be merged with render pass begin/end.
  • (+) Might reduce the number of command buffers as commands are recorded and sent to the GPU when
    GPU_flush is called.
  • (=) Freeing temporary resources can be done as part of the render graph. The render graph knows
    when which resource can be safely removed, resulting in fewer unused memory allocations.

NOTE: #120174 describes an API change that will remove one level of indirection by performing the indirection directly on the
render graph of the GPUBackend. Due to its impact and unclear API at this moment I would not add it
to the scope of the Vulkan project.
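
One of the reorderings listed above, moving data transfer commands out of a render pass, can be sketched as follows. The tuple-based command list and the `hoist_copies` function are hypothetical. The sketch assumes that no hoisted copy depends on a draw inside the same pass, and that no earlier draw in the pass reads the copy destination; a real render graph would verify this from the node dependencies before reordering.

```python
def hoist_copies(commands):
    """Move copy commands recorded inside a render pass to just before it begins."""
    result = []
    pass_commands = None  # None while outside a render pass
    for command in commands:
        kind = command[0]
        if kind == "begin_render_pass":
            pass_commands = [command]          # start buffering the pass
        elif kind == "end_render_pass" and pass_commands is not None:
            pass_commands.append(command)
            result.extend(pass_commands)       # emit the whole pass after the copies
            pass_commands = None
        elif pass_commands is not None and kind == "copy":
            result.append(command)             # hoisted in front of the buffered pass
        elif pass_commands is not None:
            pass_commands.append(command)
        else:
            result.append(command)
    if pass_commands is not None:
        result.extend(pass_commands)           # unterminated pass: keep its commands
    return result


recorded = [
    ("begin_render_pass", "offscreen"),
    ("draw", "background"),
    ("copy", "staging", "splash"),
    ("draw", "splash"),
    ("end_render_pass", "offscreen"),
]
reordered = hoist_copies(recorded)
```

After reordering, the copy runs first and the render pass contains only draw commands, so the pass does not have to be interrupted for the transfer.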

High level goal design

classDiagram
  class VKDevice
  class VKContext
  class VKRenderGraph {
    add_node()
    submit_buffer_for_read_back()
    submit_for_present()
  }
  class VKRenderGraphNode {
    operation
  }
  class VKResources
  class VKResource {
    buffer_handle: VkBuffer
    image_handle: VkImage
    current_layout: VkImageLayout
  }
  class VKCommandBuffer {
    draw()
    dispatch()
    copy()
    begin_render_pass()
    end_render_pass()
    pipeline_barrier()
  }
  
  VKDevice *--> VKResources: resources
  VKResources *--> VKResource: resources
  VKContext *--> VKRenderGraph: render_graph
  VKRenderGraph *--> VKRenderGraphNode: nodes
  VKRenderGraph *--> VKCommandBuffer: command_buffer
  VKResource o--> VKRenderGraphNode: producer
  VKRenderGraphNode o--> VKResource: reads_resources
  VKRenderGraphNode o--> VKResource: write_resources
  VKCommandBuffer ..> VKDevice: submit_to_device_queue

Any operation that leads to work on the GPU (draw, copy, dispatch) creates a node.
The node tracks the resources it reads and the resources it writes to (the producer/consumer pattern).
When writing to a resource a new 'version' of that resource is created. Any action that happens
later will always use the last created version. The resource also keeps track of which node created
that version.
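
The node and version bookkeeping described above could look roughly like this. It is a simplified model and not the planned VKRenderGraph API: the Python classes, the `producers` lookup table and the `dependencies` helper are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    name: str
    version: int = 0                 # bumped on every write
    producer: Optional[int] = None   # index of the node that wrote this version


class RenderGraph:
    def __init__(self):
        self.nodes = []       # each node: (operation, reads, writes)
        self.resources = {}   # name -> Resource
        self.producers = {}   # (name, version) -> node index

    def add_resource(self, name):
        self.resources[name] = Resource(name)

    def add_node(self, operation, reads=(), writes=()):
        index = len(self.nodes)
        # Reads always refer to the last created version of a resource.
        read_versions = [(name, self.resources[name].version) for name in reads]
        write_versions = []
        for name in writes:
            resource = self.resources[name]
            resource.version += 1      # writing creates a new version
            resource.producer = index  # remember which node created it
            self.producers[(name, resource.version)] = index
            write_versions.append((name, resource.version))
        self.nodes.append((operation, read_versions, write_versions))
        return index

    def dependencies(self, index):
        """Indices of the nodes that produced the versions this node reads."""
        _, read_versions, _ = self.nodes[index]
        return sorted({self.producers[key] for key in read_versions if key in self.producers})


graph = RenderGraph()
for name in ("staging", "image", "offscreen"):
    graph.add_resource(name)
upload = graph.add_node("copy", reads=["staging"], writes=["image"])
draw = graph.add_node("draw", reads=["image"], writes=["offscreen"])
```

Here the draw node reads version 1 of `image`, which was produced by the copy node, so the graph knows the draw has to wait for the copy without either call site mentioning a barrier.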

Thread-locking

Locking concerning resources, adding nodes, converting nodes to commands and command submission happens
with a device-level mutex. The first implementation will also lock when adding a node to a context when that
node uses resources. Building of nodes, however, is done outside the lock to keep the locking to a minimum.

flowchart
    CreateRenderGraphNode["Create render graph node"]
    subgraph "DeviceScopeLock"
    subgraph "Add resource"
        AddImage["Add image resource"]
        AddBuffer["Add buffer resource"]
    end 
    subgraph "Building render graph"
        AddRenderGraphNodeToGraph["Add render graph node to render graph"]
    end

    subgraph "Submitting rendergraph"
        BuildCommandBuffer["Convert Nodes To Commands"]
        SubmitCommandBufferToQueue["Submitting commands to device queue"]
    end
    end
    WaitForFinishedSubmission["Wait for commands to finish"]

    CreateRenderGraphNode --> AddRenderGraphNodeToGraph
    SubmitForPresent --> BuildCommandBuffer --> SubmitCommandBufferToQueue --> WaitForFinishedSubmission
    SubmitForReadback --> BuildCommandBuffer
    CreateImage --> AddImage
    CreateBuffer --> AddBuffer
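
A minimal sketch of this locking scheme, assuming two contexts that each record from their own thread. The `Device` and `Context` classes here are stand-ins and not the VKDevice/VKContext classes; the `queue` list stands in for the device queue.

```python
import threading


class Device:
    def __init__(self):
        self.mutex = threading.Lock()  # device-level mutex
        self.queue = []                # stands in for the device queue


class Context:
    def __init__(self, device, name):
        self.device = device
        self.name = name
        self.nodes = []                # this context's render graph

    def add_node(self, operation):
        node = (self.name, operation)  # building the node: no lock held
        with self.device.mutex:        # adding it to the graph: locked
            self.nodes.append(node)

    def flush(self):
        with self.device.mutex:        # converting and submitting: locked
            self.device.queue.extend(self.nodes)
            self.nodes.clear()


def render(context, count):
    for i in range(count):
        context.add_node(f"draw-{i}")
    context.flush()


device = Device()
contexts = [Context(device, "viewport"), Context(device, "compositor")]
threads = [threading.Thread(target=render, args=(c, 100)) for c in contexts]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Because the whole flush happens under the device mutex, the commands of one context reach the queue as one contiguous block in recording order, whichever thread flushes first.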

Index based access

We should ensure that the graph only uses indices to refer to other nodes/resources so they can
be stored in a vector. The render graph adds and removes nodes very often, and storing them this way
will limit memory allocation operations.
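
One way to get index-based access is a pool that stores nodes in a single list and hands out indices as handles. The `NodePool` class and its free list for reusing slots are assumptions made for this example; the design only states that indices and vector storage should be used. Python does not show the allocation savings, so the sketch only illustrates the indexing scheme.

```python
class NodePool:
    """Stores nodes contiguously; other nodes and resources refer to them by index."""

    def __init__(self):
        self.nodes = []   # contiguous storage
        self.free = []    # indices of released slots, ready for reuse

    def add(self, operation, reads=(), writes=()):
        node = {"operation": operation, "reads": list(reads), "writes": list(writes)}
        if self.free:
            index = self.free.pop()      # reuse a released slot
            self.nodes[index] = node
        else:
            index = len(self.nodes)      # grow the storage
            self.nodes.append(node)
        return index

    def remove(self, index):
        self.nodes[index] = None
        self.free.append(index)

    def get(self, index):
        return self.nodes[index]


pool = NodePool()
copy_index = pool.add("copy", reads=[0], writes=[1])   # resource indices, not pointers
draw_index = pool.add("draw", reads=[1], writes=[2])
pool.remove(copy_index)
dispatch_index = pool.add("dispatch", reads=[2], writes=[3])
```

Since handles are plain integers, the storage can grow or be reused without invalidating references held elsewhere in the graph.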

Implementation approach

The development should be split into multiple smaller steps as I consider it a high risk project.
The checkboxes refer to steps that are tested inside the prototype.
Some intermediate steps could be:

  • [x] Add render graph data structure, but don't hook it up to the GPU API. Perhaps via a compile
    option.
  • [x] Add test cases to test render graph structure and some of its features.
  • [x] Add test to validate uploading/downloading data from GPU (downloading data requires a flush/finish).
  • [x] Add test to validate uploading/downloading image data from GPU (would require layout transition).
  • [x] Dispatch of an empty compute shader.
  • [x] Dispatch of an empty compute shader with parameters (descriptor sets). Be sure that multiple
    dispatches can be scheduled and receive the correct parameters. Nodes are run sequentially.
  • [x] Ensure that compute dispatch test cases work with uniform buffers. Add more buffer types along the
    way.
  • [x] Images and just-in-time layout transitions.
  • [ ] Framebuffers and graphics pipelines.
  • [ ] Try to run Blender with render graphs.
  • [ ] Combine multiple barriers in the same command.
  • [ ] Reorder commands that share the same pipeline and input resources.
  • [ ] Add swap chain commands (present). I think it is possible to add GPU synchronization here as well
    due to the changes we made some months ago in the Metal/Vulkan backend.

I reckon this is 3 months of work including stabilization and making sure it works on all platforms.
The goal would be to have better performance than OpenGL. This is feasible as OpenGL has to deal with
too many other situations that we don't need to consider.

The lead time includes twice-monthly status reporting using a demo/GPU trace and getting feedback from
Vulkan experts. It doesn't include the additional time I need to spend on platform support and general project
support.

Future developments

Blender currently doesn't support resource tracking between shader stages. E.g. the vertex shader can already
run, but the fragment shader needs to wait until a certain resource becomes available. We could add support
for this in the GPUShaderCreateInfo, where resource usages are tagged with the stages in which they are needed.
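
A rough sketch of what such stage tagging would enable. It does not use the real GPUShaderCreateInfo API: the `usage` table, the stage names and `first_waiting_stage` are all hypothetical. Given the stages in which each resource is needed, the earliest stage that has to wait for a pending resource can be looked up, and every stage before it is free to run.

```python
STAGE_ORDER = ["VERTEX", "FRAGMENT"]   # simplified graphics pipeline order


def first_waiting_stage(resource_stages, pending_resource):
    """Earliest shader stage that has to wait for `pending_resource`, or None."""
    stages = resource_stages.get(pending_resource, set())
    for stage in STAGE_ORDER:
        if stage in stages:
            return stage
    return None


# Hypothetical tagging: resource name -> stages in which it is used.
usage = {
    "vertex_data": {"VERTEX"},
    "shadow_map": {"FRAGMENT"},
    "view_matrices": {"VERTEX", "FRAGMENT"},
}
waits_on_shadow_map = first_waiting_stage(usage, "shadow_map")
waits_on_matrices = first_waiting_stage(usage, "view_matrices")
```

With this tagging, a pending `shadow_map` only blocks the fragment stage, so the vertex stage can already run, which is the situation described above.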

Promote the render graph API to a GPU API (#120174) to remove one level of indirection.

How to involve the community?

Most of the work does not scale well to multiple developers yet. This makes it harder to involve
the community during the development. We should do reporting and demoing twice a month (this
is included in the lead time). See
https://projects.blender.org/Jeroen-Bakker/documentation/src/commit/232fd2d00b22ef3267e317517c6b49fabf2f98a2/src/gpu-vulkan/2024q2-planning.md
for the analysis on this topic.

The Vulkan community will be informed about this design, and feedback will be requested and addressed.

Jeroen Bakker added the Interest/Vulkan and Type/Design labels 2024-02-15 16:12:24 +01:00
Jeroen Bakker self-assigned this 2024-02-15 16:12:24 +01:00
Jeroen Bakker added this to the EEVEE & Viewport project 2024-02-15 16:12:27 +01:00
Contributor

Why not mention transitions in the render pass itself?
By considering Image transitions separately outside and inside the render pass, there are far fewer situations in which barriers need to be considered.
Most render passes in Blender achieve smooth transitions by keeping the initial and final layouts the same.
Please answer this question.

Contributor

Making synchronization an application responsibility allows applications to fine-tune what is actually needed
and reduce overhead compared to a driver implementation.

What do you think about these costs specifically?
The current main branch implementation generates VkRenderPass, VkFramebuffer, VkPipeline, and VkDescriptorSet effectively an infinite number of times. In my PR, I was able to make this almost static.

Contributor

Although memory accesses on the GPU are said to be invisible to the user, a debugger can give you a pretty detailed view of what's going on. GPU minimum core local memory,
Within the SM, the access range increases in the order of Shared-Memory, L1-Memory, and L2-Memory.
Various utilities are available to manage memory access.
Vulkan ray tracing also allows efficient use of local memory.
If you want to use tensor cores, you can use ComputeShader or MeshLetShader to do things like co-operative Matrix.
For the L1 area, possible methods include staging Device-Local memory.
Also, subpass allows you to freeze a pixel and perform the next write.

Contributor

Shader requires the image to be in shader read layout.

It doesn't particularly have to be Shader-ReadOnly.

Framebuffer textures needs to be in shader write state

Please distinguish between shader write state and attachment state.

image can only transform to shader read layout, when all pixels have been uploaded to the GPU

The situation of random access is abstracted.

Your synchronization example is too simple.
Please explain why there is VkSubpassDependency in the render pass.
By using this VkSubpassDependency many synchronization problems were solved.

Contributor

> Alternative 1: Manually add barriers

This is the current situation.
In particular, the part that executes the pipeline barrier in the "VKDescriptorSetTracker::update" function is unstable.
This is because the layout differs depending on the "submission" status. To solve this, we embedded a transition structure within the render pass.

Contributor

> Alternative 2: State tracking

I don't understand the difference between the situations in option 2 and option 1.
Rather, this problem is solved by providing a more detailed barrier API for Draw-Module. In other words, it is an aspect that is needed as a complement rather than a substitute.

Contributor

> proven solution to track resources from multiple thread.

No matter how much the GPU-Module tracks the resources, isn't the Draw-Module the one who knows the real timing?

> In the long term, indirection levels can be integrated with the draw manager API.

The story will go awry.
We need to discuss the design of building the `VkCommandBuffer` for each individual draw pass.

> Render graphs records commands from multiple threads and when flushing the commands the resource
> layout transitions and barriers are included based on the order of execution. Commands can be
> reordered to improve performance.

There's nothing particularly new about it. It's obvious.
But you don't mention anything about transitions within `RenderPass`.

Contributor

> High-level goal design

This design is a good direction.
However, the bottleneck is that `current_layout` is required.
In other words, as I have repeatedly argued, if a resource exists outside of its Renderpass, that resource always needs one layout.
Then you don't have to follow me around like a detective, right?
I named it best-layout (you can call it whatever you want). This is useful because you can insert this name and change its

Author
Member

I will try to reply to all your questions and comments.

  • I would not decide at this point which command carries the barriers (pipeline barrier or render pass begin/end). As commands are reordered, it is advisable to use whatever is best suited at that point. This also applies to framebuffer attachments. We aim for a situation where nodes that originate from different threads (render thread, composite thread, viewport thread) are scheduled in the same queue submit, so artists are not blocked in their work. In viewport rendering the framebuffer attachment can already be prepared for use in viewport compositing. This saves an additional transformation on specific GPU devices.
  • Pipelines and render passes are part of the reordering and would be constructed just-in-time. The reordering would reduce pipeline construction.
  • Framebuffers (and clearing) can be optimized as part of the render graph, so there is no need to track them outside.
  • Descriptor sets will become an offset into a large buffer to which all descriptor sets used by the current submission are uploaded.
  • Subpasses are only beneficial on tile-based GPUs. I would not focus on those at this moment. We already have this in place for Metal and eventually we might introduce it for Vulkan as well.
  • I would not say that alternative 1 is the current situation, as there are semaphores in place, so we don't add barriers. The current solution doesn't have any barriers in place as it is focused on getting the correct pixels on the screen. This design is about the upcoming 6 months, in which we want to improve its performance. Alternative 1 describes one way to improve the performance, which is often chosen by smaller applications.
  • All this should be done without altering the GPU API. I mentioned that the Draw API can eventually merge with the render graph, but that is not part of the initial scope.
  • For alternative 2, please check how the provided tools and framework differ from manually adding barriers. Memory barriers are very difficult to add, and no application has done this correctly (not my words). So adding an API that is expected to be understood and used well by higher-level application developers is just asking for more problems, and doesn't solve the issue that we are currently facing. We should provide APIs that are understandable to regular developers without in-depth knowledge of how GPUs work. This is one of the reasons why we don't simply accept most changes to the GPU API.
  • The Draw module cannot track resources in a multi-threaded environment. This can only be done at device level and submission level. Resource tracking at a higher level (draw module) would lead to incorrect state tracking or larger-scope device locks.
  • Tracking a command buffer for each draw pass doesn't allow optimizing between multiple draw passes, and it isn't necessary to add this limitation.
  • This is a high-level approach/design. Details have been left out, as some parts of the design will require more analysis. I expect that to happen as part of the development.
  • I know you've argued that current layout tracking is a bottleneck. If render graphs are done right, it isn't, and tracking is also beneficial because it improves performance as fewer transitions are needed. Resources don't have a single layout when they exist outside a render pass. Blender is a content creation suite, and the core of that definition is that resources are altered and therefore require different layouts over time. Yes, a best-layout could be added per resource, but I would rather do that when it is clear what the full benefit is. The benefit is currently unclear, and we should first find the bottlenecks of a fully working Blender before adding solutions based on intermediate situations. I am not only looking at opening a scene and optimizing it for drawing/rendering. Users work with scenes in many different ways, which leads to different aspects of a final solution. Eventually I can see that a resource can get a best layout that is used between submissions/flushes, so a much larger scope than you're arguing. How this will eventually impact the code is still unclear as well. Perhaps it is visible in code, but it could also be visible only in the running state.

A must for this project is to take small steps towards the goal implementation. We must also ensure that regular developers can work with the solutions we provide. As I see it, this doesn't require any changes to the GPU or Draw Module API, only changes inside the Vulkan backend.
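
To make the layout tracking idea concrete, here is a minimal sketch of how a render graph could derive transitions at flush time from the order of recorded nodes. It is not the Blender implementation: `RenderGraphSketch`, `Layout`, `Node` and `Transition` are hypothetical names, and `Layout` is a stand-in for `VkImageLayout`. It only tracks layouts. A real backend would also need to track access and stage hazards and would translate each transition into an actual barrier or render pass transition.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for VkImageLayout; not the real Vulkan enum.
enum class Layout { Undefined, TransferDst, ShaderRead, ColorAttachment };

using ResourceId = std::size_t;

// One recorded command: the resource it touches and the layout it requires.
struct Node {
  ResourceId resource;
  Layout required;
};

// A transition the graph decided to insert before node `before_node`.
struct Transition {
  std::size_t before_node;
  ResourceId resource;
  Layout from;
  Layout to;
};

class RenderGraphSketch {
 public:
  // Register a resource; it starts in the undefined layout.
  ResourceId add_resource()
  {
    layouts_.push_back(Layout::Undefined);
    return layouts_.size() - 1;
  }

  // Record a command. No transition is decided here, only at flush time.
  void add_node(ResourceId resource, Layout required)
  {
    assert(resource < layouts_.size());
    nodes_.push_back({resource, required});
  }

  // Walk the nodes in execution order and emit a transition only when the
  // tracked layout differs from the required one. The tracked layout is kept
  // after the flush, so the next flush starts from the last known state.
  std::vector<Transition> flush()
  {
    std::vector<Transition> transitions;
    for (std::size_t i = 0; i < nodes_.size(); i++) {
      const Node &node = nodes_[i];
      Layout &current = layouts_[node.resource];
      if (current != node.required) {
        transitions.push_back({i, node.resource, current, node.required});
        current = node.required;
      }
    }
    nodes_.clear();
    return transitions;
  }

  Layout current_layout(ResourceId resource) const
  {
    assert(resource < layouts_.size());
    return layouts_[resource];
  }

 private:
  std::vector<Layout> layouts_;
  std::vector<Node> nodes_;
};
```

In this sketch an upload followed by two sampling nodes produces two transitions rather than three, and a later flush that only samples the same image produces none, because the tracked layout survives between flushes.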

Contributor

> Pipelines and renderpasses are part of the reordering and would be constructed just-in-time. the reordering would reduce the pipeline construction.

The disadvantage of messing with static structures is what the data says. (Not my opinion.)

> framebuffers (and clearing) can be optimized as part of the render graph, so no need to track it outside.

The distinction between framebuffer and render pass is ambiguous.
In order to successfully operate the framebuffer, we need to add dynamic states, right?
The logic doesn't make sense.

> descriptor sets will become an offset in a large buffers where all descriptor sets are uploaded used by the current submission.

Do you know how many DescriptorSets the main branch implementation generates per second?
This has also been reduced to 0 times in my PR.

> subpasses are only benificial on tile based GPUs. I would not focus on those at this moment. We already have this in place for Metal and eventually we might also introduce it for Vulkan as well.

I have already implemented subpasses in my branch.
I don't understand why you don't make it public and pursue more sensitive issues.

> I would not say that alternative 1 is the current situation as there are semaphores ensuring that we don't add barriers.

All of your semaphore implementations are blocking. There is no particular difference from fences. It's sophistry.

> (not my words).

Whose words are these?
Do you have the right to use other people's words?
I say it's easy after actually implementing it.

> This is one of the reasons why we don't simply accept most changes to the GPU API.

Understood. If you deny it like that, it will be impossible for me to have any further relationship with you.

> Draw module cannot track resources in a multi-threaded environment. This can only be done on device level and submission level. Resource tracking on a higher level (draw module) would lead to incorrect state tracking or larger scope device locks.

That is correct. The implementation is full of deadlocks. This is a warning to everyone. Using the current implementation with RenderDoc many times will destroy the CPU registry and thus the hardware itself. If you have the money, please give it a try.

> I know you've argued that current layout tracking is a bottle-neck. If render graphs are done right, it doesn't and it would also be beneficial to do as it will improve performance as less transitions are needed. Resources don't have a single layout when they exist outside a render pass. Blender is a content creation suite and it's core of this definition is that resources are altered, and therefore requires different layouts in time. Yes per resource a best-layout could be added, but would rather do that when it is clear what the full benefit is. The benefit currently is unclear and we should first find the bottlenecks of a fully working Blender, before adding solutions based on intermediate situations. I am not looking only at opening a scene and optimize it for drawing/rendering. Users are working with scene in many different ways which leads to different aspects of a final solution. Eventually I can see that a resource can get a best layout that is used between submissions/flushes so much larger scope then you're arguing. How this will eventually impact the code is still unclear as well. Perhaps it is visible in code, but could also be only visible in running state.

This is disrespectful to Blender's previous contributors.
I'm sure the average programmer tries to make sure that resources are always static when outside of Renderpass.
GPUModule has been OpenGL = proprietary. It's the artificial programmers who prefer to rely on parts that aren't OpenSource.
Artificial intelligence continues to infringe on copyright and art becomes a watered-down jellyfish. That's just the paradigm of the times.
However, even if the word art is lost, its content is not.

Contributor

> As I see it this doesn't require any changes to the GPU or Draw Module API.

Understood. Let's close.

Jeroen Bakker added this to the 4.2 LTS milestone 2024-04-08 08:01:01 +02:00
Reference: blender/blender#118330