Vulkan: Synchronization/Render Graphs #118330
Reference: blender/blender#118330
State: Approved in April 2024
Vulkan Synchronization
Synchronization is at the core of Vulkan. Sadly, it is also very hard to get right. Even expert
Vulkan developers have rarely seen applications that do it correctly. Because it is mostly a timing issue,
the errors do not result in visible artifacts and are never fixed at all. (see no evil, hear no evil)
What is this synchronization issue?
The actual work on the GPU happens at a later stage than when the CPU issues the command. The work that
happens on the GPU requires buffers and images to be in a certain state or layout. To give some
context around the situation, let's go over how the splash screen in Blender draws its picture.
The term synchronization refers to the fact that the application is responsible for resource usage,
dependencies, and layout transitions at the time commands are executed on the GPU. GPUs are extremely
parallel, and we need to feed the GPU enough parallelizable work so it can run efficiently.
In OpenGL most of this was transparent and handled by the OpenGL driver. In Metal this is partly a
driver responsibility (resource usage dependencies) and partly an application responsibility (layouts &
usage hints). In Vulkan this responsibility has shifted fully from the driver to the application.
Why this paradigm shift? An OpenGL driver has to deal with all possible use cases and is always
balancing safety against performance. Making synchronization an application responsibility allows
applications to fine-tune what is actually needed and reduce overhead compared to a driver implementation.
What are layout transformations?
When using images, GPUs can improve performance by reordering the pixels inside an image to
improve cache behavior. When sampling an image inside a shader, a regular sequential storage of pixels requires
more cache lines than when the pixels are stored in tiles/blocks. Needing fewer cache lines allows the
cache to hold more relevant data, improving performance.
The actual pixel order can differ based on the GPU and on how the image is used. Different layouts
exist for copying data (transfer), sampling, storage, or when the texture is used as a framebuffer attachment.
The actual order of the pixels is hidden from the user for IP reasons.
Example of synchronization
On the CPU side a temporary buffer is created and filled with pixels. This is sent to the GPU module to
construct a texture from it. The texture is then drawn with a shader to an offscreen texture.
The layout of the texture is different when uploading the pixels than when it is used in the shader;
during the upload, the pixels are copied from a staging buffer in the transfer-source layout.
Noteworthy: the pixels on the CPU side are already freed before the actual drawing happens. Even
the GPUTexture is freed before the GPU has started drawing.
And this is just a simple example; other elements like parameters and framebuffer attachments have
been ignored.
Each action can only happen when its previous action has completed. And resources can only be
freed after the drawing has finished.
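The ordering constraints above can be sketched as a tiny dependency model. This is an illustrative sketch with hypothetical names, not Blender code: each action may only run once the actions it depends on have completed, which is also why resources may only be freed afterwards.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Command:
    name: str
    deps: list = field(default_factory=list)  # commands that must finish first

def execute(commands):
    """Run commands in any order that respects the dependencies."""
    done, order, pending = set(), [], list(commands)
    while pending:
        ready = next((c for c in pending if all(d in done for d in c.deps)), None)
        if ready is None:
            raise RuntimeError("dependency cycle")
        pending.remove(ready)
        done.add(ready)
        order.append(ready.name)
    return order

upload = Command("upload pixels to texture")
draw = Command("draw texture to offscreen", deps=[upload])
free_texture = Command("free GPUTexture", deps=[draw])

# Submission order doesn't matter; completion order is forced by the deps.
completion = execute([free_texture, draw, upload])
```

However the commands are submitted, the free can only complete after the draw, which can only complete after the upload.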
How does the industry solve this?
There are three approaches commonly used to solve this issue. One is recording barriers in the
command buffer where needed, the second tracks the state of each resource, and the third uses
a render graph. #120174 provides more information on how other software frameworks and game engines
solve this issue.
What requirements do we have that need to be supported by the chosen solution?
Image rendering and also the GPU compositor: both happen in a background thread with their own context,
but share resources like input/output images, textures, etc.
Alternative 1: Manually add barriers
Add layout transitions and barriers where needed. This is difficult and can also lead to unhandled
situations. It would also not be possible to support threaded rendering, as the state is tracked
globally and can be altered by other threads before the drawing has finished.
Most Vulkan tutorials and examples online use this approach.
Alternative 2: State tracking
Automatically add layout transitions and barriers by tracking the state of each resource. When a
resource requires a different layout, perform the transition; when read/write hazards can occur, add
the appropriate barrier.
Examples of this approach are VK_LAYER_synchronization_valid, webgpu and nice.graphics. However,
this doesn't cover everything we want to support: they don't support timeline semaphores and report
false positives when the application is using them.
Alternative 3: Render Graphs
Render graphs record commands per thread; when the commands are flushed, the resource
layout transitions and barriers are inserted based on the order of execution. Commands can be
reordered to improve performance. The construction of commands and their submission are guarded
to ensure thread safety.
This approach is often taken by game engines like Frostbite/Unreal/Unity as it yields more performance.
Reordering of commands can lead to fewer and leaner barriers: data transfer commands that are issued
inside a render pass can be moved before the render pass starts, and framebuffer layout transitions can
be merged with render pass begin/end. The graph is flushed when GPU_flush is called, and it knows when
each resource can be safely removed, resulting in fewer unused memory allocations.
An implementation in Granite can be found at
https://themaister.net/blog/2017/08/15/render-graphs-and-vulkan-a-deep-dive/ . The implementation
might not be the cleanest, but it describes the steps and features that were added to the render graph.
High level goal design
Any operation that leads to work on the GPU (draw, copy, dispatch) creates a node.
The node tracks the resources it needs (reads) and the resources it writes to (the producer/consumer pattern).
When writing to a resource, a new 'version' of that resource is created. Any action that happens
later will always use the last created version. The resource also keeps track of which node created
that version.
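The versioning scheme above can be sketched as follows. Names are hypothetical and this is not the planned implementation, only an illustration: each write creates a new version of a resource, reads always consume the latest version, and every version remembers which node produced it.

```python
class RenderGraph:
    def __init__(self):
        self.nodes = []     # node index -> (name, read_versions, writes)
        self.versions = {}  # resource -> (version, producer node index)

    def add_node(self, name, reads=(), writes=()):
        index = len(self.nodes)
        # Reads consume the latest version of each resource (None = external).
        read_versions = {r: self.versions.get(r, (0, None)) for r in reads}
        self.nodes.append((name, read_versions, list(writes)))
        for resource in writes:
            version = self.versions.get(resource, (0, None))[0] + 1
            self.versions[resource] = (version, index)  # this node produced it
        return index

graph = RenderGraph()
upload = graph.add_node("upload", writes=["splash"])
draw = graph.add_node("draw", reads=["splash"], writes=["offscreen"])
```

When flushing, the producer index stored with each version a node reads is exactly the dependency edge that tells the backend where a barrier or layout transition is needed.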
Thread-locking
Locking of resources, adding nodes, converting nodes to commands, and command submission happen
under a device-level mutex. The first implementation will also lock when adding a node to a context when that
node uses resources. Building nodes, however, is done outside the lock to keep locking to a minimum.
Index based access
We should ensure that the graph only uses indices to refer to other nodes/resources so they can
be stored in a vector. The render graph adds and removes nodes very often, and this limits memory
allocation operations.
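A tiny sketch of what index-based access looks like (hypothetical names): nodes live in one flat list and refer to each other by integer index instead of by pointer, so the storage can stay a plain vector and adding a node needs no per-node heap allocation or pointer fix-ups when the vector grows.

```python
nodes = []  # flat storage; a real implementation would also recycle free slots

def add_node(name, deps=()):
    """Append a node and return its handle: the index into `nodes`."""
    handle = len(nodes)
    nodes.append({"name": name, "deps": list(deps)})
    return handle

upload = add_node("upload")
draw = add_node("draw", deps=[upload])
```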
Implementation approach
The development should be split into multiple smaller steps, as I consider this a high-risk project.
The checkboxes refer to steps that are tested inside the prototype.
Some intermediate steps are an option due to the changes we made some months ago in the Metal/Vulkan backend.
I reckon this is 3 months of work, including stabilization and making sure it works on all platforms.
The goal would be to have better performance than OpenGL. This is feasible as OpenGL has to deal with
too many other situations that we don't need to consider.
The lead time includes twice-monthly status reporting using a demo/GPU trace and getting feedback from
Vulkan experts. It doesn't include the additional time I need to spend on platform support and general
project support.
Future developments
Blender currently doesn't support resource tracking between shader stages. E.g. the vertex shader can already
run, but the fragment shader needs to wait until a certain resource becomes available. We could add support
for this in the GPUShaderCreateInfo, where resource usages are tagged with the stages in which they are needed.
We could also promote the render graph API to a GPU API (#120174) to remove one level of indirection.
How to involve the community?
Most of the work cannot scale that well yet to multiple developers. This makes it harder to involve
the community during the development. We should do reporting and demoing twice a month (this
is included in the lead time). See
232fd2d00b/src/gpu-vulkan/2024q2-planning.md
for the analysis on this topic.
The Vulkan community will be informed about this design, and feedback will be asked for and addressed.
Why not mention transitions in the render pass itself?
By considering image transitions separately outside and inside the render pass, there are far fewer situations in which barriers need to be considered.
Most render passes in Blender achieve smooth transitions by keeping the initial and final layouts the same.
Please answer this question.
What do you think about these costs specifically?
The current main branch implementation generates VkRenderPass, VkFramebuffer, VkPipeline, and
VkDescriptorSet objects effectively an unbounded number of times. In my PR, I was able to make these almost static.
Although memory accesses on the GPU are said to be invisible to the user, a debugger can give you a pretty
detailed view of what's going on. Starting from the GPU's minimum core-local memory, within the SM the
access range increases in the order of shared memory, L1 memory, and L2 memory.
Various utilities are available to manage memory access.
Vulkan ray tracing also allows efficient use of local memory.
If you want to use tensor cores, you can use a compute shader or meshlet shader to do things like
cooperative matrix operations. For the L1 area, possible methods include staging device-local memory.
Also, a subpass allows you to freeze a pixel and perform the next write; it doesn't particularly have
to be Shader-ReadOnly.
Please distinguish between the shader write state and the attachment state.
The situation of random access is abstracted.
Your synchronization example is too simple. Please explain why there is a VkSubpassDependency in the
render pass. By using VkSubpassDependency, many synchronization problems were solved. This is the
current situation.
In particular, the part that executes the pipeline barrier in the VKDescriptorSetTracker::update function is unstable.
This is because the layout differs depending on the submission status. To solve this, we embedded a transition structure within the render pass.
I don't understand the difference between the situations in option 2 and option 1.
Rather, this problem is solved by providing a more detailed barrier API for the draw module. In other words, it is needed as a complement rather than a substitute.
No matter how much the GPU module tracks the resources, isn't the draw module the one that knows the real timing?
Otherwise the story will go awry.
We need to discuss the design of building the VkCommandBuffer for each individual draw pass.
There's nothing particularly new about it; it's obvious. But you don't mention anything about
transitions within a RenderPass.
This design is a good direction. However, the bottleneck is that current_layout is required.
In other words, as I have repeatedly argued, if a resource exists outside of a render pass, it always needs one layout for that resource.
Then you don't have to follow it around like a detective, right?
I named it best-layout (you can call it whatever you want). This is useful because you can insert this name and change its
I try to reply to all your questions and comments.
A must for this project is to take small steps toward the goal implementation. We also must ensure that regular developers can work with the solutions we provide. As I see it, this doesn't require any changes to the GPU or Draw module API, only changes inside the Vulkan backend.
The disadvantage of messing with static structures is what the data says. (Not my opinion.)
The distinction between framebuffer and render pass is ambiguous.
In order to successfully operate the framebuffer, we need to add dynamic states, right?
The logic doesn't make sense.
Do you know how many descriptor sets the main branch implementation generates per second?
This has been reduced to zero in my PR.
I have already implemented subpasses in my branch.
I don't understand why you don't make it public and pursue more sensitive issues.
All of your semaphore implementations block. There is no particular difference from fences. It's sophistry.
Whose words are these?
Do you have the right to use other people's words?
I say it's easy after actually implementing it.
Understood. If you deny it like that, it will be impossible for me to have any further relationship with you.
That is correct. The implementation is full of deadlocks. This is a warning to everyone. Using the current implementation with RenderDoc many times will destroy the CPU registry and thus the hardware itself. If you have the money, please give it a try.
This is disrespectful to Blender's previous contributors.
I'm sure the average programmer tries to make sure that resources are always static when outside of Renderpass.
GPUModule has been OpenGL = proprietary. It's the artificial programmers who prefer to rely on parts that aren't OpenSource.
Artificial intelligence continues to infringe on copyright and art becomes a watered-down jellyfish. That's just the paradigm of the times.
However, even if the word art is lost, its content is not.
Understood. Let's close.