GPU: Add explicit API to sync storage buffer back to host #113456

Merged
Clément Foucault merged 3 commits from Jason-Fielder/blender:MetalStorageBufferHostSync2 into main 2023-10-20 17:04:45 +02:00
Member

This PR introduces GPU_storagebuf_sync_to_host as an explicit routine to
flush GPU-resident storage buffer memory back to the host within the
GPU command stream.

The previous implementation relied on implicit synchronization of
resources using OpenGL barriers, which does not match the
paradigm of explicit APIs, where individual resources may need
to be tracked.

This patch ensures GPU_storagebuf_read can be called without
stalling the GPU pipeline while work finishes executing. There are
two possible use cases:

  1. If GPU_storagebuf_read is called AFTER an explicit call to
    GPU_storagebuf_sync_to_host, the read will be synchronized.
    If the dependent work is still executing on the GPU, the host
    will stall until GPU work has completed and results are available.

  2. If GPU_storagebuf_read is called WITHOUT an explicit call to
    GPU_storagebuf_sync_to_host, the read will be asynchronous
    and whatever memory is visible to the host at that time will be used.
    (This is the same as assuming a sync event has already been signalled.)
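As a minimal sketch of these two cases (the header path, buffer size, and the elided compute dispatch are placeholders for illustration, not code from this patch):

```cpp
#include "GPU_storage_buffer.h"

/* Create a storage buffer that GPU work will write into. */
GPUStorageBuf *ssbo = GPU_storagebuf_create(sizeof(float) * 1024);

/* ... bind `ssbo` and dispatch the compute shader that writes into it ... */

/* Case 1: enqueue the host sync immediately after the producing GPU
 * commands; the flush back to host-visible memory then happens
 * asynchronously on the GPU timeline. */
GPU_storagebuf_sync_to_host(ssbo);

/* ... unrelated CPU/GPU work can overlap here ... */

/* Synchronized read: the host stalls only if the dependent GPU work has
 * not finished yet. Omitting the sync call above gives case 2: an
 * unsynchronized read of whatever memory is currently host-visible. */
float results[1024];
GPU_storagebuf_read(ssbo, results);

GPU_storagebuf_free(ssbo);
```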

This patch also addresses a gap in the Metal implementation, where
read support for GPU-only storage buffers was missing.
The read routine now uses a staging buffer to copy results if no
host-visible buffer is available.

Reading from a GPU-only storage buffer will always stall
the host: results cannot be pre-flushed, because no
host-resident buffer is available.
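As a rough illustration of that staging path, a hypothetical metal-cpp sketch (function and variable names are invented, not the actual Metal backend code):

```cpp
#include <Metal/Metal.hpp>
#include <cstring>

/* Copy a GPU-only (private storage mode) buffer into a host-visible staging
 * buffer, wait for the GPU, then read the contents on the host. */
void read_gpu_only_buffer(MTL::Device *device,
                          MTL::CommandQueue *queue,
                          MTL::Buffer *gpu_only_ssbo,
                          void *dst,
                          size_t size)
{
  /* Host-visible staging buffer that receives the copy. */
  MTL::Buffer *staging = device->newBuffer(size, MTL::ResourceStorageModeShared);

  MTL::CommandBuffer *cmd = queue->commandBuffer();
  MTL::BlitCommandEncoder *blit = cmd->blitCommandEncoder();
  blit->copyFromBuffer(gpu_only_ssbo, 0, staging, 0, size);
  blit->endEncoding();
  cmd->commit();

  /* This is why GPU-only reads always stall: there is no host-resident
   * buffer whose contents could have been pre-flushed earlier. */
  cmd->waitUntilCompleted();

  std::memcpy(dst, staging->contents(), size);
  staging->release();
}
```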

Authored by Apple: Michael Parkin-White

Jason Fielder added 1 commit 2023-10-09 18:04:28 +02:00
Jason Fielder requested review from Clément Foucault 2023-10-09 18:04:48 +02:00
Jason Fielder requested review from Jeroen Bakker 2023-10-09 18:04:58 +02:00
Clément Foucault requested changes 2023-10-13 17:05:14 +02:00
Clément Foucault left a comment
Member

Looks good. I would just add a note on the barrier description saying it is usually better to call `GPU_storagebuf_sync_to_host` instead of the barrier.
@@ -1328,6 +1328,8 @@ void ShadowModule::set_view(View &view)
shadow_multi_view_.compute_procedural_bounds();
statistics_buf_.current().sync_to_host();

Shouldn't we call this just before reading the content? This seems a bit out of place.
First-time contributor

Apologies if this is poorly named; the intent is to make synchronization of data back to the host asynchronous, as part of the GPU command stream. If this call is inserted directly before the host read, that is where host stalls can come from: the GPU only receives the instruction that data needs to be transferred right before it is needed, meaning the host may need to wait for this sync to complete.

By explicitly flushing the data back to the host as part of the GPU command stream, once the operations have finished executing on the GPU the data will be "ready" for consumption by the CPU immediately. So if this data is accessed at a later time, it will be ready as soon as the GPU has finished working on it, rather than being copied at the time the host needs it.

On Apple Silicon/UMA, this would mean just ensuring caches are flushed and operations have completed before read. For discrete GPU systems, this would mean that the memory has been wired to host addressable memory.

The caveat with OpenGL is that this data sync may have already been implicitly inserted by the driver based on the memory flag type, whereas with explicit APIs, flushing of resources needs to be requested explicitly.

In Vulkan, this would likely be the insertion of a pipeline barrier into the command stream, using a VkBufferMemoryBarrier with a destination access mask of HOST_READ_BIT; it still often makes sense for this to follow directly after the command which modified the memory that is to be read later.
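For concreteness, a sketch of such a barrier (illustrative only; this is not Blender's Vulkan backend code):

```cpp
#include <vulkan/vulkan.h>

/* Make compute-shader writes to `buffer` available to host reads; recorded
 * right after the producing dispatch rather than just before the read. */
void record_sync_to_host(VkCommandBuffer cmd, VkBuffer buffer)
{
  VkBufferMemoryBarrier barrier = {};
  barrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
  barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
  barrier.dstAccessMask = VK_ACCESS_HOST_READ_BIT;
  barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
  barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
  barrier.buffer = buffer;
  barrier.offset = 0;
  barrier.size = VK_WHOLE_SIZE;

  vkCmdPipelineBarrier(cmd,
                       VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, /* producer */
                       VK_PIPELINE_STAGE_HOST_BIT,           /* consumer */
                       /*dependencyFlags=*/0,
                       0, nullptr,  /* global memory barriers */
                       1, &barrier, /* buffer memory barriers */
                       0, nullptr); /* image memory barriers */
}
```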

@@ -55,0 +65,4 @@
*
* Otherwise, this command is unsynchronized and will return current visible storage buffer
* contents immediately.
* Alternatively, use appropriate barrier or GPU_finish before reading.

So that means that using the barrier is still a viable option, just that it is slower because it is equivalent to a `GPU_finish` call on Metal. Am I correct?
First-time contributor

May need to alter this comment, as this is likely where synchronization primitives diverge between APIs. The barrier API makes sense for defining dependencies that exist solely on the GPU (e.g. MTLFence/MTLEvent cover these); however, the `GPU_memory_barrier` function does not translate nicely to cases where the host and GPU need to be synchronized with each other.

The comment here would instead refer to the coherency of the data: the memory barriers would not guarantee host/GPU sync, but they can ensure that the GPU has finished updating data by a given point. But yeah, I think it makes sense to evolve this, and perhaps use a `sync_to_host` pattern for resources being read by the CPU.

But happy for your thoughts on this.

Jason Fielder added 1 commit 2023-10-18 19:34:04 +02:00
Jason Fielder added 1 commit 2023-10-18 19:58:12 +02:00
Clément Foucault approved these changes 2023-10-20 17:03:05 +02:00
Clément Foucault merged commit 1b0ddfa6cb into main 2023-10-20 17:04:45 +02:00
Reference: blender/blender#113456