GPU: Add explicit API to sync storage buffer back to host #113456

Merged
Clément Foucault merged 3 commits from Jason-Fielder/blender:MetalStorageBufferHostSync2 into main 2023-10-20 17:04:45 +02:00
Member

This PR introduces GPU_storagebuf_sync_to_host as an explicit routine to
flush GPU-resident storage buffer memory back to the host within the
GPU command stream.

The previous implementation relied on implicit synchronization of
resources using OpenGL barriers, which does not match the
paradigm of explicit APIs, where individual resources may need
to be tracked.

This patch ensures GPU_storagebuf_read can be called without
stalling the GPU pipeline while work finishes executing. There are
two possible use cases:

  1. If GPU_storagebuf_read is called AFTER an explicit call to
    GPU_storagebuf_sync_to_host, the read will be synchronized.
    If the dependent work is still executing on the GPU, the host
    will stall until GPU work has completed and results are available.

  2. If GPU_storagebuf_read is called WITHOUT an explicit call to
    GPU_storagebuf_sync_to_host, the read will be asynchronous
    and whatever memory is visible to the host at that time will be used.
    (This is the same as assuming a sync event has already been signalled.)
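As a minimal sketch of these two cases (the header path, buffer size, and the elided compute dispatch are placeholders for illustration, not code from this patch):

```cpp
#include "GPU_storage_buffer.h"

/* Create a storage buffer that GPU work will write into. */
GPUStorageBuf *ssbo = GPU_storagebuf_create(sizeof(float) * 1024);

/* ... bind `ssbo` and dispatch the compute shader that writes into it ... */

/* Case 1: enqueue the host sync immediately after the producing GPU
 * commands; the flush back to host-visible memory then happens
 * asynchronously on the GPU timeline. */
GPU_storagebuf_sync_to_host(ssbo);

/* ... unrelated CPU/GPU work can overlap here ... */

/* Synchronized read: the host stalls only if the dependent GPU work has
 * not finished yet. Omitting the sync call above gives case 2: an
 * unsynchronized read of whatever memory is currently host-visible. */
float results[1024];
GPU_storagebuf_read(ssbo, results);

GPU_storagebuf_free(ssbo);
```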

This patch also addresses a gap in the Metal implementation, where
read support for GPU-only storage buffers was missing.
The read routine now uses a staging buffer to copy results if no
host-visible buffer is available.

Reading from a GPU-only storage buffer will always stall
the host: results cannot be pre-flushed, because no
host-resident buffer is available.
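As a rough illustration of that staging path, a hypothetical metal-cpp sketch (function and variable names are invented, not the actual Metal backend code):

```cpp
#include <Metal/Metal.hpp>
#include <cstring>

/* Copy a GPU-only (private storage mode) buffer into a host-visible staging
 * buffer, wait for the GPU, then read the contents on the host. */
void read_gpu_only_buffer(MTL::Device *device,
                          MTL::CommandQueue *queue,
                          MTL::Buffer *gpu_only_ssbo,
                          void *dst,
                          size_t size)
{
  /* Host-visible staging buffer that receives the copy. */
  MTL::Buffer *staging = device->newBuffer(size, MTL::ResourceStorageModeShared);

  MTL::CommandBuffer *cmd = queue->commandBuffer();
  MTL::BlitCommandEncoder *blit = cmd->blitCommandEncoder();
  blit->copyFromBuffer(gpu_only_ssbo, 0, staging, 0, size);
  blit->endEncoding();
  cmd->commit();

  /* This is why GPU-only reads always stall: there is no host-resident
   * buffer whose contents could have been pre-flushed earlier. */
  cmd->waitUntilCompleted();

  std::memcpy(dst, staging->contents(), size);
  staging->release();
}
```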

Authored by Apple: Michael Parkin-White

Jason Fielder added 1 commit 2023-10-09 18:04:28 +02:00
Jason Fielder requested review from Clément Foucault 2023-10-09 18:04:48 +02:00
Jason Fielder requested review from Jeroen Bakker 2023-10-09 18:04:58 +02:00
Clément Foucault requested changes 2023-10-13 17:05:14 +02:00
Clément Foucault left a comment
Member

Looks good. I would just add a note on the barrier description saying it is usually better to call `GPU_storagebuf_sync_to_host` instead of the barrier.
@@ -1328,6 +1328,8 @@ void ShadowModule::set_view(View &view)
shadow_multi_view_.compute_procedural_bounds();
statistics_buf_.current().sync_to_host();

Shouldn't we call this just before reading the content? This seems a bit out of place.
First-time contributor

Apologies if this is poorly named; the intent is to make synchronization of data back to the host asynchronous, as part of the GPU command stream. If this call is inserted directly before the host read, that is where host stalls can come from: the GPU only receives the instruction that data needs to be transferred right before it is needed, meaning the host may need to wait for this sync to complete.

By explicitly flushing the data back to the host as part of the GPU command stream, once the operations have finished executing on the GPU the data will be "ready" for consumption by the CPU immediately. So if this data is accessed at a later time, it will be ready as soon as the GPU has finished working on it, rather than being copied at the time the host needs it.

On Apple Silicon/UMA, this would mean just ensuring caches are flushed and operations have completed before read. For discrete GPU systems, this would mean that the memory has been wired to host addressable memory.

The caveat with OpenGL is that this data sync may have already been implicitly inserted by the driver based on the memory flag type, whereas with explicit APIs, flushing of resources needs to be requested explicitly.

In Vulkan, this would likely be the insertion of a pipeline barrier into the command stream, using a VkBufferMemoryBarrier with a destination access mask of HOST_READ_BIT; it still often makes sense for this to follow directly after the command which modified the memory that is to be read later.
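For concreteness, a sketch of such a barrier (illustrative only; this is not Blender's Vulkan backend code):

```cpp
#include <vulkan/vulkan.h>

/* Make compute-shader writes to `buffer` available to host reads; recorded
 * right after the producing dispatch rather than just before the read. */
void record_sync_to_host(VkCommandBuffer cmd, VkBuffer buffer)
{
  VkBufferMemoryBarrier barrier = {};
  barrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
  barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
  barrier.dstAccessMask = VK_ACCESS_HOST_READ_BIT;
  barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
  barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
  barrier.buffer = buffer;
  barrier.offset = 0;
  barrier.size = VK_WHOLE_SIZE;

  vkCmdPipelineBarrier(cmd,
                       VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, /* producer */
                       VK_PIPELINE_STAGE_HOST_BIT,           /* consumer */
                       /*dependencyFlags=*/0,
                       0, nullptr,  /* global memory barriers */
                       1, &barrier, /* buffer memory barriers */
                       0, nullptr); /* image memory barriers */
}
```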

@@ -55,0 +65,4 @@
*
* Otherwise, this command is unsynchronized and will return current visible storage buffer
* contents immediately.
* Alternatively, use appropriate barrier or GPU_finish before reading.

So that means that using the barrier is still a viable option, just that it is slower because it is equivalent to a `GPU_finish` call on Metal. Am I correct?
First-time contributor

May need to alter this comment, as this is likely where synchronization primitives diverge between APIs. The barrier API makes sense for defining dependencies that exist solely on the GPU (e.g. MTLFence/MTLEvent cover these); however, the `GPU_memory_barrier` function does not translate nicely to cases where the host and GPU need to be synchronized with each other.

The comment here would instead refer to the coherency of the data: the memory barriers would not guarantee host/GPU sync, but they can ensure that the GPU has finished updating data by a given point. But yeah, I think it makes sense to evolve this, and perhaps use a `sync_to_host` pattern for resources being read by the CPU.

But happy for your thoughts on this.

Jason Fielder added 1 commit 2023-10-18 19:34:04 +02:00
Jason Fielder added 1 commit 2023-10-18 19:58:12 +02:00
Clément Foucault approved these changes 2023-10-20 17:03:05 +02:00
Clément Foucault merged commit 1b0ddfa6cb into main 2023-10-20 17:04:45 +02:00
Reference: blender/blender#113456