WIP: Interleaved slices for better work distribution with a Multi-GPU setup #110348

William Leeson · 2023-07-21T15:42:20+02:00

Why

To improve the distribution of work between GPUs by giving each one a set of work that should take roughly the same amount of time.

What

This patch adds an option called "interleaved slices" to the Cycles performance category. It splits the workload by giving each device a set of scan lines such that the device with the smallest weight $w_{smallest}$ gets 1 scan line and each other device gets $n_i = w_i/w_{smallest}$ scan lines. The scan lines for each device are interleaved such that the first device takes the first $n_0$, the second gets the next $n_1$, and so on. The set of scan lines is reassigned each time the weights change.

This pull request is based on the one in [108147](https://projects.blender.org/blender/blender/pulls/108147). I could not find a way to change the branch being merged from, so I started a new one.

I put some performance statistics [here](https://docs.google.com/spreadsheets/d/1eG5_AvD_tAY-wJLkEEkZNQcgZYqS5PpSkmc4RNFrdXo/edit?usp=sharing).
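The split described above can be sketched roughly as follows. This is a minimal illustration and not the code from this patch: the helper names `slice_sizes_from_weights` and `assign_scanlines` are hypothetical, and rounding $w_i/w_{smallest}$ to the nearest integer is an assumption, since the description does not say how fractional results are handled.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

/* Hypothetical helper: scan lines per device in one interleave round.
 * The device with the smallest weight gets 1 line, the others get
 * n_i = w_i / w_smallest (rounded to nearest, which is an assumption). */
std::vector<int> slice_sizes_from_weights(const std::vector<double> &weights)
{
  assert(!weights.empty());
  const double smallest = *std::min_element(weights.begin(), weights.end());
  assert(smallest > 0.0);
  std::vector<int> sizes;
  for (const double w : weights) {
    sizes.push_back(std::max(1, static_cast<int>(std::lround(w / smallest))));
  }
  return sizes;
}

/* Hypothetical helper: owning device index for every scan line of the image.
 * Device 0 takes the first n_0 lines, device 1 the next n_1, and so on,
 * repeating until the image height is covered. */
std::vector<int> assign_scanlines(const std::vector<int> &sizes, const int height)
{
  assert(height >= 0);
  /* At least one device must own rows, otherwise the loop below never ends. */
  assert(std::accumulate(sizes.begin(), sizes.end(), 0) > 0);
  std::vector<int> owner;
  owner.reserve(static_cast<size_t>(height));
  while (static_cast<int>(owner.size()) < height) {
    for (size_t device = 0; device < sizes.size(); ++device) {
      for (int i = 0; i < sizes[device] && static_cast<int>(owner.size()) < height; ++i) {
        owner.push_back(static_cast<int>(device));
      }
    }
  }
  return owner;
}
```

With weights 1.0 and 2.0, the first device gets 1 scan line per round and the second gets 2, so the rows of the two devices alternate down the image in a 1, 2, 1, 2 pattern.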

Interleaved Slices Background Render Comparison.png

13 KiB

Interleaved Slices Forground Rendering Comparison.png

12 KiB

William Leeson added 65 commits 2023-07-21 15:42:32 +02:00

127066dd99 Add work_sets for better multi-gpu scaling

4e4d5f3ea0 Only realloc the buffer if it is bigger

c5da8e6429 Request transfers for all slices then sync only once

84fa6cfc51 Merge branch 'upstream_main' into work_sets

d21cd0bc2e Merge branch 'upstream_main' into work_sets

edfb257495 Add the ability to allocate pinned memory

b8fbf9dfcb Allocate path trace counters in pinned memory to allow bg transfer

5f5595caa7 Allocate render buffers as pinned so as to be able to bg transfer

790e12a444 FIX: Adds missing device import for MacOS bvh.mm

9c7f5266a6 Merge branch 'upstream_main' into work_sets

4c1d196af8 Merge branch 'upstream_main' into work_sets

d8bd1c1eb8 Merge branch 'upstream_main' into work_sets

8c4db51deb FIX: Don't read kernel data for the KernelLightTreeEmitter if the index is -1

cbdb1379e2 Use a single master buffer to hold all the slices

This replaces n slice buffers with a single master buffer and n
slices which reference into the master buffer. This allows a
single copy to upload or download the data.
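The idea of one master buffer with slices that reference into it could look roughly like this. It is a sketch under assumptions, not the patch's data structure: `SliceView` and `MasterBuffer` are hypothetical names, and plain host memory stands in for the device buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

/* Hypothetical view of one slice inside the shared master buffer. */
struct SliceView {
  size_t offset; /* First element of the slice in the master buffer. */
  size_t size;   /* Number of elements in the slice. */
};

/* Hypothetical container: one allocation, n slices that index into it. */
struct MasterBuffer {
  std::vector<float> data;
  std::vector<SliceView> slices;

  MasterBuffer(const std::vector<int> &slice_heights, const int width, const int pass_stride)
  {
    assert(width > 0 && pass_stride > 0);
    size_t offset = 0;
    for (const int height : slice_heights) {
      assert(height >= 0);
      const size_t size = static_cast<size_t>(height) * width * pass_stride;
      slices.push_back({offset, size});
      offset += size;
    }
    /* One allocation sized for all slices together. */
    data.assign(offset, 0.0f);
  }

  /* Pointer to the start of a slice; no per-slice allocation exists. */
  float *slice_data(const size_t index)
  {
    assert(index < slices.size());
    return data.data() + slices[index].offset;
  }
};
```

Because every slice lives in the same allocation, transferring `data` once moves all slices, which is the single upload or download the commit message refers to.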

9565aaa6fe FIX: Denoise buffer when more than one work_set

Previously, when there was more than one worker (aka device) and
multiple work_sets, the denoising was skipped.

6050b49ee4 FIX: Use remaining rows in the last work item

The remaining rows were not added to the last work item as it was
detected incorrectly.

47e1cd75cb Merge branch 'upstream_main' into work_sets

7a3135ec98 FIX: Set pinned to false when not allocating pinned memory

f0185ed234 FIX: Don't copy 0 height slices to display

Attempting to copy a zero height slice resulted in CUDA errors.
Also switches back to using the master buffer when zeroing all
the slices.

bd669026e3 Skip over 0 height slices

76b0775361 Clean up code

0a716b2f68 Put wavefront tracing counters in pinned memory.

6a73c7d345 Merge branch 'upstream_main' into work_sets

ef5d66e1bf FIX: Buffers must always be cleared

2ad2d46aeb Removes performance penalty due to rebalance for having slices.

f283a9beb4 Merge branch 'upstream_main' into work_sets

0e0056707e FIX: Master buffer is now copied directly using its buffer

CPU data was not being copied properly due to the buffer not
having the correct parameters.

69e04ddeb3 Merge branch 'upstream_main' into work_sets

35335e4932 Adds NVTX markers for viewing program flow and execution

098353947d Add some markers to visualise the render_pipeline

c836ee5e99 Merge branch 'upstream_main' into work_sets

2bbf552f09 Render all slices in one go

Previously the render_samples iterated over all the WorkSets.
However, this was not ideal due to overheads and was not good at
keeping the GPU busy. Now info is passed in the WorkTile to enable
the GPU to render all the slices in one pass.

7d1379f95d Path rng_hash uses image pixel coordinates

Previously it was using the slice coordinates
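To illustrate the difference between slice coordinates and image coordinates, here is a small sketch that maps a row in a device's packed buffer back to its scan line in the full image. The `InterleaveLayout` struct and `image_y_from_buffer_y` function are hypothetical, and the periodic layout they assume (each device owns a fixed block of rows in every round) is an assumption rather than something taken from the patch.

```cpp
#include <cassert>

/* Hypothetical description of one device's share of an interleaved layout. */
struct InterleaveLayout {
  int period;       /* Scan lines covered by one round over all devices. */
  int device_start; /* First scan line of this device inside a round. */
  int device_rows;  /* Scan lines this device owns per round. */
};

/* Map a row in the device's packed buffer to its scan line in the full image. */
int image_y_from_buffer_y(const InterleaveLayout &layout, const int buffer_y)
{
  assert(layout.device_rows > 0 && buffer_y >= 0);
  assert(layout.device_start + layout.device_rows <= layout.period);
  const int round = buffer_y / layout.device_rows;
  const int row_in_slice = buffer_y % layout.device_rows;
  return round * layout.period + layout.device_start + row_in_slice;
}
```

Hashing on the image row returned here, rather than on `buffer_y`, keeps the random sequence of a pixel independent of how the scan lines happen to be divided between devices.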

cf298a1d7e FIX: Update slice buffer offset into the master buffer on change

The slice buffer offsets into the master buffer were only updated
when the master buffer was reallocated. This ignored the fact that
the resolution scaler could resize the buffer and the slices even
though the master buffer was not reallocated.

7460500a81 FIX: RenderBuffers state now set up correctly for NODE

The parameters for slices added to the RenderBuffers were not set up
correctly for use with Nodes. This adds the necessary setup code.

Also, switched the render buffer to not use pinned memory.

ebfddd1c1a Merge branch 'upstream_main' into work_sets

67116b0844 FIX: Fixes debug build

The debug build was failing because dna_type_offsets.h was not
always generated when building. This also affected the release
build, though more rarely.

d463f4d194 Pre-calculate master buffer size

7d0cb956d4 Use slices buffer params to copy data to render buffers

240cad9fb0 Make Copy from render buffers use slice buffer params

4b4fbf6ec0 Clean up code

7c4f3aa0d4 Remove the need for using the WorkSet size()

Adds device_scale_factor_ member variable to PathTraceWork for
iterating over the slices.

7809f5e275 Use a single kernel launch call to perform the adaptive sampling with slices

7f11e5f38d Cryptomatte uses only 1 kernel launch on GPU

d9442b3969 Adaptive sampling uses only one parallel_for on CPU

Also fixes an issue where the render_samples_impl was getting the
incorrect height.

9ba33c795d Remove workset from denoise update

295652a1b9 Merge branch 'upstream_main' into work_sets

481d1c4423 FIX: MacOS compiler cannot initialise dynamic arrays

Replace initializer with for loop.

943f9c3e59 Merge branch 'upstream_main' into work_sets

4e424d384f Make PassAccessor and master_buffers_ aware of slice structure

Adds the correct BufferParams to the master_buffers and also
changes the code of the PassAccessors for both CPU and GPU to copy
the images according to the slice structure.

dcb9476e9c FIX: Stop (get/set)_render_tile_pixels using work_sets

Sets the master_buffers as the effective_buffer_params so they
only iterate once instead of per buffer, as the device_pointers
cannot be used as regular pointers.

3203ca5c19 Merge branch 'upstream_main' into work_sets

b9a17d2d31 Merge branch 'work_sets' of projects.blender.org:leesonw/blender-cluster into work_sets

797e2a7b11 Merge branch 'upstream_main' into work_sets

fdef5c310c Merge branch 'upstream_main' into work_sets

b8215503d5 Use the smallest device weight to choose the slice sizes.

989dd1d3ef FIX: Correctly account for partial slices

The last slices that are not full sized need to take the current_y
into account to determine how many scanlines are left.
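A minimal sketch of that calculation, assuming a slice simply ends where the image ends. The helper name `slice_height_at` is hypothetical and the actual patch may track the remaining rows differently.

```cpp
#include <algorithm>
#include <cassert>

/* Hypothetical helper: height of the slice that starts at current_y.
 * A full slice has slice_size rows; the last one may be shorter. */
int slice_height_at(const int current_y, const int slice_size, const int image_height)
{
  assert(slice_size > 0 && current_y >= 0 && image_height >= 0);
  /* Rows left below current_y, never negative. */
  const int remaining = std::max(0, image_height - current_y);
  return std::min(slice_size, remaining);
}
```

For a 10 row image and 4 row slices, a slice starting at row 8 is only 2 rows tall, and a slice starting at or past row 10 has height 0.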

3e93532b24 Remove device_scale_factor from copy to/from routines.

5e532bb022 FIX: Baking now reads the correct scanlines into the RenderBuffers

The scanlines were just copied serially, without taking the slices
into account. This is now corrected.

514f8a7990 FIX: For Baking don't use interleaved slices

For some reason, baking roughness does not currently work with
the interleaved scanlines between the devices. So for now it
reverts to just using 2 big slices.

155fd0991f Remove temp fix while working on a real solution

2d5198beb4 Merge branch 'upstream_main' into work_sets_similar

b86a443aaa FIX: padding takes interleaved scanlines into account

Previously padding did not use the interleaved scanlines to pad
the data and instead just wrote to the first n scanlines. Now it
iterates over the correct scanlines updating the correct set
based on the data in the BufferParams.

c703a7c8ac Change device_scale_factor to interleaved_slices

Replace the device_scale_factor with a check box to enable or
disable the interleaved slices.

2d05993da0 Clean up code and remove debug code

William Leeson added 1 commit 2023-07-21 16:31:24 +02:00

a6e4771b27 FIX: Devices cannot be assigned more rows than scanlines

If the compute difference was huge, it was possible for devices to
be assigned way too many rows, sometimes more than there are devices.
This prevents this from happening by ensuring devices get at least
1 row.
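One way such a guard could be written is sketched below. This is an assumption about the shape of the fix, not the code from the commit: `clamp_slice_sizes` is a hypothetical helper that gives every device at least 1 row and caps each device so that the other devices can still get one row each.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

/* Hypothetical guard: keep each device's row count between 1 and the number
 * of rows that still leaves one scan line for every other device. */
std::vector<int> clamp_slice_sizes(std::vector<int> sizes, const int height)
{
  assert(!sizes.empty());
  const int num_devices = static_cast<int>(sizes.size());
  assert(height >= num_devices);
  const int max_rows = height - (num_devices - 1);
  for (int &rows : sizes) {
    rows = std::max(1, std::min(rows, max_rows));
  }
  return sizes;
}
```

With two devices, a 100 row image and requested sizes of 1 and 500, the second device is cut back to 99 rows so both devices still fit inside the image.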

William Leeson added 1 commit 2023-07-21 16:34:38 +02:00

87a299d0ef Clean up code and remove old debug checks

William Leeson changed title from ~~WIP: Work sets for better work distribution with a Multi-GPU setup~~ to WIP: Interleaved slices for better work distribution with a Multi-GPU setup

2023-07-21 16:35:50 +02:00

William Leeson requested review from Brecht Van Lommel 2023-07-21 16:36:13 +02:00

William Leeson requested review from Sergey Sharybin 2023-07-21 16:37:14 +02:00

William Leeson added the

 @ -978,6 +978,12 @@ class CyclesRenderSettings(bpy.types.PropertyGroup):
         min=8, max=8192,
     )
     interleaved_slices: BoolProperty(

 @ -46,3 +46,3 @@
                                                       destination.num_components;
   parallel_for(0, buffer_params.window_height, [&](int64_t y) {
   /* Calculate how many full plus partial slices there are */

 @ -49,0 +65,4 @@
   /* Copy over each slice to the destination */
   parallel_for(0, slices, [&](int slice) {
   //for(int slice = 0;slice < slices;++slice) {

 @ -254,0 +298,4 @@
     slice_sizes[largest_weight] += leftover_scanlines;
     slice_stride++;
   } else if(leftover_scanlines < 0) {
     VLOG_WARNING << "#######Used to many scanlines";

 @ -254,0 +301,4 @@
     VLOG_WARNING << "#######Used to many scanlines";
   }
   VLOG_INFO << "===================SLICE allocatable:" << allocatable_slices << " fixed:"<< fixed_slices <<  "================";

 @ -412,2 +463,4 @@
   });
   const double work_time = time_dt() - start_time;
   VLOG(3) << "render time total for frame: "

 @ -78,2 +78,4 @@
       kg, state, render_buffer, scheduled_sample, tile->sample_offset);
   /* Map the buffer coordinates to the image coordinates */
   int tile_y = y - tile->slice_start_y;
