Refactor geometry update for better parallelization and upload performance #105403
Reference: blender/blender#105403
Why
Cycles' geometry upload and preparation can be refactored to be easier to understand and more efficient. By creating a single parallel_for per device, instead of one thread per copy operation, the overhead of handling threading is reduced, improving performance. Splitting the host-side work from the device-side work also makes it easier to distribute the workload. With the current code, some calculations are performed twice or unnecessarily: they are only needed for specific tasks but are run regardless, because it is hard to tell from the code when they are required. This refactor also allows for further potential optimizations in the future.
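The threading change described above can be illustrated with a minimal sketch (not the Cycles implementation; `FakeDevice` and `upload_all` are hypothetical names, and plain `std::thread` stands in for the task system): one worker per device processes all of that device's copies, so threading overhead scales with the number of devices rather than the number of copy operations.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for a render device; `copies_done` counts the
// upload operations performed on its behalf.
struct FakeDevice {
  std::atomic<int> copies_done{0};
};

// Launch exactly one worker per device; each worker loops over that
// device's pending copy operations. Thread management cost is
// O(num_devices), not O(num_devices * num_copies).
inline void upload_all(std::vector<FakeDevice> &devices, int copies_per_device)
{
  std::vector<std::thread> workers;
  workers.reserve(devices.size());
  for (FakeDevice &dev : devices) {
    workers.emplace_back([&dev, copies_per_device]() {
      for (int i = 0; i < copies_per_device; i++) {
        dev.copies_done.fetch_add(1); /* stand-in for one copy operation */
      }
    });
  }
  for (std::thread &t : workers) {
    t.join();
  }
}
```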
What
This results in the program flow looking like the following:
----HOST SIDE CALCULATIONS AND SETUP------
--- only run on the first device ---
----DEVICE SPECIFIC BEGIN----
----DEVICE SPECIFIC END----
----HOST SIDE BELOW THIS POINT
The original flow and the new flow are shown in the files below. To achieve this, a DeviceScene is created for each device. However, as this only updates the geometry device update (and not all uses of the DeviceScene), the old DeviceScene needs to remain so that the other buffers can still be handled by the multi-device. The host-side pointers are shared by all the DeviceScenes, so they only need to be packed once. This is done using a method that updates the host pointer of one DeviceScene from an existing DeviceScene.
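The host-pointer sharing described above can be sketched as follows. This is not the Cycles API; `HostBuffer`, `SubDeviceScene`, `pack`, and `assign_host_from` are illustrative names for the idea that every per-device scene references the same packed host buffer, so the packing work happens only once.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Illustrative: one packed host-side buffer shared by all devices.
struct HostBuffer {
  std::vector<float> data;
};

struct SubDeviceScene {
  std::shared_ptr<HostBuffer> host;

  // Pack the host data (expensive; should happen once overall).
  void pack()
  {
    host = std::make_shared<HostBuffer>();
    host->data = {1.0f, 2.0f, 3.0f};
  }

  // Share the already-packed host pointer from another scene, mirroring
  // the "update the host pointer of one DeviceScene from an existing
  // DeviceScene" method described in the PR text.
  void assign_host_from(const SubDeviceScene &src)
  {
    host = src.host;
  }
};
```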
I have created a spreadsheet here comparing this patch versus Blender main. A chart of the results so far is available in the images below. I have also created another spreadsheet here comparing it with both the original Blender implementation and the alternative method #107552. This patch is "geom_update" and the other is "parallel_copy" in the graphs.
Changed title from WIP:Refactor geometry update for better parallelization and upload performance to WIP: Refactor geometry update for better parallelization and upload performance.
@brecht Ok, I think this is close to ready. It would be good to get an overall impression, even if not a complete review, when you have time. I'll hopefully have performance data sometime this week.
Changed title from WIP: Refactor geometry update for better parallelization and upload performance to Refactor geometry update for better parallelization and upload performance.
I see a bunch of commented code and benchmarking utilities in business logic.
Changed title from Refactor geometry update for better parallelization and upload performance to WIP: Refactor geometry update for better parallelization and upload performance.
I'm not going to have time this week, but will look next week.
@blender-bot build
This part I'm not sure about.
Right now we can build the BVH once for multi-device rendering on CUDA, HIP, Metal, OneAPI. With native hardware ray-tracing this may become impossible, but that usually comes with GPU accelerated tree building so it's not so bad.
This performance regression I would rather avoid.
See the buildbot log for build errors, build warnings and test failures to be fixed.
I did not go through every line of code in `geometry.cpp` and `geometry_additions.cpp`, but overall the code could be split into `geometry_attributes.cpp`, `geometry_bvh.cpp`, `geometry_mesh.cpp`, .. or something along those lines. And then each of those can handle as much as possible related to that part of the data, with consistent function names for the various stages.

@ -887,7 +887,6 @@ typedef enum {
typedef struct _nvrtcProgram* nvrtcProgram;
Unrelated whitespace change.
@ -681,7 +682,6 @@ static int cuewNvrtcInit(void) {
return result;
}
Unrelated whitespace change.
@ -68,6 +68,7 @@ typedef void* DynamicLibrary;
static DynamicLibrary cuda_lib;
static DynamicLibrary nvrtc_lib;
static DynamicLibrary nvtx_lib;
This probably comes from another patch and should not be part of this?
@ -125,6 +125,7 @@ void BlenderSession::create_session()
session = new Session(session_params, scene_params);
session->progress.set_update_callback(function_bind(&BlenderSession::tag_redraw, this));
session->progress.set_cancel_callback(function_bind(&BlenderSession::test_cancel, this));
session->progress.set_updates(!background);
This seems like a change to review as a separate PR?
@ -14,1 +13,4 @@
const Device *device_)
: BVH(params_, geometry_, objects_), device(device_)
{
// Resize the sub-bvh container to match the number of devices
Maybe there can be some abstraction at this level to avoid building the BVH multiple times when not needed?
@ -149,6 +149,15 @@ void CPUDevice::mem_copy_to(device_memory &mem)
}
}
void CPUDevice::mem_copy_to(device_memory &mem, size_t, size_t offset)
Silence unused argument warnings with `, size_t, size_t)`.
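The suggestion is to leave the ignored parameters unnamed in the overload that does not use them, which silences `-Wunused-parameter` without casts or attributes. A minimal sketch, with `device_memory_stub` as an illustrative stand-in type:

```cpp
#include <cassert>
#include <cstddef>

// Illustrative stand-in for device_memory.
struct device_memory_stub {
  int full_copies = 0;
};

// The CPU backend ignores size/offset for a full copy, so the
// parameters are left unnamed: no unused-parameter warning is emitted.
inline void mem_copy_to(device_memory_stub &mem, std::size_t /*size*/, std::size_t /*offset*/)
{
  mem.full_copies++; /* bookkeeping stands in for the actual copy */
}
```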
@ -31,0 +30,4 @@
int flags = CUEW_INIT_CUDA;
int cuew_result = cuewInit(flags);
This change does nothing (anymore?).
@ -548,3 +548,3 @@
const CUDAContextScope scope(this);
cuda_assert(cuMemcpyHtoD((CUdeviceptr)device_pointer, host_pointer, size));
cuda_assert(cuMemcpyHtoD((CUdeviceptr)device_pointer + offset,
Checking the definition of `CUdeviceptr` this seems to work, but it's not obvious just by looking at the code, or for all device types. Maybe better to use `reinterpret_cast<unsigned char *>()` in all device code, to make it clear?

@ -551,0 +554,4 @@
void CUDADevice::host_mem_alloc(size_t size, int aligment, void **p_mem) {
CUDAContextScope scope(this);
CUresult result = cuMemAllocHost(p_mem, size);
The CUDA docs for `cuMemAllocHost` say:

It's not clear to me that we should be using this for allocating the entire scene representation on the CPU? Seems like that could lead to Blender/Cycles running out of memory faster than before.
@ -551,0 +555,4 @@
void CUDADevice::host_mem_alloc(size_t size, int aligment, void **p_mem) {
CUDAContextScope scope(this);
CUresult result = cuMemAllocHost(p_mem, size);
if(result != CUDA_SUCCESS) {
This should not have custom error handling. We have `cuda_assert` for error reporting, so that an error is set in the device and reported to the user.

@ -566,8 +584,13 @@ void CUDADevice::mem_alloc(device_memory &mem)
void CUDADevice::mem_copy_to(device_memory &mem)
Can this code be deduplicated by calling `mem_copy_to(mem, mem.memory_size(), 0)`?

@ -568,3 +586,2 @@
if (mem.type == MEM_GLOBAL) {
global_free(mem);
global_alloc(mem);
if ((mem.device_size < mem.memory_size()) || (!mem.device_pointer)) {
This means memory allocations never decrease in size? Shouldn't it reallocate whenever the size changes?
@ -584,0 +607,4 @@
void CUDADevice::mem_copy_to(device_memory &mem, size_t size, size_t offset)
{
if (mem.type == MEM_GLOBAL) {
if ((mem.device_size < mem.memory_size()) || (!mem.device_pointer)) {
Same comment as above.
@ -766,0 +774,4 @@
}
}
void GPUDevice::generic_copy_to(device_memory &mem, size_t size, size_t offset)
Similar comment about code deduplication as above.
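Both review points above (forwarding the full copy to the ranged overload, and using `!=` so allocations can shrink) can be sketched together. This is an illustrative stand-in, not the CUDA backend; `DevMem` is a hypothetical type and the actual device copy is elided.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative stand-in for device_memory.
struct DevMem {
  std::vector<unsigned char> host;  // host-side data
  std::size_t device_size = 0;      // current device-side allocation
  std::size_t memory_size() const { return host.size(); }
};

// Ranged copy: reallocate on ANY size change (!=), so the device
// allocation also shrinks when the host data shrinks.
inline void mem_copy_to(DevMem &mem, std::size_t size, std::size_t offset)
{
  if (mem.device_size != mem.memory_size()) {
    mem.device_size = mem.memory_size(); /* stand-in for free + alloc */
  }
  (void)size;
  (void)offset; /* the device copy itself is elided in this sketch */
}

// Full copy deduplicated: just the ranged copy over the whole buffer.
inline void mem_copy_to(DevMem &mem)
{
  mem_copy_to(mem, mem.memory_size(), 0);
}
```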
@ -8,3 +8,3 @@
#include "bvh/params.h"
#include "scene/scene.h"
The intent for the device API was to not be aware of `Scene` or `DeviceScene`, though that is already broken somewhat. Ideally we'd avoid it entirely, but here at least it should not include `scene/scene.h` and instead use a `class DeviceScene;` forward declaration.

@ -217,2 +222,4 @@
return 0;
}
virtual device_ptr find_matching_mem(device_ptr key, Device * /*sub*/)
Add a comment explaining what this does.
@ -290,8 +300,11 @@ class Device {
friend class DeviceServer;
friend class device_memory;
virtual void host_mem_alloc(size_t size, int alignment, void **p_mem);
This can return
void *
instead of usingvoid **p_mem
.@ -369,17 +379,78 @@ template<typename T> class device_vector : public device_memory {
assert(data_elements > 0);
}
device_vector(Device *device,
TODO: what is this constructor for?
@ -41,3 +45,3 @@
/* Always add CPU devices at the back since GPU devices can change
* host memory pointers, which CPU uses as device pointer. */
SubDevice *sub;
SubDevice *sub = new SubDevice;
Use `unique_ptr` and `make_unique` instead of `new` in new code.

@ -49,2 +52,2 @@
devices.emplace_front();
sub = &devices.front();
devices.emplace_back(sub);
// find first CPU device and swop it with the new device
swop -> swap
@ -51,0 +54,4 @@
// to keep the CPU devices at the end of the vector.
int last = devices.size() - 1;
int o = last;
while ((o > 0) && (devices[o - 1]->device->info.type == DEVICE_CPU)) {
If we ensure the CPU is always the last device in the vector, there is no need to have a while loop searching for it.
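The invariant being suggested can be sketched as follows. This is illustrative only (`DeviceList`, `Sub`, and `num_cpu` are hypothetical names): if CPU devices always occupy the tail of the vector, new GPU devices can be inserted just before that tail without any search loop.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

enum class DevType { GPU, CPU };

struct Sub {
  DevType type = DevType::GPU;
};

// Keeps CPU devices at the back by construction, so no while loop is
// needed to find the insertion point.
struct DeviceList {
  std::vector<std::unique_ptr<Sub>> devices;
  std::size_t num_cpu = 0;  // CPU devices occupy the last num_cpu slots

  void add(DevType type)
  {
    auto sub = std::make_unique<Sub>();
    sub->type = type;
    if (type == DevType::CPU) {
      devices.push_back(std::move(sub)); /* CPUs go at the very back */
      num_cpu++;
    }
    else {
      /* Insert just before the CPU tail, without scanning. */
      devices.insert(devices.end() - num_cpu, std::move(sub));
    }
  }
};
```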
@ -1041,2 +637,2 @@
}
}
/* copy svm attributes to device */
dscene->attributes_map.copy_to_device_if_modified();
Looking at code like this, it feels like maybe `DeviceScene` should somehow have a list of the `device_memory` it contains. And then perhaps that could be used to have a function to transfer all modified memory, rather than manually doing it here.

@ -2088,3 +1487,1 @@
if (geom->is_mesh()) {
Mesh *mesh = static_cast<Mesh *>(geom);
mesh->subd_attributes.clear_modified();
size_t n_scenes = scene->dscenes.size();
n_scenes -> num_scenes is the convention.
@ -2091,0 +1487,4 @@
size_t n_scenes = scene->dscenes.size();
// Parallel upload the geometry data to the devices and
// calculate or refit the BVHs
tbb::parallel_for(size_t(0),
Use just `parallel_for`, not `tbb::parallel_for`. We abstract such things in case we want to override them.

@ -2091,0 +1489,4 @@
// calculate or refit the BVHs
tbb::parallel_for(size_t(0),
n_scenes,
[this,
Can we automatically capture all these variables instead of doing it manually?
@ -162,0 +185,4 @@
size_t *motion_vert_offsets;
};
struct AttributeSizes {
Can we avoid having these `AttributeSizes` and `GeometrySizes` structs? This feels like something we should be able to set on the `device_memory` in `DeviceScene` instead of tracking in a separate struct. Or maybe that's not a good idea because every device has its own `DeviceScene`?

@ -0,0 +1,1045 @@
/* SPDX-License-Identifier: Apache-2.0
This file needs a better name, or probably code needs to split across files in a different way. This doesn't seem like a logical grouping.
Did you forget to remove this file?
Yes, I have removed it now :-/
@ -0,0 +34,4 @@
/*
* Clears all tags used to indicate the the shader needs to be updated.
*/
void GeometryManager::clearShaderUpdateTags(Scene *scene)
Don't use camelCase function names.
@ -0,0 +121,4 @@
progress.set_status("Updating Mesh", "Computing attributes");
// SHOULD NOT ALLOC ONLY ALLOC IF MORE SPACE IS NEEDED
I guess this is a TODO comment to be resolved.
@ -660,11 +660,8 @@ void ObjectManager::device_update_transforms(DeviceScene *dscene, Scene *scene,
}
});
if (progress.get_cancel()) {
Why remove this?
In general we want to be able to cancel rendering in the middle of updates so users don't have to wait too long.
@ -36,0 +37,4 @@
* checks the progress for if a cancel has been requested and also
* the device to see if an error has occurred.
*/
bool Scene::progressErrorCheck(Progress &progress, Device *device) {
I'd name this `check_cancel_update` or something like that; `progressErrorCheck` to me reads as if it's only checking for errors.

@ -36,0 +39,4 @@
*/
bool Scene::progressErrorCheck(Progress &progress, Device *device) {
bool status = false;
if (!background && progress.get_updates()) {
Cancelling quickly is important for F12 renders as well, which are "background" in Cycles.
@ -104,1 +116,4 @@
{
/* Create a DeviceScene for each device */
device->foreach_device([this](Device *sub_device) {
DeviceScene *sub_dscene = new DeviceScene(sub_device);
Use `unique_ptr` instead of `new` in new code.

@ -107,0 +127,4 @@
mesh_times = new double[device_count];
attrib_times = new double[device_count];
object_bvh_times = new double[device_count];
scene_bvh_times = new double[device_count];
Use `vector` instead of `new[]`. Also perhaps it's better to put these in a struct instead of separate arrays.

@ -17,3 +28,4 @@
CCL_NAMESPACE_BEGIN
class SpinLock {
Don't implement a custom spinlock here, or at all if we can avoid it since getting that right can be surprisingly difficult.
@ -41,2 +93,4 @@
error_message = "";
cancel_cb = function_null;
updates = true;
// update_call = true;
Don't leave commented-out code in.
Removed
@ -14,1 +13,4 @@
const Device *device_)
: BVH(params_, geometry_, objects_), device(device_)
{
// Resize the sub-bvh container to match the number of devices
I added a variable and mutex to indicate if it was being built or has been built. This ensures that it is built only once. However, I put this code in build_bvh in device.cpp as the BVHMulti can have different BVH types within it and this only applies to BVH2.
@ -548,3 +548,3 @@
const CUDAContextScope scope(this);
cuda_assert(cuMemcpyHtoD((CUdeviceptr)device_pointer, host_pointer, size));
cuda_assert(cuMemcpyHtoD((CUdeviceptr)device_pointer + offset,
Changing this to use `reinterpret_cast<unsigned char *>()` results in having to cast it back to `CUdeviceptr` as follows: `reinterpret_cast<CUdeviceptr>(reinterpret_cast<unsigned char *>(device_pointer))`, otherwise the following error occurs:

which is fine, but it seems a bigger mouthful to swallow than just using `CUdeviceptr` in the first place.

Ok, fine to leave it as is then.
@ -551,0 +554,4 @@
void *CUDADevice::host_mem_alloc(size_t size, int aligment) {
void *p_mem = NULL;
CUDAContextScope scope(this);
That is a possibility; I can switch it off for now. In our usage we never had that happen, but most of the systems were not memory constrained. Not putting it in pinned memory ends up with a double copy, as the driver then copies it to pinned memory before copying to the device. I switched this back to using regular memory for now.
@ -581,6 +607,29 @@ void CUDADevice::mem_copy_to(device_memory &mem)
}
}
void CUDADevice::mem_copy_to(device_memory &mem, size_t size, size_t offset)
This gives a great speed boost for interactive preview. I'll remove it from this PR and I can add it in a separate PR later.
@ -8,3 +8,3 @@
#include "bvh/params.h"
#include "scene/scene.h"
Ok, I replaced the include with a forward declaration. In another patch, instead of uploading the geometry data a second time, we use the data in the DeviceScene, which is already uploaded. This saves both memory and time. Would that not be acceptable?
@ -377,4 +449,2 @@
{
size_t new_size = size(width, height, depth);
if (new_size != data_size) {
I reverted this change also.
@ -369,1 +379,4 @@
device_vector(Device *device,
const char *name,
void *p_mem,
Yes, it looks like it's not used in this PR so I removed it.
@ -833,0 +856,4 @@
}
else {
generic_copy_to(mem, size, offset);
}
I reverted this change also.
@ -41,3 +45,3 @@
/* Always add CPU devices at the back since GPU devices can change
* host memory pointers, which CPU uses as device pointer. */
SubDevice *sub;
SubDevice *sub = new SubDevice;
Done. I also replaced the device creation with a `unique_ptr<>`.
@ -51,0 +54,4 @@
// to keep the CPU devices at the end of the vector.
int last = devices.size() - 1;
int o = last;
while ((o > 0) && (devices[o - 1]->device->info.type == DEVICE_CPU)) {
Yes, I fixed this up.
@ -162,0 +185,4 @@
size_t *motion_vert_offsets;
};
struct AttributeSizes {
I'll see what I can do. It is this way because in the original implementation the device_memory is not always exactly the same size (sometimes it is bigger), so we need a place to track the actual size of the upload, otherwise more memory is uploaded than needed.
I don't think putting it in DeviceScene is great because, as you say, that information would then be replicated for each device.
I'll investigate more to see if there is somewhere else where this information would be logical.
I still think it's important to eliminate these; tracking this separately is really error prone.

I don't understand what you mean by "device_memory is not always exactly the same size (sometimes it is bigger)". Is this related to the device memory not shrinking with `mem.device_size < mem.memory_size()`, which I think we should change to `!=` anyway?

Even if we were not shrinking it, that should be an implementation detail of `device_vector`, similar to how it's an implementation detail of `std::vector`. It's not something that should be tracked externally.

@ -36,0 +37,4 @@
* checks the progress for if a cancel has been requested and also
* the device to see if an error has occurred.
*/
bool Scene::progressErrorCheck(Progress &progress, Device *device) {
I changed this to check_cancel_update.
@ -36,0 +39,4 @@
*/
bool Scene::progressErrorCheck(Progress &progress, Device *device) {
bool status = false;
if (!background && progress.get_updates()) {
I changed this to only happen if updates are on, i.e. I removed the !background condition.
@ -107,0 +127,4 @@
mesh_times = new double[device_count];
attrib_times = new double[device_count];
object_bvh_times = new double[device_count];
scene_bvh_times = new double[device_count];
I replaced these with a vector and moved them into a struct.
@ -17,3 +28,4 @@
CCL_NAMESPACE_BEGIN
class SpinLock {
I replaced this with the one in utils/thread.h and removed this code.
@ -41,2 +93,4 @@
error_message = "";
cancel_cb = function_null;
updates = true;
// update_call = true;
Removed
@ -0,0 +1,407 @@
/* SPDX-License-Identifier: Apache-2.0
I split this into geometry.cpp, geometry_mesh.cpp, geometry_attributes.cpp and geometry_bvh.cpp
@ -0,0 +34,4 @@
/*
* Clears all tags used to indicate the the shader needs to be updated.
*/
void GeometryManager::clearShaderUpdateTags(Scene *scene)
I removed the camelCase function names.
@ -0,0 +121,4 @@
sub_dscene->tri_verts.assign_mem(dscene->tri_verts);
sub_dscene->tri_verts.tag_modified();
}
{
Removed as it was an old comment
Changed title from WIP: Refactor geometry update for better parallelization and upload performance to Refactor geometry update for better parallelization and upload performance.
@brecht I hope this is closer to what you are looking for. I have removed the WIP as I think it is ready now :-)
@blender-bot package
Package build started. Download here when ready.
On the design side, I'm still confused about why parallelizing memory copying to multiple devices requires such deep changes on the host side, and why this can't be abstracted more inside `Device` and `MultiDevice`.

CUDA and HIP already have a `MemMap device_mem_map` that keeps track of all `device_memory`. So if `device_memory` contains a `bool modified_`, that should be enough information to implement a method like `Device::copy_modified_memory_to_device` that just loops over all memory and copies where needed? And then that can be done inside a parallel for in `MultiDevice`.

Or am I missing a reason for why this is not practical?
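The reviewer's proposal above can be sketched minimally. This is not Cycles code: `MemEntry` and `DeviceSketch` are illustrative stand-ins for `device_memory` entries in a `device_mem_map`, each carrying a modified flag, with one method that copies whatever changed.

```cpp
#include <cassert>
#include <map>

// Illustrative stand-in for one device_memory entry.
struct MemEntry {
  bool modified = false;
  int copies = 0;  // how many uploads this entry has received
};

struct DeviceSketch {
  // Keyed by a fake device_ptr, mirroring the MemMap the CUDA and HIP
  // backends already keep.
  std::map<int, MemEntry> device_mem_map;

  // Loop over all tracked memory and copy only what is modified;
  // MultiDevice could call this per sub-device inside a parallel for.
  void copy_modified_memory_to_device()
  {
    for (auto &entry : device_mem_map) {
      if (entry.second.modified) {
        entry.second.copies++; /* stand-in for the actual upload */
        entry.second.modified = false;
      }
    }
  }
};
```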
I'm having trouble reviewing the geometry update changes, since it's both moving code and changing the logic. It would be very helpful if you could submit a PR that just moves code to new files and order functions to be similar to the new code. Then I can quickly review and approve that, and then when this PR is updated hopefully it will show a readable delta between old and new code?
@ -125,6 +125,7 @@ void BlenderSession::create_session()
session = new Session(session_params, scene_params);
session->progress.set_update_callback(function_bind(&BlenderSession::tag_redraw, this));
session->progress.set_cancel_callback(function_bind(&BlenderSession::test_cancel, this));
session->progress.set_updates(!headless);
It's still not clear to me why showing updates should ever be disabled. Regardless of whether rendering is headless or not, it's good to see what is happening so Cycles doesn't look like it's stuck. If performance is an issue we can look at other ways to optimize this.
Can you explain the thinking behind this?
Regardless, I think this change and the changes in `progress.h` should be submitted as a separate PR. I think that can be reviewed and landed separately from this; there doesn't seem to be a dependency in either direction.

The reason behind this is that with many threads the mutexes in the progress calls were causing large stalls, so this was an easy way to switch them on and off. The stalls were frequent, especially where the progress update was called inside a loop.
I have removed this for now.
@ -66,3 +66,3 @@
vector<Geometry *> geometry;
vector<Object *> objects;
bool built;
Add empty line after this, initialize to false directly here.
Done
@ -24,0 +36,4 @@
{
int id = device->device_number(subdevice);
resize_sub_bvhs_if_needed(id);
sub_bvhs[id] = bvh;
It's not clear to me how the previous BVH that was stored here gets freed, or how the ownership here works in general. Ideally `unique_ptr` would be used to make that really clear.

I have changed this to use `unique_ptr` now.
BVH2 *const bvh2 = static_cast<BVH2 *>(bvh);
if (refit) {
bvh2->refit(progress);
static std::atomic<int> building{0};
Don't use static variables like this. There can be multiple Cycles sessions active at once that should not be blocked by each other here.
Doh, I'd forgotten there could be multiple sessions :-/ I'll look into moving this into the top level BVH.
I have updated this to storing the building count in the top level BVH so it should work with multiple sessions now.
@ -55,2 +54,2 @@
if (refit) {
bvh2->refit(progress);
static std::atomic<int> building{0};
/* Top level BVH2 build must wait on all other BVH2 builds to finish
I don't understand why all this complicated synchronization logic is here. Don't we already guarantee that the top level BVH is built after the bottom level BVHs from the caller functions?
I would think all that is needed here is a double checked lock, no condition variables or waiting.
No, it only stops 2 different threads from building the same BVH. However, some threads can reach this point before all the object BVHs are built so it needs to wait on those BVHs to be built before it can proceed.
I don't think this would work in this case.
I'll look into moving the atomic into the top level BVH that way it won't block multiple sessions.
This is needed because if more than 1 device uses BVH2 they all work on building it. This happens when you combine GPU + CPU rendering or more than one GPU that uses BVH2.
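The "build once, everyone else waits" part of the discussion can be sketched with a double-checked lock and a condition variable stored in the shared object (rather than a static, so multiple sessions are not blocked by each other). This is illustrative only: `SharedBuild` is a hypothetical type, the expensive BVH2 build is elided, and the wait-for-bottom-level-BVHs coordination is not modeled.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// State lives in the shared BVH object, not in a static, so separate
// Cycles sessions each get their own synchronization.
struct SharedBuild {
  std::mutex mutex;
  std::condition_variable cond;
  bool built = false;
  int build_count = 0;  // how many threads actually ran the build

  void build_once()
  {
    std::unique_lock<std::mutex> lock(mutex);
    if (built) {
      return; /* fast path: already built */
    }
    if (build_count == 0) {
      build_count++; /* this thread claims the build, under the lock */
      lock.unlock();
      /* ... expensive BVH2 build would happen here ... */
      lock.lock();
      built = true;
      cond.notify_all(); /* wake every thread waiting on the build */
    }
    else {
      cond.wait(lock, [this] { return built; });
    }
  }
};
```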
@ -531,3 +536,2 @@
if (mem.type == MEM_GLOBAL) {
global_free(mem);
global_alloc(mem);
if ((mem.device_size < mem.memory_size()) || (!mem.device_pointer)) {
Same comment as before: change `<` to `!=` to ensure memory also decreases.

Done
@ -0,0 +90,4 @@
}
/*
* Clears the modified tags for all elements of the device scene
It's not all of them? For example `object_motion_pass` is not done here.

The function has a generic name but then only works on some subset related to geometry; that's confusing.

I renamed the function to `device_free_geometry`.
@ -0,0 +128,4 @@
DeviceScene *dscene,
const GeometrySizes *p_sizes)
{
if (p_sizes->tri_size != 0) {
Are these conditionals checking for the size really needed? Why can't it just always do this? Don't all of the individual device vectors have sufficient info to skip operations if they are unnecessary?
I believe this was just replicating similar code in geometry.cpp:1092, in the method GeometryManager::device_update_mesh, with an optimization of moving the copy inside the if. I'll remove these and rely on the device_memory instead.
@ -630,3 +630,3 @@
}
double Session::get_estimated_remaining_time() const
double Session::get_estimated_remaining_time() //const
Add back const? The implementation did not change.
Done
@ -7,0 +11,4 @@
#else
#define MUTEX thread_mutex
#define SCOPED_LOCK(mutex) thread_scoped_lock scopedlock(mutex)
#endif
I'd rather not use macros, especially not with such generic names. If a spin lock is faster we might as well just use that.
If there's really a need for both to be switchable, I prefer something like this inside the class:
I removed this.
@blender-bot package
Package build started. Download here when ready.
I have updated the scene, geometry, object BVH and scene BVH update times graphs and updated the results in the spreadsheet.
@blender-bot package
Package build started. Download here when ready.
I've started looking into this in detail, trying to split off changes in separate commits to understand what is going on. I started doing that in #108454, but hit against deeper design issues.
For me the added code complexity in scene update in this PR and #107552 is still way too much. For me the whole idea of making the multi device handle this would be to keep the scene update code almost the same. Yet #107552 still has multiple `DeviceScene` instances, still tracks various memory sizes separately, and has various other changes. Why is that still there? Was it just left in because it was an experiment and not cleaned up yet, or is there something that makes this necessary?

I can't accept this level of code complexity for this optimization; it has to be simpler or we should not do it.
Additionally, I believe the multiple device scenes do not work correctly with host memory fallback and peering. Host memory fallback cannot be done purely in parallel; there needs to be some coordination between devices to share the same pinned memory, instead of each allocating its own copy in a dedicated `device_memory` per device. Peering is currently handled in the multi device class, and also seemingly not handled here, as each device has its own `device_memory`, not one per peering island.

I believe a solution mostly handled by the device API is possible: queue `mem_alloc`, `mem_copy_to` and `mem_zero` operations in the multi device, and then execute them in parallel across devices at a few synchronization points, like before BVH build, displacement and after scene updates. The `device_memory` needs to be shared between devices, and it must be mutex locked.

I started trying to implement this, but especially the host memory fallback sharing is more work than I hoped. So before I spend more time on this, I'd like to understand if I am missing something. Especially regarding why #107552 still has so many changes in scene update, I feel like there may be a reason for this that I do not understand.
Yes, the reason they are similar is that I decided to keep the scene changes as-is so that the performance differences could be compared directly. There are some performance wins from the reorganization which would have favored the first PR. To bring this over to the current code base, it should be possible to bring over just the multi-device changes without the scene changes, which would be relatively simple. However, as you can see from the performance graphs, the wins are quite a bit smaller due to the change in the parallel structure from one `parallel_for` to many. This causes a lock step which the single per-device parallel structure does not have. I can do this (without the scene) in another PR if you wish? Also, I believe you are correct about the peering, but from my understanding only texture memory gets host mapped, so this should not happen for this case where it is only dealing with scene geometry.

Ok, having those other performance wins all in one PR makes it difficult to compare. I want to see a basic parallel memory copy implementation first, and leave further refactoring for another PR. It's too difficult to understand it all together.
I don't think you need many `parallel_for`s if the memory operations are queued up in the multi device. One synchronization point before BVH build maybe, and one at the end of scene update? Even that could be made to work so one device does not block the other.

Other synchronization points would be for shader evaluation, like displacement, transparency baking and the world importance map. But those only need to be computed once, so they would not be within a `parallel_for` over devices anyway.

So if that implementation has many `parallel_for` invocations, I expect that can be solved by improving the implementation. But I might be missing something.

I added the work-in-progress changes I have in #108454, which does parallel memory operations and BVH building. It doesn't do host memory fallback, but peering should be correct.
It would be very helpful if you could test that PR for performance, and look at the implementation. I don't have a multi GPU setup for testing right now, but tested it with a single GPU added twice. And then if there is still a significant performance benefit from this PR, it would be good to understand where that is exactly.
I attached a log from rendering the junkshop with that PR; every "flush" is a synchronization point. It's probably possible to reduce those further if that is the bottleneck.
The memory for scene geometry can also be host mapped, it's not just image textures.
Yes, I can understand that it's a lot of changes in one PR.
It's a bit of a trade-off between having fewer `parallel_for`s and trying to overlap CPU work with GPU work. The more GPUs you have, the more important it was to have just one thread per device, to remove the synchronization that occurred due to the `parallel_for`s. The scene changes were to separate the CPU-side calculations from the GPU-specific work so that each GPU thread would not repeat work. For instance, calculating the normals only needs to be done once, not per device. This also had the side effect of cleaning up the CPU calculations and removing some that were repeated. Also, it was best to execute the GPU commands as soon as possible. For memory transfers it seems that a few large transfers are better than lots of smaller ones (for instance per-object transfers vs transferring the whole scene in one go).

Yes, certainly I'll try this out on my 3080 + 3090 dual-GPU config. It will be interesting to see how queuing up work like this compares to the alternatives.

In theory you only need to synchronize at the end if you are just submitting a list of commands to each device which will be executed in order. However, I guess the flushes are needed to start the execution, which results in more `parallel_for`s.

Ok, I guess I understood that code wrong. Looks like I'll need to go over it again to make sure I understand it correctly.
Is there any overlap between CPU and GPU work currently? From what I can tell this PR is still using blocking memory copy calls. With queued memory operations like #108454, using async memory copy is more practical I think.
My understanding is that this PR does not do something like that, if it does where does that happen?
Removing such a repeated CPU normal calculation should be submitted as a separate PR, that kind of thing should be easy to review and land quickly. And then we can also better understand what the actual impact of parallel copies is.
I thought we were doing synchronous memory copies, so doing it earlier would not matter, but maybe CUDA makes a copy of the host-side memory in some cases? Still, the best solution here seems to be async memory copy then.
For some things like displacement shader evaluation you do need some synchronization, since that is computed on one device and then affects the BVH build of all devices. But after all shader evaluation is done it could all be queued in principle.