Refactor geometry update for better parallelization and upload performance #105403

Open
William Leeson wants to merge 144 commits from leesonw/blender-cluster:geometry_update into main

Member

Why

Cycles' geometry upload and preparation can be refactored to be easier to understand and more efficient. By using a single parallel_for over the devices instead of one thread per copy operation, the threading overhead is reduced, which improves performance. Also, by splitting the host side work from the device side work it becomes easier to distribute the workload. Some calculations are currently performed twice or unnecessarily: they are only needed for specific tasks but are performed regardless, because this is hard to determine with the current code. This refactor also allows for further potential optimizations in the future.

What

  • Break the geometry update into host and device update.
  • Re-order the update so that it is easier to understand
  • Create a parallel loop over each device and from there upload the data.
    • Create a DeviceScene per device instead of using the MultiDevice
    • Create a BVH per device instead of using the MultiDevice
  • Only perform calculations that are required

This results in the following program flow:
    ----HOST SIDE CALCULATIONS AND SETUP------
  1. Calculate vertex normals and gather undisplaced vertices
  2. Tessellate mesh if needed
  3. Compute object bounds
  4. Fill/pack attribute buffers
  5. Fill/pack geometry buffers
    --- only run on the first device ---
  6. Displace meshes and calculate shadow transparency if needed
  7. If displacement or shadow transparency occurred then refill buffers
    ----DEVICE SPECIFIC BEGIN----
  8. Copy attribute buffers to device
  9. Copy geometry buffers to device
  10. Build object BVHs
  11. Build scene BVH
    ----DEVICE SPECIFIC END----
    ----HOST SIDE BELOW THIS POINT----
  12. Remove modified or update tags

The original flow and the new flow are shown in the files below. To achieve this a DeviceScene is created for each device. However, as this only changes the geometry device update (and not all uses of the DeviceScene), the old DeviceScene needs to remain so that other buffers can still be handled by the multi-device. The host side pointers are shared by all the DeviceScenes, so they only need to be packed once. This is done using a method that updates the host pointers of one DeviceScene from an existing DeviceScene.
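
As a rough illustration of the device-specific stage above, the upload and BVH work can be driven by a single parallel loop over the per-device scenes. This is only a sketch: the names `copy_attributes_to_device`, `copy_geometry_to_device` and `device_update_sub_bvh`, and the exact signature, are illustrative and do not necessarily match the code in the PR.

```cpp
/* Sketch of the device-specific stage (steps 8-11); names are illustrative. */
void GeometryManager::device_data_xfer_and_bvh_update(Scene *scene, Progress &progress)
{
  const size_t num_scenes = scene->dscenes.size();

  /* One parallel iteration per device; each device has its own DeviceScene,
   * so uploads and BVH builds do not contend with each other. */
  parallel_for(size_t(0), num_scenes, [&](const size_t idx) {
    DeviceScene *sub_dscene = scene->dscenes[idx].get();
    Device *sub_device = sub_dscene->device;

    /* Steps 8-9: copy the already packed host buffers to this device. */
    sub_dscene->copy_attributes_to_device();
    sub_dscene->copy_geometry_to_device();

    /* Steps 10-11: build or refit the object BVHs, then the scene BVH. */
    device_update_sub_bvh(sub_device, sub_dscene, scene, progress);
  });
}
```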

I have created a spreadsheet [here](https://docs.google.com/spreadsheets/d/1OYF3g988jZQXQG1pVopzIMQTRVo39V1znrLx8c92Rhs/edit?usp=sharing) comparing this patch versus Blender main. A chart of the results so far is also available in the images below. I have also created another spreadsheet [here](https://docs.google.com/spreadsheets/d/1ywQym3LTLgGIGMedExxsS4RpPKwpP60BjeLtXf1BaF4/edit?usp=sharing) comparing it with both the original Blender implementation and the alternative method #107552. This patch is "geom_update" and the other is "parallel_copy" in the graphs.

William Leeson added 7 commits 2023-03-03 15:06:12 +01:00
598c7c151d Remove some locks from progress update
Removes mutex locks when updating pixel counts etc. Also, when
in background mode text status updates are removed.
bbce8a0aae Add macro to switch from std::mutex to SpinLock
Simplified the code to be more similar to the original by
implementing it as a spin lock. Then added a macro to be able to
switch back and forth between std::mutex and the SpinLock.
72b918a9e2 Adds scoped event markers
Add scoped event markers to highlight code regions of interest.
William Leeson added 2 commits 2023-03-08 14:31:11 +01:00
William Leeson added this to the Render & Cycles project 2023-03-09 18:00:57 +01:00
William Leeson added 1 commit 2023-03-10 12:29:47 +01:00
William Leeson added 1 commit 2023-03-14 10:13:35 +01:00
William Leeson added 1 commit 2023-03-14 12:01:34 +01:00
2183d70e9d Move scene BVH update into parallel_for
The scene BVH update was outside the parallel_for; this moves it
inside the parallel_for so it can also be performed in parallel.
William Leeson added 1 commit 2023-03-14 12:21:31 +01:00
5c56f518eb FIX: Preallocate timer arrays in scene
The Windows build failed as the timers were using
a dynamic array. This has been changed to use a
heap array which is allocated with the scene
instead.
William Leeson added 1 commit 2023-03-14 12:58:33 +01:00
cba4d732e2 Remove stats params from deviceDataXferAndBVHUpdate
The stats parameters are now contained in the Scene class and
no longer need to be passed as a parameter.
Hans Goudey changed title from WIP:Refactor geometry update for better parallelization and upload performance to WIP: Refactor geometry update for better parallelization and upload performance 2023-03-14 13:27:16 +01:00
William Leeson added 1 commit 2023-03-14 13:32:05 +01:00
665a0e84b2 Move scene BVH upload into deviceDataXferAndBVHUpdate
To simplify the code and make it easier to read this moves the
scene BVH upload into the deviceDataXferAndBVHUpdate method.
William Leeson added 1 commit 2023-03-14 15:07:05 +01:00
William Leeson requested review from Brecht Van Lommel 2023-03-14 15:33:33 +01:00
Author
Member

@brecht Ok I think this is close to ready. It would be good to get an overall impression even if not a complete review when you have time. I'll hopefully have performance data sometime this week.

William Leeson added 2 commits 2023-03-15 16:19:56 +01:00
William Leeson changed title from WIP: Refactor geometry update for better parallelization and upload performance to Refactor geometry update for better parallelization and upload performance 2023-03-16 08:23:39 +01:00
William Leeson added 1 commit 2023-03-16 09:37:19 +01:00

I see a bunch of commented code and benchmarking utilities in business logic.

William Leeson added 1 commit 2023-03-16 15:56:25 +01:00
William Leeson added 1 commit 2023-03-16 15:59:47 +01:00
William Leeson changed title from Refactor geometry update for better parallelization and upload performance to WIP: Refactor geometry update for better parallelization and upload performance 2023-03-16 16:00:03 +01:00

I'm not going to have time this week, but will look next week.

William Leeson added 1 commit 2023-03-17 12:51:29 +01:00
William Leeson added 1 commit 2023-03-17 16:02:01 +01:00
William Leeson added 1 commit 2023-03-20 12:41:55 +01:00
buildbot/vexp-code-patch-coordinator Build done. Details
ef84fe9e5d
Merge branch 'upstream_main' into geometry_update

@blender-bot build

This part I'm not sure about.

Create a BVH per device instead of using the MultiDevice

Right now we can build the BVH once for multi-device rendering on CUDA, HIP, Metal, OneAPI. With native hardware ray-tracing this may become impossible, but that usually comes with GPU accelerated tree building so it's not so bad.

This performance regression I would rather avoid.

See the buildbot log for build errors, build warnings and test failures to be fixed.

Brecht Van Lommel requested changes 2023-03-20 19:28:34 +01:00
Brecht Van Lommel left a comment
Owner

I did not go through every line of code in geometry.cpp and geometry_additions.cpp, but overall:

  • Try to find some way to avoid counting sizes in a separate struct?
  • Splitting the code into multiple files seems good to get this organized a bit. I'd suggest to have geometry_attributes.cpp, geometry_bvh.cpp, geometry_mesh.cpp, .. or something along those lines. And then each of those can handle as much as possible related to that part of the data, with consistent function names for the various stages.
@ -887,7 +887,6 @@ typedef enum {
typedef struct _nvrtcProgram* nvrtcProgram;

Unrelated whitespace change.

brecht marked this conversation as resolved
@ -681,7 +682,6 @@ static int cuewNvrtcInit(void) {
return result;
}

Unrelated whitespace change.

brecht marked this conversation as resolved
@ -68,6 +68,7 @@ typedef void* DynamicLibrary;
static DynamicLibrary cuda_lib;
static DynamicLibrary nvrtc_lib;
static DynamicLibrary nvtx_lib;

This probably comes from another patch and should not be part of this?

brecht marked this conversation as resolved
@ -125,6 +125,7 @@ void BlenderSession::create_session()
session = new Session(session_params, scene_params);
session->progress.set_update_callback(function_bind(&BlenderSession::tag_redraw, this));
session->progress.set_cancel_callback(function_bind(&BlenderSession::test_cancel, this));
session->progress.set_updates(!background);

This seems like a change to review as a separate PR?

brecht marked this conversation as resolved
@ -14,1 +13,4 @@
const Device *device_)
: BVH(params_, geometry_, objects_), device(device_)
{
// Resize the sub-bvh container to match the number of devices

Maybe there can be some abstraction at this level to avoid building the BVH multiple times when not needed?

leesonw marked this conversation as resolved
@ -149,6 +149,15 @@ void CPUDevice::mem_copy_to(device_memory &mem)
}
}
void CPUDevice::mem_copy_to(device_memory &mem, size_t, size_t offset)

Silence unused argument warnings with , size_t, size_t)

brecht marked this conversation as resolved
@ -31,0 +30,4 @@
int flags = CUEW_INIT_CUDA;
int cuew_result = cuewInit(flags);

This change does nothing (anymore?).

leesonw marked this conversation as resolved
@ -548,3 +548,3 @@
const CUDAContextScope scope(this);
cuda_assert(cuMemcpyHtoD((CUdeviceptr)device_pointer, host_pointer, size));
cuda_assert(cuMemcpyHtoD((CUdeviceptr)device_pointer + offset,

Checking the definition of CUdeviceptr this seems to work, but it's not obvious by just looking at the code or for all devices types. Maybe better to use reinterpret_cast<unsigned char *>( in all device code, to make it clear?

brecht marked this conversation as resolved
@ -551,0 +554,4 @@
void CUDADevice::host_mem_alloc(size_t size, int aligment, void **p_mem) {
CUDAContextScope scope(this);
CUresult result = cuMemAllocHost(p_mem, size);

The CUDA docs for cuMemAllocHost say:

Allocating excessive amounts of memory with cuMemAllocHost() may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to allocate staging areas for data exchange between host and device.

It's not clear to me that we should be using this for allocating the entire scene representation on the CPU? Seems like that could lead to Blender/Cycles running out of memory faster than before.

brecht marked this conversation as resolved
@ -551,0 +555,4 @@
void CUDADevice::host_mem_alloc(size_t size, int aligment, void **p_mem) {
CUDAContextScope scope(this);
CUresult result = cuMemAllocHost(p_mem, size);
if(result != CUDA_SUCCESS) {

This should not have custom error handling. We have cuda_assert for error reporting, so that an error is set in the device and reported to the user.

brecht marked this conversation as resolved
@ -566,8 +584,13 @@ void CUDADevice::mem_alloc(device_memory &mem)
void CUDADevice::mem_copy_to(device_memory &mem)

Can this code be deduplicated by calling mem_copy_to(mem, mem.memory_size(), 0)?

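For illustration, the deduplication suggested here would look roughly like this (a minimal sketch, assuming the offset/size overload added in this PR):

```cpp
void CUDADevice::mem_copy_to(device_memory &mem)
{
  /* Forward to the offset/size variant, copying the whole buffer. */
  mem_copy_to(mem, mem.memory_size(), 0);
}
```
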
brecht marked this conversation as resolved
@ -568,3 +586,2 @@
if (mem.type == MEM_GLOBAL) {
global_free(mem);
global_alloc(mem);
if ((mem.device_size < mem.memory_size()) || (!mem.device_pointer)) {

This means memory allocations never decrease in size? Shouldn't it reallocate whenever the size changes?

brecht marked this conversation as resolved
@ -584,0 +607,4 @@
void CUDADevice::mem_copy_to(device_memory &mem, size_t size, size_t offset)
{
if (mem.type == MEM_GLOBAL) {
if ((mem.device_size < mem.memory_size()) || (!mem.device_pointer)) {

Same comment as above.

brecht marked this conversation as resolved
@ -766,0 +774,4 @@
}
}
void GPUDevice::generic_copy_to(device_memory &mem, size_t size, size_t offset)

Similar comment about code deduplication as above.

brecht marked this conversation as resolved
@ -8,3 +8,3 @@
#include "bvh/params.h"
#include "scene/scene.h"

The intent for the device API was to not be aware of Scene or DeviceScene, though that is already broken somewhat.

Ideally we'd avoid it entirely, but here at least it should include but use a class DeviceScene; forward declaration.

leesonw marked this conversation as resolved
@ -217,2 +222,4 @@
return 0;
}
virtual device_ptr find_matching_mem(device_ptr key, Device * /*sub*/)

Add a comment explaining what this does.

leesonw marked this conversation as resolved
@ -290,8 +300,11 @@ class Device {
friend class DeviceServer;
friend class device_memory;
virtual void host_mem_alloc(size_t size, int alignment, void **p_mem);

This can return void * instead of using void **p_mem.

leesonw marked this conversation as resolved
@ -369,17 +379,78 @@ template<typename T> class device_vector : public device_memory {
assert(data_elements > 0);
}
device_vector(Device *device,

TODO: what is this constructor for?

leesonw marked this conversation as resolved
@ -41,3 +45,3 @@
/* Always add CPU devices at the back since GPU devices can change
* host memory pointers, which CPU uses as device pointer. */
SubDevice *sub;
SubDevice *sub = new SubDevice;

Use unique_ptr and make_unique instead of new in new code.

leesonw marked this conversation as resolved
@ -49,2 +52,2 @@
devices.emplace_front();
sub = &devices.front();
devices.emplace_back(sub);
// find first CPU device and swop it with the new device

swop -> swap

leesonw marked this conversation as resolved
@ -51,0 +54,4 @@
// to keep the CPU devices at the end of the vector.
int last = devices.size() - 1;
int o = last;
while ((o > 0) && (devices[o - 1]->device->info.type == DEVICE_CPU)) {

If we ensure the CPU is always the last device in the vector, there is no need to have a while loop searching for it.

leesonw marked this conversation as resolved
@ -1041,2 +637,2 @@
}
}
/* copy svm attributes to device */
dscene->attributes_map.copy_to_device_if_modified();

Looking at code like this, it feels like maybe DeviceScene should somehow have a list of the device_memory it contains. And then perhaps that could be used to have a function to transfer all modified memory, rather than manually doing it here.

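A possible shape for such a helper, sketched under the assumption that DeviceScene keeps a registry of its device_memory members (the `buffers` list and the method name are hypothetical):

```cpp
/* Hypothetical sketch: upload every registered buffer that is flagged as
 * modified, instead of listing each member by hand at the call site.
 * `buffers`, `modified` and `copy_to_device()` are assumed for the sketch. */
void DeviceScene::copy_modified_to_device()
{
  for (device_memory *mem : buffers) {
    if (mem->modified) {
      mem->copy_to_device();
    }
  }
}
```
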
@ -2088,3 +1487,1 @@
if (geom->is_mesh()) {
Mesh *mesh = static_cast<Mesh *>(geom);
mesh->subd_attributes.clear_modified();
size_t n_scenes = scene->dscenes.size();

n_scenes -> num_scenes is the convention.

leesonw marked this conversation as resolved
@ -2091,0 +1487,4 @@
size_t n_scenes = scene->dscenes.size();
// Parallel upload the geometry data to the devices and
// calculate or refit the BVHs
tbb::parallel_for(size_t(0),

Use just parallel_for, not tbb::parallel_for. We abstract such things in case we want to override them.

leesonw marked this conversation as resolved
@ -2091,0 +1489,4 @@
// calculate or refit the BVHs
tbb::parallel_for(size_t(0),
n_scenes,
[this,

Can we automatically capture all these variables instead of doing it manually?

leesonw marked this conversation as resolved
@ -162,0 +185,4 @@
size_t *motion_vert_offsets;
};
struct AttributeSizes {

Can we avoid having these AttributeSizes and GeometrySizes structs? This feels like something we should be able to set on the device_memory in DeviceScene instead of tracking in a separate struct.

Or maybe it's not a good idea because every device has its own DeviceScene?

brecht marked this conversation as resolved
@ -0,0 +1,1045 @@
/* SPDX-License-Identifier: Apache-2.0

This file needs a better name, or probably code needs to split across files in a different way. This doesn't seem like a logical grouping.

Did you forget to remove this file?

Author
Member

Yes, I have removed it now :-/

@ -0,0 +34,4 @@
/*
* Clears all tags used to indicate the the shader needs to be updated.
*/
void GeometryManager::clearShaderUpdateTags(Scene *scene)

Don't use camelCase function names.

leesonw marked this conversation as resolved
@ -0,0 +121,4 @@
progress.set_status("Updating Mesh", "Computing attributes");
// SHOULD NOT ALLOC ONLY ALLOC IF MORE SPACE IS NEEDED

I guess this is a TODO comment to be resolved.

leesonw marked this conversation as resolved
@ -660,11 +660,8 @@ void ObjectManager::device_update_transforms(DeviceScene *dscene, Scene *scene,
}
});
if (progress.get_cancel()) {

Why remove this?

In general we want to be able to cancel rendering in the middle of updates so users don't have to wait too long.

leesonw marked this conversation as resolved
@ -36,0 +37,4 @@
* checks the progress for if a cancel has been requested and also
* the device to see if an error has occurred.
*/
bool Scene::progressErrorCheck(Progress &progress, Device *device) {

I'd name this check_cancel_update or something like that, progressErrorCheck to me reads as if it's only checking for errors.

leesonw marked this conversation as resolved
@ -36,0 +39,4 @@
*/
bool Scene::progressErrorCheck(Progress &progress, Device *device) {
bool status = false;
if (!background && progress.get_updates()) {

Cancelling quickly is important for F12 renders as well, which are "background" in Cycles.

leesonw marked this conversation as resolved
@ -104,1 +116,4 @@
{
/* Create a DeviceScene for each device */
device->foreach_device([this](Device *sub_device) {
DeviceScene *sub_dscene = new DeviceScene(sub_device);

Use unique_ptr instead of new in new code.

leesonw marked this conversation as resolved
@ -107,0 +127,4 @@
mesh_times = new double[device_count];
attrib_times = new double[device_count];
object_bvh_times = new double[device_count];
scene_bvh_times = new double[device_count];

Use vector instead of new[]. Also perhaps it's better to put these in a struct instead of separate arrays.

leesonw marked this conversation as resolved
@ -17,3 +28,4 @@
CCL_NAMESPACE_BEGIN
class SpinLock {

Don't implement a custom spinlock here, or at all if we can avoid it since getting that right can be surprisingly difficult.

leesonw marked this conversation as resolved
@ -41,2 +93,4 @@
error_message = "";
cancel_cb = function_null;
updates = true;
// update_call = true;

Don't leave commented-out code in.

leesonw marked this conversation as resolved
William Leeson added 4 commits 2023-03-21 15:58:21 +01:00
William Leeson added 2 commits 2023-03-22 11:17:44 +01:00
01e96490f1 Change params for host_mem_alloc to host_mem_alloc(size_t size, int aligment)
The parameters for host_mem_alloc were changed to
`host_mem_alloc(size_t size, int aligment)` from
`host_mem_alloc(size_t size, int aligment, void **p_mem)`.
4e1469b8bb Only build the BVH2 BVH once
Ensures that only 1 BVH2 BVH is built for each object or scene.
This is done using a lock to acquire the BVH2, and other threads
can skip the building if it is for an object. However, for the
top level BVH all threads wait on the TLAS to be finished.
William Leeson added 1 commit 2023-03-22 12:35:33 +01:00
William Leeson added 1 commit 2023-03-23 21:35:52 +01:00
6e0875b729 Only build a BVH2 BVH once
Basically checks if the BVH2 BVH has been built or is being built.
Then when building the top level BVH2 BVH all devices that use it
wait for the one that is building it to finish before continuing.
William Leeson added 1 commit 2023-03-24 12:20:32 +01:00
William Leeson added 1 commit 2023-03-28 09:13:02 +02:00
William Leeson added 1 commit 2023-03-28 10:18:24 +02:00
William Leeson added 2 commits 2023-03-28 13:25:13 +02:00
c952e5d159 FIX: Fix memory leak when using more than 1 device
The multi-device set the device pointer for the device_memory as it
iterated through the devices. However, it did not restore the
original pointer to the device if the device_ptr (pointer to the
memory buffer) was 0, which it always is after a mem_free.
3a566ccee4 FIX: Using the Optix BVH for the NVidia device
Optix device was using the multi-bvh as the Optix BVH which was
causing the device to have the wrong handles for the objects.
William Leeson added 2 commits 2023-03-28 14:56:21 +02:00
b0345701a0 FIX: Use the headless flag to switch on/off progress updates
Don't do progress updates when in headless mode. Previously
the background flag was used, but this is also used for F12 renders,
which need the updates.
d59a30b4a0 FIX: Lock mutex before using condition variable waiting on BVH build
The mutex needs to be locked before the condition variable can
be used. This adds a scoped lock to ensure it is locked before
waiting on the condition_variable.
William Leeson added 1 commit 2023-03-29 10:11:04 +02:00
b75f6ad788 FIX: Make sure top level BVH is built after all other BVHs
The top level BVH can only be built once all the other (bottom
level) BVHs are completed.
William Leeson added 1 commit 2023-03-29 18:20:50 +02:00
Brecht Van Lommel requested review from Brecht Van Lommel 2023-03-29 18:51:19 +02:00
William Leeson added 1 commit 2023-04-04 14:41:32 +02:00
William Leeson added 1 commit 2023-04-05 18:31:18 +02:00
William Leeson added 1 commit 2023-04-10 18:09:31 +02:00
William Leeson added 17 commits 2023-04-11 15:41:13 +02:00
a27247e28f Refactor geometry.cpp and geometry_additions.cpp
This breaks up the geometry.cpp and geometry_additions.cpp into
the following files:
- geometry.cpp
- geometry_mesh.cpp
- geometry_attributes.cpp
- geometry_bvh.cpp
so as to organise the code into more manageable pieces.
9b3b180d23 FIX: Enable OneAPI build and fix compile issues
Needed to add the memory copy methods the have offset and size.
ac67371fa4 Makes sure the CPU is the last device just once at the end.
Previously after every device was added the CPU device was swapped
to the end of the list. Now the position it was added at is saved
and it is swapped with the last device at the end.
674721194e FIX: Adds case where no CPU is selected in MultiDevice
Forgot to push the code which skips swapping if there is no CPU in
the MultiDevice set.
3c81e479a6 Switch off using pinned memory on CUDA devices
Pinned memory was used for all device_memory. However, since
it may cause memory pressure this was removed; it can be
enabled by passing USE_DEVICE_PINNED_MEMORY.
cddd2bfdf0 Replace upload and building times arrays with a vector struct
Arrays of double values are replaced by a vector<> of a struct
scene_times.
William Leeson reviewed 2023-04-11 15:43:15 +02:00
Author
Member

Removed

leesonw marked this conversation as resolved
@ -14,1 +13,4 @@
const Device *device_)
: BVH(params_, geometry_, objects_), device(device_)
{
// Resize the sub-bvh container to match the number of devices
Author
Member

I added a variable and mutex to indicate if it was being built or has been built. This ensures that it is built only once. However, I put this code in build_bvh in device.cpp as the BVHMulti can have different BVH types within it and this only applies to BVH2.

brecht marked this conversation as resolved
@ -548,3 +548,3 @@
const CUDAContextScope scope(this);
cuda_assert(cuMemcpyHtoD((CUdeviceptr)device_pointer, host_pointer, size));
cuda_assert(cuMemcpyHtoD((CUdeviceptr)device_pointer + offset,
Author
Member

Changing this to use reinterpret_cast<unsigned char *>() results in having to cast it back to CUdeviceptr as follows reinterpret_cast<CUdeviceptr>(reinterpret_cast<unsigned char *>(device_pointer)) otherwise the following error occurs:

device_impl.cpp:550:78: error: invalid conversion from ‘unsigned char*’ to ‘CUdeviceptr’ {aka ‘long long unsigned int’} [-fpermissive]
  550 | uMemcpyHtoD(reinterpret_cast<unsigned char *>(device_pointer) + offset,
      |             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~
      |                                                               |
      |                                                               unsigned char*

which is fine but it seems a bigger mouthful to swallow than just using CUdeviceptr in the first place.

Ok, fine to leave it as is then.

brecht marked this conversation as resolved
@ -551,0 +554,4 @@
void *CUDADevice::host_mem_alloc(size_t size, int aligment) {
void *p_mem = NULL;
CUDAContextScope scope(this);
Author
Member

That is a possibility, I can switch it off for now. In our usage we never had that happen but most of the systems were not memory constrained. Not putting it in pinned memory ends up with a double copy as the driver then copies it to pinned memory before copying to the device. I switched this back to using regular memory for now.

brecht marked this conversation as resolved
@ -581,6 +607,29 @@ void CUDADevice::mem_copy_to(device_memory &mem)
}
}
void CUDADevice::mem_copy_to(device_memory &mem, size_t size, size_t offset)
Author
Member

This gives a great speed boost for interactive preview. I'll remove it from this PR and I can add it in a separate PR later.

brecht marked this conversation as resolved
@ -8,3 +8,3 @@
#include "bvh/params.h"
#include "scene/scene.h"
Author
Member

Ok, I replaced the include with a forward declaration. In another patch, instead of uploading the geometry data a second time we use the data in the DeviceScene which is already uploaded. This saves both memory and time. Would that not be acceptable?

brecht marked this conversation as resolved
@ -377,4 +449,2 @@
{
size_t new_size = size(width, height, depth);
if (new_size != data_size) {
Author
Member

I reverted this change also.

brecht marked this conversation as resolved
@ -369,1 +379,4 @@
device_vector(Device *device,
const char *name,
void *p_mem,
Author
Member

Yes, it looks like it's not used in this PR so I removed it.

brecht marked this conversation as resolved
@ -833,0 +856,4 @@
}
else {
generic_copy_to(mem, size, offset);
}
Author
Member

I reverted this change also.

brecht marked this conversation as resolved
@ -41,3 +45,3 @@
/* Always add CPU devices at the back since GPU devices can change
* host memory pointers, which CPU uses as device pointer. */
SubDevice *sub;
SubDevice *sub = new SubDevice;
Author
Member

Done. I also replaced the device creation with a unique_ptr<>

brecht marked this conversation as resolved
@ -51,0 +54,4 @@
// to keep the CPU devices at the end of the vector.
int last = devices.size() - 1;
int o = last;
while ((o > 0) && (devices[o - 1]->device->info.type == DEVICE_CPU)) {
Author
Member

Yes, I fixed this up.

brecht marked this conversation as resolved
@ -162,0 +185,4 @@
size_t *motion_vert_offsets;
};
struct AttributeSizes {
Author
Member

I'll see what I can do. It is this way because in the original implementation the device_memory is not always exactly the same size (sometimes it is bigger) so we need a place to track the actual size of the upload, otherwise more memory is uploaded than needed.

I don't think putting it in DeviceScene is great because as you say if I put it in the DeviceScene then that information is replicated for each device.

I'll investigate more to see if there is somewhere else that this information would be logical.

I still think it's important to eliminate these, tracking this separately is really error prone.

I don't understand what you mean by "device_memory is not always exactly the same size (sometimes it is bigger)". Is this related to the device memory not shrinking with mem.device_size < mem.memory_size(), that I think we should change to != anyway?

Even if we were not shrinking it, that should be an implementation detail of device_vector similar to how it's an implementation detail of std::vector. It's not something that should be tracked externally.

@ -36,0 +37,4 @@
* checks the progress for if a cancel has been requested and also
* the device to see if an error has occurred.
*/
bool Scene::progressErrorCheck(Progress &progress, Device *device) {
Author
Member

I changed this to check_cancel_update.

brecht marked this conversation as resolved
@ -36,0 +39,4 @@
*/
bool Scene::progressErrorCheck(Progress &progress, Device *device) {
bool status = false;
if (!background && progress.get_updates()) {
Author
Member

I changed this to only happen if updates are on. i.e. I removed the !background condition

brecht marked this conversation as resolved
@ -107,0 +127,4 @@
mesh_times = new double[device_count];
attrib_times = new double[device_count];
object_bvh_times = new double[device_count];
scene_bvh_times = new double[device_count];
Author
Member

I replaced these with a vector and moved them into a struct.

@ -17,3 +28,4 @@
CCL_NAMESPACE_BEGIN
class SpinLock {
Author
Member

I replaced this with the one in utils/thread.h and removed this code.

brecht marked this conversation as resolved
@ -41,2 +93,4 @@
error_message = "";
cancel_cb = function_null;
updates = true;
// update_call = true;
Author
Member

Removed

brecht marked this conversation as resolved
William Leeson added 1 commit 2023-04-11 16:57:48 +02:00
66a6a7a0af Clean up code
1. remove camelCase method names.
2. fix up white space changes.
3. remove some unnecessary changes.
William Leeson added 1 commit 2023-04-11 18:50:44 +02:00
William Leeson added 2 commits 2023-04-12 12:04:04 +02:00
William Leeson added 1 commit 2023-04-13 15:37:25 +02:00
William Leeson reviewed 2023-04-13 15:57:32 +02:00
@ -0,0 +1,407 @@
/* SPDX-License-Identifier: Apache-2.0
Author
Member

I split this into geometry.cpp, geometry_mesh.cpp, geometry_attributes.cpp and geometry_bvh.cpp

brecht marked this conversation as resolved
@ -0,0 +34,4 @@
/*
* Clears all tags used to indicate the the shader needs to be updated.
*/
void GeometryManager::clearShaderUpdateTags(Scene *scene)
Author
Member

I removed the camelCase function names.

brecht marked this conversation as resolved
@ -0,0 +121,4 @@
sub_dscene->tri_verts.assign_mem(dscene->tri_verts);
sub_dscene->tri_verts.tag_modified();
}
{
Author
Member

Removed as it was an old comment

brecht marked this conversation as resolved
William Leeson added 1 commit 2023-04-13 16:20:56 +02:00
William Leeson added 3 commits 2023-04-13 16:25:26 +02:00
William Leeson added 2 commits 2023-04-14 10:23:33 +02:00
William Leeson added 1 commit 2023-04-14 10:57:01 +02:00
bbbf76c4db GeometrySizes and AttributeSizes are stored in Scene
Moved GeometrySizes and AttributeSizes to be stored in scene. This
resulted in them being removed as parameters where a Scene was
already passed in.
William Leeson added 1 commit 2023-04-14 13:42:04 +02:00
b1dd204d42 Moves DeviceScene related methods to the DeviceScene class
Refactored the GeometryManager methods so that those relating to
the DeviceScene are now methods in the DeviceScene.
William Leeson added 1 commit 2023-04-14 14:20:59 +02:00
f63c508b40 Move BVH2 device update to DeviceScene
As the device_update_bvh2 method just involves transferring data
to the device it was moved to DeviceScene.
William Leeson added 2 commits 2023-04-14 15:10:42 +02:00
d7cd4a4951 FIX: Add device type definitions
Moving DeviceScene from scene.h caused the device definitions to be
no longer present in various files. This add them back in by
including device.h in the required files.
William Leeson changed title from WIP: Refactor geometry update for better parallelization and upload performance to Refactor geometry update for better parallelization and upload performance 2023-04-14 15:13:20 +02:00
Author
Member

@brecht I hope this is closer to what you are looking for. I have removed the WIP as I think it is ready now :-)

Member

@blender-bot package

Member

Package build started. [Download here](https://builder.blender.org/download/patch/PR105403) when ready.
William Leeson added 1 commit 2023-04-14 15:53:47 +02:00
b407dba398 FIX: Use new memory copy method that replaced the old one
Forgot to add the file for the previous change to the memory copy
methods, which resulted in the OneAPI build failing.

On the design side, I'm still confused about why parallelizing memory copying to multiple devices requires such deep changes on the host side, why this can't be abstracted more inside Device and MultiDevice.

CUDA and HIP already have a MemMap device_mem_map that keeps track of all device_memory. So if device_memory contains a bool modified_, that should be enough information to implement a method like Device::copy_modified_memory_to_device that just loops over all memory and copies where needed? And then that can be done inside a parallel for in MultiDevice.

Or am I missing a reason for why this is not practical?

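A minimal sketch of that idea, assuming a `modified_` flag on `device_memory` as described above and iterating the existing per-device memory map (the exact map key and flag handling are illustrative):

```cpp
/* Hypothetical Device::copy_modified_memory_to_device: walk the device's
 * memory map and re-upload anything flagged as modified. */
void CUDADevice::copy_modified_memory_to_device()
{
  for (auto &entry : device_mem_map) { /* Assumed: MemMap keyed by device_memory *. */
    device_memory *mem = entry.first;
    if (mem->modified_) {
      mem_copy_to(*mem);
      mem->modified_ = false;
    }
  }
}
```

The multi device could then call this for each sub-device inside a parallel_for.
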
Brecht Van Lommel requested changes 2023-04-14 18:06:24 +02:00
Brecht Van Lommel left a comment
Owner

I'm having trouble reviewing the geometry update changes, since it's both moving code and changing the logic. It would be very helpful if you could submit a PR that just moves code to new files and order functions to be similar to the new code. Then I can quickly review and approve that, and then when this PR is updated hopefully it will show a readable delta between old and new code?

@ -125,6 +125,7 @@ void BlenderSession::create_session()
session = new Session(session_params, scene_params);
session->progress.set_update_callback(function_bind(&BlenderSession::tag_redraw, this));
session->progress.set_cancel_callback(function_bind(&BlenderSession::test_cancel, this));
session->progress.set_updates(!headless);

It's still not clear to me why showing updates should ever be disabled. Regardless if rendering headless or not it's good to see what is happening so Cycles doesn't look like it's stuck. If performance is an issue we can look at other ways to optimize this.

Can you explain the thinking behind this?

Regardless, I think this change and the changes in progress.h should be submitted as a separate PR. I think that can be reviewed and landed separately from this, there doesn't seem to be a dependency in either direction.

Author
Member

The reason behind this is that with many threads the mutexes in the progress calls were causing large stalls. So this was an easy way to switch them on and off. The stalls were frequent, especially where the progress update was called inside a loop.

I have removed this for now.

@ -66,3 +66,3 @@
vector<Geometry *> geometry;
vector<Object *> objects;
bool built;

Add empty line after this, initialize to false directly here.

Author
Member

Done

@ -24,0 +36,4 @@
{
int id = device->device_number(subdevice);
resize_sub_bvhs_if_needed(id);
sub_bvhs[id] = bvh;

It's not clear to me how the previous BVH that was stored here gets freed, or how the ownership here works in general. Ideally unique_ptr would be used to make that really clear.

Author
Member

I have changed this to use unique_ptr now

leesonw marked this conversation as resolved
@ -54,3 +53,2 @@
BVH2 *const bvh2 = static_cast<BVH2 *>(bvh);
if (refit) {
bvh2->refit(progress);
static std::atomic<int> building{0};

Don't use static variables like this. There can be multiple Cycles sessions active at once that should not be blocked by each other here.

Author
Member

Doh, I'd forgotten there could be multiple sessions :-/ I'll look into moving this into the top level BVH.

Author
Member

I have updated this to storing the building count in the top level BVH so it should work with multiple sessions now.

@ -55,2 +54,2 @@
if (refit) {
bvh2->refit(progress);
static std::atomic<int> building{0};
/* Top level BVH2 build must wait on all other BVH2 builds to finish

I don't understand why all this complicated synchronization logic is here. Don't we already guarantee that the top level BVH is built after the bottom level BVHs from the caller functions?

I would think all that is needed here is a double checked lock, no condition variables or waiting.

Author
Member

No, it only stops 2 different threads from building the same BVH. However, some threads can reach this point before all the object BVHs are built so it needs to wait on those BVHs to be built before it can proceed.

I don't think this would work in this case.

I'll look into moving the atomic<int> into the top level BVH; that way it won't block multiple sessions.

Author
Member

This is needed because if more than 1 device uses BVH2 they all work on building it. This happens when you combine GPU + CPU rendering or more than one GPU that uses BVH2.

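For context, the kind of synchronization being discussed can be sketched as a small standalone example: one device thread builds the shared BVH2 while the other device threads wait on a condition variable until the build has completed. This is a simplified illustration, not the code in the PR.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>

/* Simplified sketch: one thread builds the shared BVH2, other device
 * threads wait until the build has finished instead of building it again. */
struct SharedBVHBuild {
  std::mutex mutex;
  std::condition_variable cond;
  bool building = false;
  bool built = false;
};

void build_shared_bvh_once(SharedBVHBuild &state, const std::function<void()> &do_build)
{
  std::unique_lock<std::mutex> lock(state.mutex);
  if (state.built) {
    return; /* Already built by another device thread. */
  }
  if (state.building) {
    /* Another thread is building: wait for it to finish. */
    state.cond.wait(lock, [&] { return state.built; });
    return;
  }
  state.building = true;
  lock.unlock();

  do_build(); /* Build outside the lock so the build itself is not serialized. */

  lock.lock();
  state.built = true;
  state.cond.notify_all();
}
```
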
@ -531,3 +536,2 @@
if (mem.type == MEM_GLOBAL) {
global_free(mem);
global_alloc(mem);
if ((mem.device_size < mem.memory_size()) || (!mem.device_pointer)) {

Same comment as before, change < to != to ensure memory decreases.

Author
Member

Done

@ -0,0 +90,4 @@
}
/*
* Clears the modified tags for all elements of the device scene

It's not all of them? For example object_motion_pass is not done here.

The function has a generic name but then only works on some subset related to geometry, that's confusing.

Author
Member

I renamed the function device_free_geometry

@ -0,0 +128,4 @@
DeviceScene *dscene,
const GeometrySizes *p_sizes)
{
if (p_sizes->tri_size != 0) {

Are these conditionals checking for the size really needed? Why can't it just always do this? Don't all of the individual device vectors have sufficient info to skip operations if they are unnecessary?

Author
Member

I believe this was just replicating similar code in geometry.cpp:1092 in method GeometryManager::device_update_mesh with an optimization of moving the copy inside the if. I'll remove these and rely on the device_memory instead.

@ -630,3 +630,3 @@
}
double Session::get_estimated_remaining_time() const
double Session::get_estimated_remaining_time() //const

Add back const? The implementation did not change.

Author
Member

Done

@ -7,0 +11,4 @@
#else
#define MUTEX thread_mutex
#define SCOPED_LOCK(mutex) thread_scoped_lock scopedlock(mutex)
#endif

I'd rather not use macros, especially not with such generic names. If a spin lock is faster we might as well just use that.

If there's really a need for both to be switchable, I prefer something like this inside the class:

using progress_mutex = thread_spin_lock;
using progress_lock = thread_scoped_spin_lock;
Author
Member

I removed this.

William Leeson added 1 commit 2023-04-17 12:26:43 +02:00
William Leeson added 1 commit 2023-04-17 14:17:55 +02:00
William Leeson added 2 commits 2023-04-17 18:14:19 +02:00
William Leeson added 1 commit 2023-04-18 10:34:07 +02:00
William Leeson added 2 commits 2023-04-19 13:46:35 +02:00
William Leeson added 1 commit 2023-04-19 15:36:03 +02:00
William Leeson added 1 commit 2023-04-20 12:27:21 +02:00
William Leeson added 1 commit 2023-04-20 17:55:52 +02:00
William Leeson added 2 commits 2023-04-21 15:37:53 +02:00
d96fd9b08f Handle the bvh root index correctly for the device specific DeviceScene
Uses the unique_ptr to deallocate the BVH2 structure in the
BVH postprocess and assigns the bvh root indices in the device
specific areas also.
William Leeson added 1 commit 2023-04-21 15:56:35 +02:00
William Leeson added 1 commit 2023-04-21 17:29:54 +02:00
William Leeson added 1 commit 2023-04-24 10:38:35 +02:00
William Leeson added 4 commits 2023-04-26 12:23:25 +02:00
1560ec66c9 FIX: Create CUDA context earlier and remove lock
Some CUDA commands were used before the context was set up. Also,
since the BVHs are built in order, there is no need for the lock
that prevented two being built at the same time to save memory.
William Leeson added 1 commit 2023-04-26 15:00:37 +02:00
William Leeson added 3 commits 2023-04-27 11:16:31 +02:00
58cddaeefe FIX: Upload textures to all devices
For the displacement and shadow texture generation the textures
were originally only uploaded to 1 device. However, this resulted
in some textures that were needed across all devices not being
uploaded. This fixes that by uploading them to all devices.
a916e34033 Move scene DeviceScene to use unique_ptr
To reduce the amount of code and manage the life cycles of the
DeviceScenes they are now stored as unique_ptrs
William Leeson added 1 commit 2023-04-27 14:45:36 +02:00
William Leeson added 2 commits 2023-04-28 10:33:24 +02:00
df62806c94 Use task pool for object BVH build also fixes stats recording
Re-instates the task pool used to build the BVHs for the objects.
Also the displacement timing was being reported twice and this is
fixed.
William Leeson added 1 commit 2023-04-28 14:30:36 +02:00
William Leeson requested review from Brecht Van Lommel 2023-04-28 14:45:53 +02:00
William Leeson added 1 commit 2023-04-28 15:29:35 +02:00
buildbot/vexp-code-patch-coordinator Build done. Details
429f953d6c
Move CUDA context into a smaller scope
Author
Member

@blender-bot package

Member

Package build started. [Download here](https://builder.blender.org/download/patch/PR105403) when ready.
William Leeson added 1 commit 2023-05-01 10:05:18 +02:00
buildbot/vexp-code-patch-coordinator Build done. Details
de58c2ab8e
Merge branch 'upstream_main' into geometry_update
Author
Member

I have updated the scene, geometry, object BVH and scene BVH update times graphs and updated the results in the [spreadsheet](https://docs.google.com/spreadsheets/d/1OYF3g988jZQXQG1pVopzIMQTRVo39V1znrLx8c92Rhs/edit?usp=sharing).
Author
Member

@blender-bot package

Member

Package build started. [Download here](https://builder.blender.org/download/patch/PR105403) when ready.
William Leeson added 1 commit 2023-05-03 11:22:11 +02:00
William Leeson added 1 commit 2023-05-04 13:49:39 +02:00
William Leeson added 2 commits 2023-05-08 14:23:28 +02:00
184e445502 FIX: get_device_bvh returns the MultiBVH if the device is the MultiDevice
If the device was the MultiDevice it previously returned the first
BVH. This was incorrect; it should return the MultiBVH. Also, if
it cannot find the device it now returns NULL.
William Leeson added 2 commits 2023-05-08 16:49:26 +02:00
5cb0b8a3bf Move the BVH layout determination into a method for reuse
This code was duplicated in MultiDevice and geometry_bvh.cpp, so
this removes that duplication.
William Leeson added 1 commit 2023-05-09 13:57:22 +02:00
e701b8e09d FIX: Don't disable ray query on scene BVH
This fixes an issue when using the BVH indirectly through the
MultiDevice. Setting the bvh_layout to none causes the ray queries
to be disabled. Also this does some refactoring to clean up the
use of the device_data_xfer_and_bvh_update method.
William Leeson added 1 commit 2023-05-10 15:33:48 +02:00
William Leeson added 1 commit 2023-05-12 10:16:40 +02:00
William Leeson added 1 commit 2023-05-15 16:12:58 +02:00
William Leeson added 1 commit 2023-05-16 10:11:25 +02:00
William Leeson added 1 commit 2023-05-17 10:54:47 +02:00
William Leeson added 1 commit 2023-05-17 15:38:15 +02:00
William Leeson added 1 commit 2023-05-17 18:05:19 +02:00
William Leeson added 2 commits 2023-05-19 08:51:12 +02:00
William Leeson added 1 commit 2023-05-22 10:49:55 +02:00
William Leeson added 1 commit 2023-05-25 13:35:29 +02:00
William Leeson added 1 commit 2023-05-29 10:04:19 +02:00

I've started looking into this in detail, trying to split off changes in separate commits to understand what is going on. I started doing that in #108454, but hit against deeper design issues.

For me the added code complexity in scene update in this PR and #107552 is still way too much. For me the whole idea of making multi device handle this would be to keep the scene update code almost the same. Yet #107552 still has multiple DeviceScene instances, still tracks various memory sizes separately and various other changes. Why is that still there? Was it just left in because it was an experiment and not cleaned up yet, or is there something that makes this necessary?

I can't accept this level of code complexity for this optimization, it has to be simpler or we should not do it.

Additionally, I believe the multiple device scenes do not work correctly with host memory fallback and peering. Host memory fallback cannot be done purely in parallel; there needs to be some coordination between devices to share the same pinned memory instead of each allocating its own copy in a dedicated device_memory. Peering is currently handled in the multi device class, and also seemingly not handled here, as each device has its own device_memory, not one per peering island.


I believe a solution mostly handled by the device API is possible.

  • We can queue up mem_alloc, mem_copy_to and mem_zero operations in the multi device, and then execute them in parallel across devices at a few synchronization points, like before BVH build, displacement and after scene updates.
  • Similarly BVH build operations can also be queued up, and then executed in parallel across devices.
  • Peering can work with this without much trouble, as we can just do operations in parallel across peering islands instead of devices.
  • Host memory fallback is the trickiest part. The related members in device_memory need to be shared between devices, and it must be mutex locked.

I started trying to implement this, but especially the host memory fallback sharing is more work than I hoped. So before I spend more time on this, I'd like to understand if I am missing something. Especially regarding why #107552 still has so many changes in scene update, I feel like there may be a reason for this that I do not understand.
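A very rough sketch of what queuing the memory operations and flushing them at a synchronization point could look like; every name here is hypothetical, it is not the device API, and it ignores peering and host memory fallback entirely:

```
#include <functional>
#include <thread>
#include <vector>

/* Hypothetical sketch only: each sub-device accumulates its pending memory
 * operations (mem_alloc / mem_copy_to / mem_zero) as callables, and a flush
 * at a synchronization point runs every queue in parallel, one thread per
 * sub-device. Queuing is assumed to happen from a single host thread. */
struct SubDeviceExample {
  std::vector<std::function<void()>> pending_ops;

  void queue(std::function<void()> op)
  {
    pending_ops.push_back(std::move(op));
  }
};

/* Called before BVH build, before displacement, or at the end of scene
 * update, so the devices execute their queued operations concurrently. */
void flush_all(std::vector<SubDeviceExample> &sub_devices)
{
  std::vector<std::thread> workers;
  for (SubDeviceExample &sub : sub_devices) {
    workers.emplace_back([&sub]() {
      for (auto &op : sub.pending_ops) {
        op(); /* perform the actual alloc/copy/zero on this device */
      }
      sub.pending_ops.clear();
    });
  }
  for (std::thread &worker : workers) {
    worker.join();
  }
}
```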

Author
Member

I've started looking into this in detail, trying to split off changes in separate commits to understand what is going on. I started doing that in #108454, but hit against deeper design issues.

For me the added code complexity in scene update in this PR and #107552 is still way too much. For me the whole idea of making multi device handle this would be to keep the scene update code almost the same. Yet #107552 still has multiple DeviceScene instances, still tracks various memory sizes separately and various other changes. Why is that still there? Was it just left in because it was an experiment and not cleaned up yet, or is there something that makes this necessary?

I can't accept this level of code complexity for this optimization, it has to be simpler or we should not do it.

Additionally, I believe the multiple device scenes do not work correctly with host memory fallback and peering. Host memory fallback can not be done purely in parallel, there needs to be some coordination between devices to share the same pinned memory instead of each allocating its own copy in a dedicated device_memory for each device. Peering is currently handled in the multi device class, and also seemingly not handled as each device has its own device_memory, not one per peering island.


I believe a solution mostly handled by the device API is possible.

  • We can queue up mem_alloc, mem_copy_to and mem_zero operations in the multi device, and then execute them in parallel across devices at a few synchronization points, like before BVH build, displacement and after scene updates.
  • Similarly BVH build operations can also be queued up, and then executed in parallel across devices.
  • Peering can work with this without much trouble, as we can just do operations in parallel across peering islands instead of devices.
  • Host memory fallback is the trickiest part. The related members in device_memory need to be shared between devices, and it must be mutex locked.

I started trying to implement this, but especially the host memory fallback sharing is more work than I hoped. So before I spend more time on this, I'd like to understand if I am missing something. Especially regarding why #107552 still has so many changes in scene update, I feel like there may be a reason for this that I do not understand.

Yes, the reason they are similar is that I decided to keep the scene changes as-is so that the performance differences could be compared directly. There are some performance wins from the reorganization which would have favored the first PR. To bring this over to the current code base it should be possible to bring over just the Multi-Device changes without the scene changes, which would be relatively simple. However, as you can see from the performance graphs, the wins are quite a bit smaller due to the change in the parallel structure from one parallel_for to many. This causes a lock-step which the single per-device parallel structure does not have. I can do this (without the scene) in another PR if you wish? Also, I believe you are correct about the peering, but from my understanding only texture memory gets host mapped, so this should not happen in this case where it is only dealing with scene geometry.
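To make the "single parallel loop over devices" structure concrete, a minimal hypothetical sketch; the device type and the upload/build helpers below are placeholders, not the Cycles API:

```
#include <thread>
#include <vector>

/* Placeholder types and functions for illustration only. */
struct DeviceExample {
  /* ... */
};

static void upload_buffers(DeviceExample & /*device*/)
{
  /* copy the packed geometry and attribute buffers to this device */
}

static void build_bvhs(DeviceExample & /*device*/)
{
  /* build the object BVHs and then the scene BVH on this device */
}

/* One worker thread per device: each device runs its whole upload + BVH
 * sequence independently and only joins once at the very end, instead of
 * meeting the other devices at every intermediate parallel_for. */
static void update_devices(std::vector<DeviceExample> &devices)
{
  std::vector<std::thread> workers;
  workers.reserve(devices.size());
  for (DeviceExample &device : devices) {
    workers.emplace_back([&device]() {
      upload_buffers(device);
      build_bvhs(device);
    });
  }
  for (std::thread &worker : workers) {
    worker.join();
  }
}
```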

Yes the reason they are similar is that I  decided to keep the scene changes as is so that the performance differences could be compared directly. There are some performance wins from the reorganization which would have favored the first PR.

Ok, having those other performance wins all in one PR makes it difficult to compare. I want to see a basic parallel memory copy  implementation first, and leave further refactoring for another PR. It's too difficult to understand it all together.

To bring this over to the current code base it should be possible to just bring over the just the Multi-Device changes without the scene changes which would be relatively simple. However, as you can see from the performance graphs the wins are quite a bit smaller due to the change in the parallel structure from 1 parallel_for to many. This causes a lock step which the single per device parallel structure does not have.

I don't think you need many parallel_for if the memory operations are queued up in the multi device. One synchronization point before BVH build maybe, and one at the end of scene update? Even that could be made to work so one device does not block the other.

Other synchronization points would be for shader evaluation, like displacement, transparency baking and world importance map. But those only need to be computed once so would not be within a parallel_for over devices anyway.

So I think if that implementation has many parallel_for invocations, I expect that can be solved by improving the implementation. But I might be missing something.

I can do this (without the scene) in another PR if you wish?

I added the work in progress changes I have in #108454, which does parallel memory operations and BVH building. It doesn't do host memory fallback, but peering should be correct.

It would be very helpful if you could test that PR for performance, and look at the implementation. I don't have a multi GPU setup for testing right now, but tested it with a single GPU added twice. And then if there is still a significant performance benefit from this PR, it would be good to understand where that is exactly.

I attached a log from rendering the junkshop with that PR, every "flush" is a synchronization point. It's probably possible to reduce those further if that is the bottleneck.

Also, I believe you are correct about the peering but from my understanding only texture memory gets host mapped so this should not happen for this case where it is only dealing with scene geometry.

The memory for scene geometry can also be host mapped, it's not just image textures.

Author
Member

Yes the reason they are similar is that I  decided to keep the scene changes as is so that the performance differences could be compared directly. There are some performance wins from the reorganization which would have favored the first PR.

Ok, having those other performance wins all in one PR makes it difficult to compare. I want to see a basic parallel memory copy  implementation first, and leave further refactoring for another PR. It's too difficult to understand it all together.

Yes, I can understand that it's a lot of changes in one PR.

To bring this over to the current code base it should be possible to just bring over the just the Multi-Device changes without the scene changes which would be relatively simple. However, as you can see from the performance graphs the wins are quite a bit smaller due to the change in the parallel structure from 1 parallel_for to many. This causes a lock step which the single per device parallel structure does not have.

I don't think you need many parallel_for if the memory operations are queued up in the multi device. One synchronization point before BVH build maybe, and one at the end of scene update? Even that could be made to work so one device does not block the other.

Other synchronization points would be for shader evaluation, like displacement, transparency baking and world importance map. But those only need to be computed once so would not be within a parallel_for over devices anyway.

So I think if that implementation has many parallel_for invocations, I expect than can be solved by improving the implementation. But I might be missing something.

It's a bit of a trade-off between having fewer parallel_for's and trying to overlap CPU work with GPU work. The more GPUs you have, the more important it was to have just one thread per device to remove the synchronization that occurred due to the parallel_for's. The scene changes were to separate the CPU-side calculations from the GPU-specific work so that each GPU thread would not repeat work. For instance, calculating the normals only needs to be done once, not per device. This also had the side effect of cleaning up the CPU calculations and removing some that were repeated. Also, it was best to execute the GPU commands as soon as possible. For memory transfers it seems that a few large transfers are better than lots of smaller ones (for instance, per-object transfers vs transferring the whole scene in one go).
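To illustrate the "pack once on the host, copy once per device" point with a hypothetical example (the types and helper below are made up, not the patch's code):

```
#include <vector>

struct MeshExample {
  std::vector<float> verts; /* flattened x,y,z triples */
};

/* Pack the vertex data of every mesh into one contiguous host buffer, once.
 * Each device can then receive it in a single large transfer instead of one
 * small transfer per mesh, and the packing work is never repeated per device. */
static std::vector<float> pack_vertices(const std::vector<MeshExample> &meshes)
{
  std::vector<float> packed;
  for (const MeshExample &mesh : meshes) {
    packed.insert(packed.end(), mesh.verts.begin(), mesh.verts.end());
  }
  return packed;
}
```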

I can do this (without the scene) in another PR if you wish?

I added the work in progress changes I have in #108454, which does parallel memory operations and BVH building. It doesn't do host memory fallback, but peering should be correct.

It would be very helpful if you could test that PR for performance, and look at the implementation. I don't have a multi GPU setup for testing right now, but tested it with a single GPU added twice. And then if there is still a significant performance benefit from this PR, it would be good to understand where that is exactly.

Yes, certainly I'll try this out on my 3080 + 3090 dual GPU config. It will be interesting to see how queuing up work like this compares to the alternatives.

I attached a log from rendering the junkshop with that PR, every "flush" is a synchronization point. It's probably possible to reduce those further if that is the bottleneck.

In theory you only need to synchronize at the end if you are just submitting a list of commands to each device which will be executed in order. However, I guess the flushes are needed to start the execution which results in more parallel_fors.

Also, I believe you are correct about the peering but from my understanding only texture memory gets host mapped so this should not happen for this case where it is only dealing with scene geometry.

The memory for scene geometry can also be host mapped, it's not just image textures.

Ok, I guess I understood that code wrong. Looks like I'll need to go over it again to make sure I understand it correctly.

It's a bit of a trade off between having fewer parallel_for's and trying to overlap CPU work with GPU work.

Is there any overlap between CPU and GPU work currently? From what I can tell this PR is still using blocking memory copy calls. With queued memory operations like #108454, using async memory copy is more practical I think.

My understanding is that this PR does not do something like that; if it does, where does that happen?

The more GPUs you have the more important it was to just have one thread per device to remove the synchronization that occurred due the parallel_for's. The scene changes were to separate the CPU side calculations from the GPU specific work so that each GPU thread would not repeat work. For instance calculating the normals only needs to be done once not per device. This also had the side effect of cleaning up the CPU calculations and removing some that were repeated.

Removing such a repeated CPU normal calculation should be submitted as a separate PR; that kind of thing should be easy to review and land quickly. And then we can also better understand what the actual impact of parallel copies is.

Also it was best to execute the GPU commands a soon as possible. For memory transfers it seems that a few large transfers are better than lots of smaller ones (for instance per object transfers vs transferring the whole scene in one go) .

I thought we were doing synchronous memory copies, so doing it earlier would not matter, but maybe CUDA makes a copy of the host-side memory in some cases? Still, the best solution here seems to be to do async memory copies then.

In theory you only need to synchronize at the end if you are just submitting a list of commands to each device which will be executed in order. However, I guess the flushes are needed to start the execution which results in more parallel_fors.

For some things like displacement shader evaluation you do need some synchronization, since that is computed on one device and then affects the BVH build of all devices. But after all shader evaluation is done it could all be queued in principle.
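For reference, a small sketch of the async-copy pattern being discussed, using the plain CUDA runtime API rather than Cycles' own device abstraction; the function names here are just for the example:

```
#include <cstddef>
#include <cuda_runtime.h>

/* Enqueue a host-to-device copy on a stream without blocking the host.
 * For the copy to be truly asynchronous the host buffer should be pinned,
 * e.g. allocated with cudaMallocHost or registered with cudaHostRegister. */
static void upload_async(void *device_ptr, const void *host_ptr, size_t size, cudaStream_t stream)
{
  cudaMemcpyAsync(device_ptr, host_ptr, size, cudaMemcpyHostToDevice, stream);
}

/* Block the host only at the chosen synchronization point, after all
 * queued copies (and kernels) on the stream have been submitted. */
static void sync_point(cudaStream_t stream)
{
  cudaStreamSynchronize(stream);
}
```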

William Leeson added 1 commit 2023-06-05 11:00:05 +02:00
William Leeson added 1 commit 2023-06-12 11:05:25 +02:00
William Leeson added 1 commit 2023-06-26 14:20:11 +02:00
William Leeson added 1 commit 2023-06-29 18:23:48 +02:00
William Leeson added 1 commit 2023-07-04 16:09:15 +02:00
William Leeson added 2 commits 2023-07-06 12:07:01 +02:00
William Leeson added 1 commit 2023-07-07 15:56:34 +02:00
William Leeson added 1 commit 2023-07-10 16:45:26 +02:00
William Leeson added 1 commit 2023-07-17 10:01:51 +02:00
William Leeson added 1 commit 2023-07-24 10:59:06 +02:00
William Leeson added 1 commit 2023-07-28 16:09:24 +02:00
William Leeson added 1 commit 2023-08-01 12:59:43 +02:00
William Leeson added 3 commits 2023-08-07 10:31:07 +02:00
William Leeson added 1 commit 2023-08-09 15:48:46 +02:00
William Leeson added 1 commit 2023-08-17 13:38:35 +02:00
This pull request has changes conflicting with the target branch.
  • intern/cycles/device/device.h
  • intern/cycles/device/metal/device_impl.mm
  • intern/cycles/device/metal/queue.mm
  • intern/cycles/device/multi/device.cpp
  • intern/cycles/scene/geometry.cpp
  • intern/cycles/scene/geometry_attributes.cpp
  • intern/cycles/scene/geometry_bvh.cpp
  • intern/cycles/scene/geometry_mesh.cpp
  • intern/cycles/scene/scene.cpp

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u geometry_update:leesonw-geometry_update
git checkout leesonw-geometry_update