Refactor geometry update for better parallelization and upload performance #105403
Reference: blender/blender#105403
Why
Cycles' geometry upload and preparation can be refactored to be easier to understand and more efficient. By creating a single parallel_for per device, instead of one thread per copy operation, the overhead of handling threading is reduced, improving performance. Splitting the host-side work from the device-side work also makes it easier to distribute the workload. With the current code, some calculations are performed twice or unnecessarily: they are only needed for specific tasks but are run regardless, because it is hard to tell from the code when they are required. This refactor also allows for further potential optimizations in the future.
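The threading change described above can be illustrated with a minimal sketch (not the Cycles implementation; `FakeDevice` and `upload_all` are hypothetical names, and plain `std::thread` stands in for the task system): one worker per device processes all of that device's copies, so threading overhead scales with the number of devices rather than the number of copy operations.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for a render device; `copies_done` counts the
// upload operations performed on its behalf.
struct FakeDevice {
  std::atomic<int> copies_done{0};
};

// Launch exactly one worker per device; each worker loops over that
// device's pending copy operations. Thread management cost is
// O(num_devices), not O(num_devices * num_copies).
inline void upload_all(std::vector<FakeDevice> &devices, int copies_per_device)
{
  std::vector<std::thread> workers;
  workers.reserve(devices.size());
  for (FakeDevice &dev : devices) {
    workers.emplace_back([&dev, copies_per_device]() {
      for (int i = 0; i < copies_per_device; i++) {
        dev.copies_done.fetch_add(1); /* stand-in for one copy operation */
      }
    });
  }
  for (std::thread &t : workers) {
    t.join();
  }
}
```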
What
This results in the program flow looking like the following:
----HOST SIDE CALCULATIONS AND SETUP------
--- only run on the first device ---
----DEVICE SPECIFIC BEGIN----
----DEVICE SPECIFIC END----
----HOST SIDE BELOW THIS POINT
The original flow and the new flow are shown in the files below. To achieve this, a DeviceScene is created for each device. However, as this only updates the geometry device update (and not all uses of the DeviceScene), the old DeviceScene needs to remain so that the other buffers can still be handled by the multi-device. The host-side pointers are shared by all the DeviceScenes, so they only need to be packed once. This is done using a method that updates the host pointer of one DeviceScene from an existing DeviceScene.
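The host-pointer sharing described above can be sketched as follows. This is not the Cycles API; `HostBuffer`, `SubDeviceScene`, `pack`, and `assign_host_from` are illustrative names for the idea that every per-device scene references the same packed host buffer, so the packing work happens only once.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Illustrative: one packed host-side buffer shared by all devices.
struct HostBuffer {
  std::vector<float> data;
};

struct SubDeviceScene {
  std::shared_ptr<HostBuffer> host;

  // Pack the host data (expensive; should happen once overall).
  void pack()
  {
    host = std::make_shared<HostBuffer>();
    host->data = {1.0f, 2.0f, 3.0f};
  }

  // Share the already-packed host pointer from another scene, mirroring
  // the "update the host pointer of one DeviceScene from an existing
  // DeviceScene" method described in the PR text.
  void assign_host_from(const SubDeviceScene &src)
  {
    host = src.host;
  }
};
```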
I have created a spreadsheet here comparing this patch versus Blender main. A chart of the results so far is available in the images below. I have also created another spreadsheet here comparing it with both the original Blender implementation and the alternative method #107552. This patch is "geom_update" and the other is "parallel_copy" in the graphs.
Changed title from WIP:Refactor geometry update for better parallelization and upload performance to WIP: Refactor geometry update for better parallelization and upload performance.
@brecht Ok, I think this is close to ready. It would be good to get an overall impression, even if not a complete review, when you have time. I'll hopefully have performance data sometime this week.
Changed title from WIP: Refactor geometry update for better parallelization and upload performance to Refactor geometry update for better parallelization and upload performance.
I see a bunch of commented code and benchmarking utilities in business logic.
Changed title from Refactor geometry update for better parallelization and upload performance to WIP: Refactor geometry update for better parallelization and upload performance.
I'm not going to have time this week, but will look next week.
@blender-bot build
This part I'm not sure about.
Right now we can build the BVH once for multi-device rendering on CUDA, HIP, Metal, OneAPI. With native hardware ray-tracing this may become impossible, but that usually comes with GPU accelerated tree building so it's not so bad.
This performance regression I would rather avoid.
See the buildbot log for build errors, build warnings and test failures to be fixed.
I did not go through every line of code in `geometry.cpp` and `geometry_additions.cpp`, but overall the code could be split into `geometry_attributes.cpp`, `geometry_bvh.cpp`, `geometry_mesh.cpp`, .. or something along those lines. And then each of those can handle as much as possible related to that part of the data, with consistent function names for the various stages.

@ -887,7 +887,6 @@ typedef enum {
typedef struct _nvrtcProgram* nvrtcProgram;
Unrelated whitespace change.
@ -681,7 +682,6 @@ static int cuewNvrtcInit(void) {
return result;
}
Unrelated whitespace change.
@ -68,6 +68,7 @@ typedef void* DynamicLibrary;
static DynamicLibrary cuda_lib;
static DynamicLibrary nvrtc_lib;
static DynamicLibrary nvtx_lib;
This probably comes from another patch and should not be part of this?
@ -125,6 +125,7 @@ void BlenderSession::create_session()
session = new Session(session_params, scene_params);
session->progress.set_update_callback(function_bind(&BlenderSession::tag_redraw, this));
session->progress.set_cancel_callback(function_bind(&BlenderSession::test_cancel, this));
session->progress.set_updates(!background);
This seems like a change to review as a separate PR?
@ -14,1 +13,4 @@
const Device *device_)
: BVH(params_, geometry_, objects_), device(device_)
{
// Resize the sub-bvh container to match the number of devices
Maybe there can be some abstraction at this level to avoid building the BVH multiple times when not needed?
@ -149,6 +149,15 @@ void CPUDevice::mem_copy_to(device_memory &mem)
}
}
void CPUDevice::mem_copy_to(device_memory &mem, size_t, size_t offset)
Silence unused argument warnings with `, size_t, size_t)`.
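The suggestion is to leave the ignored parameters unnamed in the overload that does not use them, which silences `-Wunused-parameter` without casts or attributes. A minimal sketch, with `device_memory_stub` as an illustrative stand-in type:

```cpp
#include <cassert>
#include <cstddef>

// Illustrative stand-in for device_memory.
struct device_memory_stub {
  int full_copies = 0;
};

// The CPU backend ignores size/offset for a full copy, so the
// parameters are left unnamed: no unused-parameter warning is emitted.
inline void mem_copy_to(device_memory_stub &mem, std::size_t /*size*/, std::size_t /*offset*/)
{
  mem.full_copies++; /* bookkeeping stands in for the actual copy */
}
```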
@ -31,0 +30,4 @@
int flags = CUEW_INIT_CUDA;
int cuew_result = cuewInit(flags);
This change does nothing (anymore?).
@ -548,3 +548,3 @@
const CUDAContextScope scope(this);
cuda_assert(cuMemcpyHtoD((CUdeviceptr)device_pointer, host_pointer, size));
cuda_assert(cuMemcpyHtoD((CUdeviceptr)device_pointer + offset,
Checking the definition of `CUdeviceptr` this seems to work, but it's not obvious just by looking at the code, or for all device types. Maybe better to use `reinterpret_cast<unsigned char *>()` in all device code, to make it clear?

@ -551,0 +554,4 @@
void CUDADevice::host_mem_alloc(size_t size, int aligment, void **p_mem) {
CUDAContextScope scope(this);
CUresult result = cuMemAllocHost(p_mem, size);
The CUDA docs for `cuMemAllocHost` say:

It's not clear to me that we should be using this for allocating the entire scene representation on the CPU? Seems like that could lead to Blender/Cycles running out of memory faster than before.
@ -551,0 +555,4 @@
void CUDADevice::host_mem_alloc(size_t size, int aligment, void **p_mem) {
CUDAContextScope scope(this);
CUresult result = cuMemAllocHost(p_mem, size);
if(result != CUDA_SUCCESS) {
This should not have custom error handling. We have `cuda_assert` for error reporting, so that an error is set in the device and reported to the user.

@ -566,8 +584,13 @@ void CUDADevice::mem_alloc(device_memory &mem)
void CUDADevice::mem_copy_to(device_memory &mem)
Can this code be deduplicated by calling `mem_copy_to(mem, mem.memory_size(), 0)`?

@ -568,3 +586,2 @@
if (mem.type == MEM_GLOBAL) {
global_free(mem);
global_alloc(mem);
if ((mem.device_size < mem.memory_size()) || (!mem.device_pointer)) {
This means memory allocations never decrease in size? Shouldn't it reallocate whenever the size changes?
@ -584,0 +607,4 @@
void CUDADevice::mem_copy_to(device_memory &mem, size_t size, size_t offset)
{
if (mem.type == MEM_GLOBAL) {
if ((mem.device_size < mem.memory_size()) || (!mem.device_pointer)) {
Same comment as above.
@ -766,0 +774,4 @@
}
}
void GPUDevice::generic_copy_to(device_memory &mem, size_t size, size_t offset)
Similar comment about code deduplication as above.
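Both review points above (forwarding the full copy to the ranged overload, and using `!=` so allocations can shrink) can be sketched together. This is an illustrative stand-in, not the CUDA backend; `DevMem` is a hypothetical type and the actual device copy is elided.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative stand-in for device_memory.
struct DevMem {
  std::vector<unsigned char> host;  // host-side data
  std::size_t device_size = 0;      // current device-side allocation
  std::size_t memory_size() const { return host.size(); }
};

// Ranged copy: reallocate on ANY size change (!=), so the device
// allocation also shrinks when the host data shrinks.
inline void mem_copy_to(DevMem &mem, std::size_t size, std::size_t offset)
{
  if (mem.device_size != mem.memory_size()) {
    mem.device_size = mem.memory_size(); /* stand-in for free + alloc */
  }
  (void)size;
  (void)offset; /* the device copy itself is elided in this sketch */
}

// Full copy deduplicated: just the ranged copy over the whole buffer.
inline void mem_copy_to(DevMem &mem)
{
  mem_copy_to(mem, mem.memory_size(), 0);
}
```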
@ -8,3 +8,3 @@
#include "bvh/params.h"
#include "scene/scene.h"
The intent for the device API was to not be aware of `Scene` or `DeviceScene`, though that is already broken somewhat. Ideally we'd avoid it entirely, but here at least it should not include `scene/scene.h` and instead use a `class DeviceScene;` forward declaration.

@ -217,2 +222,4 @@
return 0;
}
virtual device_ptr find_matching_mem(device_ptr key, Device * /*sub*/)
Add a comment explaining what this does.
@ -290,8 +300,11 @@ class Device {
friend class DeviceServer;
friend class device_memory;
virtual void host_mem_alloc(size_t size, int alignment, void **p_mem);
This can return
void *
instead of usingvoid **p_mem
.@ -369,17 +379,78 @@ template<typename T> class device_vector : public device_memory {
assert(data_elements > 0);
}
device_vector(Device *device,
TODO: what is this constructor for?
@ -41,3 +45,3 @@
/* Always add CPU devices at the back since GPU devices can change
* host memory pointers, which CPU uses as device pointer. */
SubDevice *sub;
SubDevice *sub = new SubDevice;
Use `unique_ptr` and `make_unique` instead of `new` in new code.

@ -49,2 +52,2 @@
devices.emplace_front();
sub = &devices.front();
devices.emplace_back(sub);
// find first CPU device and swop it with the new device
swop -> swap
@ -51,0 +54,4 @@
// to keep the CPU devices at the end of the vector.
int last = devices.size() - 1;
int o = last;
while ((o > 0) && (devices[o - 1]->device->info.type == DEVICE_CPU)) {
If we ensure the CPU is always the last device in the vector, there is no need to have a while loop searching for it.
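The invariant being suggested can be sketched as follows. This is illustrative only (`DeviceList`, `Sub`, and `num_cpu` are hypothetical names): if CPU devices always occupy the tail of the vector, new GPU devices can be inserted just before that tail without any search loop.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

enum class DevType { GPU, CPU };

struct Sub {
  DevType type = DevType::GPU;
};

// Keeps CPU devices at the back by construction, so no while loop is
// needed to find the insertion point.
struct DeviceList {
  std::vector<std::unique_ptr<Sub>> devices;
  std::size_t num_cpu = 0;  // CPU devices occupy the last num_cpu slots

  void add(DevType type)
  {
    auto sub = std::make_unique<Sub>();
    sub->type = type;
    if (type == DevType::CPU) {
      devices.push_back(std::move(sub)); /* CPUs go at the very back */
      num_cpu++;
    }
    else {
      /* Insert just before the CPU tail, without scanning. */
      devices.insert(devices.end() - num_cpu, std::move(sub));
    }
  }
};
```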
@ -1041,2 +637,2 @@
}
}
/* copy svm attributes to device */
dscene->attributes_map.copy_to_device_if_modified();
Looking at code like this, it feels like maybe `DeviceScene` should somehow have a list of the `device_memory` it contains. And then perhaps that could be used to have a function to transfer all modified memory, rather than manually doing it here.

@ -2088,3 +1487,1 @@
if (geom->is_mesh()) {
Mesh *mesh = static_cast<Mesh *>(geom);
mesh->subd_attributes.clear_modified();
size_t n_scenes = scene->dscenes.size();
n_scenes -> num_scenes is the convention.
@ -2091,0 +1487,4 @@
size_t n_scenes = scene->dscenes.size();
// Parallel upload the geometry data to the devices and
// calculate or refit the BVHs
tbb::parallel_for(size_t(0),
Use just `parallel_for`, not `tbb::parallel_for`. We abstract such things in case we want to override them.

@ -2091,0 +1489,4 @@
// calculate or refit the BVHs
tbb::parallel_for(size_t(0),
n_scenes,
[this,
Can we automatically capture all these variables instead of doing it manually?
@ -162,0 +185,4 @@
size_t *motion_vert_offsets;
};
struct AttributeSizes {
Can we avoid having these `AttributeSizes` and `GeometrySizes` structs? This feels like something we should be able to set on the `device_memory` in `DeviceScene` instead of tracking in a separate struct. Or maybe that's not a good idea because every device has its own `DeviceScene`?

@ -0,0 +1,1045 @@
/* SPDX-License-Identifier: Apache-2.0
This file needs a better name, or probably code needs to split across files in a different way. This doesn't seem like a logical grouping.
Did you forget to remove this file?
Yes, I have removed it now :-/
@ -0,0 +34,4 @@
/*
* Clears all tags used to indicate the the shader needs to be updated.
*/
void GeometryManager::clearShaderUpdateTags(Scene *scene)
Don't use camelCase function names.
@ -0,0 +121,4 @@
progress.set_status("Updating Mesh", "Computing attributes");
// SHOULD NOT ALLOC ONLY ALLOC IF MORE SPACE IS NEEDED
I guess this is a TODO comment to be resolved.
@ -660,11 +660,8 @@ void ObjectManager::device_update_transforms(DeviceScene *dscene, Scene *scene,
}
});
if (progress.get_cancel()) {
Why remove this?
In general we want to be able to cancel rendering in the middle of updates so users don't have to wait too long.
@ -36,0 +37,4 @@
* checks the progress for if a cancel has been requested and also
* the device to see if an error has occurred.
*/
bool Scene::progressErrorCheck(Progress &progress, Device *device) {
I'd name this `check_cancel_update` or something like that; `progressErrorCheck` to me reads as if it's only checking for errors.

@ -36,0 +39,4 @@
*/
bool Scene::progressErrorCheck(Progress &progress, Device *device) {
bool status = false;
if (!background && progress.get_updates()) {
Cancelling quickly is important for F12 renders as well, which are "background" in Cycles.
@ -104,1 +116,4 @@
{
/* Create a DeviceScene for each device */
device->foreach_device([this](Device *sub_device) {
DeviceScene *sub_dscene = new DeviceScene(sub_device);
Use `unique_ptr` instead of `new` in new code.

@ -107,0 +127,4 @@
mesh_times = new double[device_count];
attrib_times = new double[device_count];
object_bvh_times = new double[device_count];
scene_bvh_times = new double[device_count];
Use `vector` instead of `new[]`. Also perhaps it's better to put these in a struct instead of separate arrays.

@ -17,3 +28,4 @@
CCL_NAMESPACE_BEGIN
class SpinLock {
Don't implement a custom spinlock here, or at all if we can avoid it since getting that right can be surprisingly difficult.
@ -41,2 +93,4 @@
error_message = "";
cancel_cb = function_null;
updates = true;
// update_call = true;
Don't leave commented-out code in.
Removed
@ -14,1 +13,4 @@
const Device *device_)
: BVH(params_, geometry_, objects_), device(device_)
{
// Resize the sub-bvh container to match the number of devices
I added a variable and mutex to indicate if it was being built or has been built. This ensures that it is built only once. However, I put this code in build_bvh in device.cpp as the BVHMulti can have different BVH types within it and this only applies to BVH2.
@ -548,3 +548,3 @@
const CUDAContextScope scope(this);
cuda_assert(cuMemcpyHtoD((CUdeviceptr)device_pointer, host_pointer, size));
cuda_assert(cuMemcpyHtoD((CUdeviceptr)device_pointer + offset,
Changing this to use `reinterpret_cast<unsigned char *>()` results in having to cast it back to `CUdeviceptr` as follows: `reinterpret_cast<CUdeviceptr>(reinterpret_cast<unsigned char *>(device_pointer))`, otherwise the following error occurs:

which is fine, but it seems a bigger mouthful to swallow than just using `CUdeviceptr` in the first place.

Ok, fine to leave it as is then.
@ -551,0 +554,4 @@
void *CUDADevice::host_mem_alloc(size_t size, int aligment) {
void *p_mem = NULL;
CUDAContextScope scope(this);
That is a possibility; I can switch it off for now. In our usage we never had that happen, but most of the systems were not memory constrained. Not putting it in pinned memory ends up with a double copy, as the driver then copies it to pinned memory before copying to the device. I switched this back to using regular memory for now.
@ -581,6 +607,29 @@ void CUDADevice::mem_copy_to(device_memory &mem)
}
}
void CUDADevice::mem_copy_to(device_memory &mem, size_t size, size_t offset)
This gives a great speed boost for interactive preview. I'll remove it from this PR and I can add it in a separate PR later.
@ -8,3 +8,3 @@
#include "bvh/params.h"
#include "scene/scene.h"
Ok, I replaced the include with a forward declaration. In another patch, instead of uploading the geometry data a second time, we use the data in the DeviceScene, which is already uploaded. This saves both memory and time. Would that not be acceptable?
@ -377,4 +449,2 @@
{
size_t new_size = size(width, height, depth);
if (new_size != data_size) {
I reverted this change also.
@ -369,1 +379,4 @@
device_vector(Device *device,
const char *name,
void *p_mem,
Yes, it looks like it's not used in this PR so I removed it.
@ -833,0 +856,4 @@
}
else {
generic_copy_to(mem, size, offset);
}
I reverted this change also.
@ -41,3 +45,3 @@
/* Always add CPU devices at the back since GPU devices can change
* host memory pointers, which CPU uses as device pointer. */
SubDevice *sub;
SubDevice *sub = new SubDevice;
Done. I also replaced the device creation with a `unique_ptr<>`.
@ -51,0 +54,4 @@
// to keep the CPU devices at the end of the vector.
int last = devices.size() - 1;
int o = last;
while ((o > 0) && (devices[o - 1]->device->info.type == DEVICE_CPU)) {
Yes, I fixed this up.
@ -162,0 +185,4 @@
size_t *motion_vert_offsets;
};
struct AttributeSizes {
I'll see what I can do. It is this way because in the original implementation the device_memory is not always exactly the same size (sometimes it is bigger), so we need a place to track the actual size of the upload, otherwise more memory is uploaded than needed.
I don't think putting it in DeviceScene is great because, as you say, that information would then be replicated for each device.
I'll investigate more to see if there is somewhere else where this information would be logical.
I still think it's important to eliminate these; tracking this separately is really error prone.

I don't understand what you mean by "device_memory is not always exactly the same size (sometimes it is bigger)". Is this related to the device memory not shrinking with `mem.device_size < mem.memory_size()`, which I think we should change to `!=` anyway?

Even if we were not shrinking it, that should be an implementation detail of `device_vector`, similar to how it's an implementation detail of `std::vector`. It's not something that should be tracked externally.

@ -36,0 +37,4 @@
* checks the progress for if a cancel has been requested and also
* the device to see if an error has occurred.
*/
bool Scene::progressErrorCheck(Progress &progress, Device *device) {
I changed this to check_cancel_update.
@ -36,0 +39,4 @@
*/
bool Scene::progressErrorCheck(Progress &progress, Device *device) {
bool status = false;
if (!background && progress.get_updates()) {
I changed this to only happen if updates are on, i.e. I removed the !background condition.
@ -107,0 +127,4 @@
mesh_times = new double[device_count];
attrib_times = new double[device_count];
object_bvh_times = new double[device_count];
scene_bvh_times = new double[device_count];
I replaced these with a vector and moved them into a struct.
@ -17,3 +28,4 @@
CCL_NAMESPACE_BEGIN
class SpinLock {
I replaced this with the one in utils/thread.h and removed this code.
@ -41,2 +93,4 @@
error_message = "";
cancel_cb = function_null;
updates = true;
// update_call = true;
Removed
@ -0,0 +1,407 @@
/* SPDX-License-Identifier: Apache-2.0
I split this into geometry.cpp, geometry_mesh.cpp, geometry_attributes.cpp and geometry_bvh.cpp
@ -0,0 +34,4 @@
/*
* Clears all tags used to indicate the the shader needs to be updated.
*/
void GeometryManager::clearShaderUpdateTags(Scene *scene)
I removed the camelCase function names.
@ -0,0 +121,4 @@
sub_dscene->tri_verts.assign_mem(dscene->tri_verts);
sub_dscene->tri_verts.tag_modified();
}
{
Removed as it was an old comment
Changed title from WIP: Refactor geometry update for better parallelization and upload performance to Refactor geometry update for better parallelization and upload performance.
@brecht I hope this is closer to what you are looking for. I have removed the WIP as I think it is ready now :-)
@blender-bot package
Package build started. Download here when ready.
On the design side, I'm still confused about why parallelizing memory copying to multiple devices requires such deep changes on the host side, and why this can't be abstracted more inside `Device` and `MultiDevice`.

CUDA and HIP already have a `MemMap device_mem_map` that keeps track of all `device_memory`. So if `device_memory` contains a `bool modified_`, that should be enough information to implement a method like `Device::copy_modified_memory_to_device` that just loops over all memory and copies where needed? And then that can be done inside a parallel for in `MultiDevice`.

Or am I missing a reason for why this is not practical?
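The reviewer's proposal above can be sketched minimally. This is not Cycles code: `MemEntry` and `DeviceSketch` are illustrative stand-ins for `device_memory` entries in a `device_mem_map`, each carrying a modified flag, with one method that copies whatever changed.

```cpp
#include <cassert>
#include <map>

// Illustrative stand-in for one device_memory entry.
struct MemEntry {
  bool modified = false;
  int copies = 0;  // how many uploads this entry has received
};

struct DeviceSketch {
  // Keyed by a fake device_ptr, mirroring the MemMap the CUDA and HIP
  // backends already keep.
  std::map<int, MemEntry> device_mem_map;

  // Loop over all tracked memory and copy only what is modified;
  // MultiDevice could call this per sub-device inside a parallel for.
  void copy_modified_memory_to_device()
  {
    for (auto &entry : device_mem_map) {
      if (entry.second.modified) {
        entry.second.copies++; /* stand-in for the actual upload */
        entry.second.modified = false;
      }
    }
  }
};
```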
I'm having trouble reviewing the geometry update changes, since it's both moving code and changing the logic. It would be very helpful if you could submit a PR that just moves code to new files and order functions to be similar to the new code. Then I can quickly review and approve that, and then when this PR is updated hopefully it will show a readable delta between old and new code?
@ -125,6 +125,7 @@ void BlenderSession::create_session()
session = new Session(session_params, scene_params);
session->progress.set_update_callback(function_bind(&BlenderSession::tag_redraw, this));
session->progress.set_cancel_callback(function_bind(&BlenderSession::test_cancel, this));
session->progress.set_updates(!headless);
It's still not clear to me why showing updates should ever be disabled. Regardless of whether rendering is headless or not, it's good to see what is happening so Cycles doesn't look like it's stuck. If performance is an issue we can look at other ways to optimize this.
Can you explain the thinking behind this?
Regardless, I think this change and the changes in `progress.h` should be submitted as a separate PR. I think that can be reviewed and landed separately from this; there doesn't seem to be a dependency in either direction.

The reason behind this is that with many threads the mutexes in the progress calls were causing large stalls, so this was an easy way to switch them on and off. The stalls were frequent, especially where the progress update was called inside a loop.
I have removed this for now.
@ -66,3 +66,3 @@
vector<Geometry *> geometry;
vector<Object *> objects;
bool built;
Add empty line after this, initialize to false directly here.
Done
@ -24,0 +36,4 @@
{
int id = device->device_number(subdevice);
resize_sub_bvhs_if_needed(id);
sub_bvhs[id] = bvh;
It's not clear to me how the previous BVH that was stored here gets freed, or how the ownership here works in general. Ideally `unique_ptr` would be used to make that really clear.

I have changed this to use `unique_ptr` now.
BVH2 *const bvh2 = static_cast<BVH2 *>(bvh);
if (refit) {
bvh2->refit(progress);
static std::atomic<int> building{0};
Don't use static variables like this. There can be multiple Cycles sessions active at once that should not be blocked by each other here.
Doh, I'd forgotten there could be multiple sessions :-/ I'll look into moving this into the top level BVH.
I have updated this to storing the building count in the top level BVH so it should work with multiple sessions now.
@ -55,2 +54,2 @@
if (refit) {
bvh2->refit(progress);
static std::atomic<int> building{0};
/* Top level BVH2 build must wait on all other BVH2 builds to finish
I don't understand why all this complicated synchronization logic is here. Don't we already guarantee that the top level BVH is built after the bottom level BVHs from the caller functions?
I would think all that is needed here is a double checked lock, no condition variables or waiting.
No, it only stops 2 different threads from building the same BVH. However, some threads can reach this point before all the object BVHs are built so it needs to wait on those BVHs to be built before it can proceed.
I don't think this would work in this case.
I'll look into moving the atomic into the top level BVH that way it won't block multiple sessions.
This is needed because if more than 1 device uses BVH2 they all work on building it. This happens when you combine GPU + CPU rendering or more than one GPU that uses BVH2.
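The "build once, everyone else waits" part of the discussion can be sketched with a double-checked lock and a condition variable stored in the shared object (rather than a static, so multiple sessions are not blocked by each other). This is illustrative only: `SharedBuild` is a hypothetical type, the expensive BVH2 build is elided, and the wait-for-bottom-level-BVHs coordination is not modeled.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// State lives in the shared BVH object, not in a static, so separate
// Cycles sessions each get their own synchronization.
struct SharedBuild {
  std::mutex mutex;
  std::condition_variable cond;
  bool built = false;
  int build_count = 0;  // how many threads actually ran the build

  void build_once()
  {
    std::unique_lock<std::mutex> lock(mutex);
    if (built) {
      return; /* fast path: already built */
    }
    if (build_count == 0) {
      build_count++; /* this thread claims the build, under the lock */
      lock.unlock();
      /* ... expensive BVH2 build would happen here ... */
      lock.lock();
      built = true;
      cond.notify_all(); /* wake every thread waiting on the build */
    }
    else {
      cond.wait(lock, [this] { return built; });
    }
  }
};
```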
@ -531,3 +536,2 @@
if (mem.type == MEM_GLOBAL) {
global_free(mem);
global_alloc(mem);
if ((mem.device_size < mem.memory_size()) || (!mem.device_pointer)) {
Same comment as before: change `<` to `!=` to ensure memory also decreases.

Done
@ -0,0 +90,4 @@
}
/*
* Clears the modified tags for all elements of the device scene
It's not all of them? For example `object_motion_pass` is not done here.

The function has a generic name but then only works on some subset related to geometry; that's confusing.

I renamed the function to `device_free_geometry`.
@ -0,0 +128,4 @@
DeviceScene *dscene,
const GeometrySizes *p_sizes)
{
if (p_sizes->tri_size != 0) {
Are these conditionals checking for the size really needed? Why can't it just always do this? Don't all of the individual device vectors have sufficient info to skip operations if they are unnecessary?
I believe this was just replicating similar code in geometry.cpp:1092, in the method GeometryManager::device_update_mesh, with an optimization of moving the copy inside the if. I'll remove these and rely on the device_memory instead.
@ -630,3 +630,3 @@
}
double Session::get_estimated_remaining_time() const
double Session::get_estimated_remaining_time() //const
Add back const? The implementation did not change.
Done
@ -7,0 +11,4 @@
#else
#define MUTEX thread_mutex
#define SCOPED_LOCK(mutex) thread_scoped_lock scopedlock(mutex)
#endif
I'd rather not use macros, especially not with such generic names. If a spin lock is faster we might as well just use that.
If there's really a need for both to be switchable, I prefer something like this inside the class:
I removed this.
@blender-bot package
Package build started. Download here when ready.
I have updated the scene, geometry, object BVH and scene BVH update times graphs and updated the results in the spreadsheet.
@blender-bot package
Package build started. Download here when ready.
I've started looking into this in detail, trying to split off changes in separate commits to understand what is going on. I started doing that in #108454, but hit against deeper design issues.
For me the added code complexity in scene update in this PR and #107552 is still way too much. For me the whole idea of making the multi device handle this would be to keep the scene update code almost the same. Yet #107552 still has multiple `DeviceScene` instances, still tracks various memory sizes separately, and has various other changes. Why is that still there? Was it just left in because it was an experiment and not cleaned up yet, or is there something that makes this necessary?

I can't accept this level of code complexity for this optimization; it has to be simpler or we should not do it.
Additionally, I believe the multiple device scenes do not work correctly with host memory fallback and peering. Host memory fallback cannot be done purely in parallel; there needs to be some coordination between devices to share the same pinned memory, instead of each allocating its own copy in a dedicated `device_memory` per device. Peering is currently handled in the multi device class, and also seemingly not handled here, as each device has its own `device_memory`, not one per peering island.

I believe a solution mostly handled by the device API is possible: queue `mem_alloc`, `mem_copy_to` and `mem_zero` operations in the multi device, and then execute them in parallel across devices at a few synchronization points, like before BVH build, displacement and after scene updates. The `device_memory` needs to be shared between devices, and it must be mutex locked.

I started trying to implement this, but especially the host memory fallback sharing is more work than I hoped. So before I spend more time on this, I'd like to understand if I am missing something. Especially regarding why #107552 still has so many changes in scene update, I feel like there may be a reason for this that I do not understand.
Yes, the reason they are similar is that I decided to keep the scene changes as-is so that the performance differences could be compared directly. There are some performance wins from the reorganization which would have favored the first PR. To bring this over to the current code base, it should be possible to bring over just the multi-device changes without the scene changes, which would be relatively simple. However, as you can see from the performance graphs, the wins are quite a bit smaller due to the change in the parallel structure from one `parallel_for` to many. This causes a lock step which the single per-device parallel structure does not have. I can do this (without the scene) in another PR if you wish? Also, I believe you are correct about the peering, but from my understanding only texture memory gets host mapped, so this should not happen for this case where it is only dealing with scene geometry.

Ok, having those other performance wins all in one PR makes it difficult to compare. I want to see a basic parallel memory copy implementation first, and leave further refactoring for another PR. It's too difficult to understand it all together.
I don't think you need many `parallel_for`s if the memory operations are queued up in the multi device. One synchronization point before BVH build maybe, and one at the end of scene update? Even that could be made to work so one device does not block the other.

Other synchronization points would be for shader evaluation, like displacement, transparency baking and the world importance map. But those only need to be computed once, so they would not be within a `parallel_for` over devices anyway.

So if that implementation has many `parallel_for` invocations, I expect that can be solved by improving the implementation. But I might be missing something.

I added the work-in-progress changes I have in #108454, which does parallel memory operations and BVH building. It doesn't do host memory fallback, but peering should be correct.
It would be very helpful if you could test that PR for performance, and look at the implementation. I don't have a multi GPU setup for testing right now, but tested it with a single GPU added twice. And then if there is still a significant performance benefit from this PR, it would be good to understand where that is exactly.
I attached a log from rendering the junkshop with that PR; every "flush" is a synchronization point. It's probably possible to reduce those further if that is the bottleneck.
The memory for scene geometry can also be host mapped, it's not just image textures.
Yes, I can understand that it's a lot of changes in one PR.
It's a bit of a trade-off between having fewer `parallel_for`s and trying to overlap CPU work with GPU work. The more GPUs you have, the more important it was to have just one thread per device, to remove the synchronization that occurred due to the `parallel_for`s. The scene changes were to separate the CPU-side calculations from the GPU-specific work so that each GPU thread would not repeat work. For instance, calculating the normals only needs to be done once, not per device. This also had the side effect of cleaning up the CPU calculations and removing some that were repeated. Also, it was best to execute the GPU commands as soon as possible. For memory transfers it seems that a few large transfers are better than lots of smaller ones (for instance per-object transfers vs transferring the whole scene in one go).

Yes, certainly I'll try this out on my 3080 + 3090 dual-GPU config. It will be interesting to see how queuing up work like this compares to the alternatives.

In theory you only need to synchronize at the end if you are just submitting a list of commands to each device which will be executed in order. However, I guess the flushes are needed to start the execution, which results in more `parallel_for`s.

Ok, I guess I understood that code wrong. Looks like I'll need to go over it again to make sure I understand it correctly.
Is there any overlap between CPU and GPU work currently? From what I can tell this PR is still using blocking memory copy calls. With queued memory operations like #108454, using async memory copy is more practical I think.
My understanding is that this PR does not do something like that, if it does where does that happen?
Removing such a repeated CPU normal calculation should be submitted as a separate PR, that kind of thing should be easy to review and land quickly. And then we can also better understand what the actual impact of parallel copies is.
I thought we were doing synchronous memory copies, so doing it earlier would not matter, but maybe CUDA makes a copy of the host-side memory in some cases? Still, the best solution here seems to be async memory copy then.
For some things like displacement shader evaluation you do need some synchronization, since that is computed on one device and then affects the BVH build of all devices. But after all shader evaluation is done it could all be queued in principle.