Cycles CPU + GPU hybrid path tracer #95687

Open
opened 2022-02-10 20:19:45 +01:00 by Sahar A. Kashi · 17 comments
Member

The main idea of the CPU + GPU path tracer is to have two full-size accumulation/render buffers. A unit of work is a single tile (the same size as the frame buffer) with a varying number of samples to render for each scheduled task.

Two accumulation threads, one for the CPU and one for the GPU, run almost independently until the sum of rendered samples reaches the required sample count (the threads only have to synchronize on the number of rendered samples and the offset for random number generation).

On every execution, each processor renders the number of samples best suited to its capabilities. The initial number of samples assigned to the GPU is higher than the number assigned to the CPU. As more rendering information becomes available, the number of samples per device is determined by how many samples it can process within an update interval. Only the GPU thread updates the display while rendering is still in progress, which allows more flexibility in choosing the launch frequency and number of samples on the CPU device. When all samples are rendered, the two buffers are merged to produce the final output (this merge is the main overhead of the method).
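
As a rough illustration (not the actual patch code), here is a minimal C++ sketch of the scheme, assuming a shared atomic sample counter as the single synchronization point between the two threads; `DeviceState`, `render_samples`, and all constants are hypothetical names:

```
// Minimal sketch of the two-thread accumulation scheme described above.
// DeviceState, render_samples and samples_per_launch are illustrative
// names, not the actual patch code or Cycles API.
#include <algorithm>
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

struct DeviceState {
  std::vector<float> accum;  // full-size accumulation buffer, one per device
  int samples_per_launch;    // adapted to the device's measured throughput
};

std::atomic<int> samples_started{0};  // shared between the two threads

// Stub for the per-device kernel launch: render `count` samples into
// dev.accum, starting the RNG at `rng_offset` so the two devices never
// duplicate a sample sequence.
static void render_samples(DeviceState &dev, int rng_offset, int count)
{
  (void)dev; (void)rng_offset; (void)count;  // device kernel launch goes here
}

static void accumulation_thread(DeviceState &dev, int total_samples)
{
  for (;;) {
    // Claim a contiguous sample range; this fetch_add is the only point
    // where the CPU and GPU threads synchronize while rendering.
    const int first = samples_started.fetch_add(dev.samples_per_launch);
    if (first >= total_samples) {
      break;
    }
    const int count = std::min(dev.samples_per_launch, total_samples - first);
    render_samples(dev, /*rng_offset=*/first, count);
  }
}

int main()
{
  DeviceState cpu{std::vector<float>(1920 * 1080 * 4), 1};
  DeviceState gpu{std::vector<float>(1920 * 1080 * 4), 8};
  std::thread tc(accumulation_thread, std::ref(cpu), 256);
  std::thread tg(accumulation_thread, std::ref(gpu), 256);
  tc.join();
  tg.join();
  // The two accumulation buffers would be merged here into the final output.
}
```

Each device repeatedly claims as many samples as it can finish in one launch, so a faster device naturally ends up rendering a larger share of the total.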


This is a quick, preliminary implementation, written without software-design considerations, as a proof of concept to show the benefit of distributing work across two processors with significantly different capabilities. The implementation has limited support: it is configured to work only with a single GPU + a single CPU, and does not support adaptive sampling, tiling, or offline rendering, to name a few.


This version does not introduce new methods for distributing samples among scheduled tasks and is not optimized to its full potential.

Author
Member

Added subscriber: @salipour

#97554 was marked as duplicate of this issue

Added subscriber: @BrianSavery
Member

Changed status from 'Needs Triage' to: 'Confirmed'

Added subscriber: @Michael-Jones

Added subscriber: @brecht

@brecht This is the patch I mentioned from AMD

Added subscribers: @fx, @Sergey

Thanks for the contribution! I briefly discussed this with @Sergey here.

Do you have any performance comparison numbers? Or analysis about where the current implementation is doing poorly in your tests?

From my understanding this does two things:

  • Split work along the samples instead of along the image dimensions. This gives a more even work distribution when parts of the image are more expensive to render than others.
  • Within one batch of samples (which in background render mode may be e.g. 64 samples), devices render 1 or more samples at a time, until all samples in the batch are done. For a sufficient number of samples, dynamically distributing like this is better than statically assigning the amount of work to each device in advance, since we don't have a good (initial) estimate of how fast each device is.

The first part is about solving the same problem as [D14014: Cycles multi-GPU distribution using slices [WIP]](https://archive.blender.org/developer/D14014) by @fx, but in a less fine-grained way. Splitting along samples can work for batch rendering with many samples at once, but not for interactive viewport rendering with just 1 or a few samples. If we have a good slicing implementation, I think that would be preferable over splitting along samples. It allows for more fine-grained distribution, as the number of pixels in a batch is much higher than the number of samples. Another advantage of splitting along pixels is that it lowers memory usage and memory traffic between devices.

For the dynamic distribution, we could imagine this working with slicing too. Merging buffers then is not adding samples but copying slices to their appropriate place in the full buffer. For offline rendering I think this approach makes a lot of sense. I guess one of the reasons we didn't do it was because we were looking for a single implementation to handle both interactive and offline rendering, but it's unclear if that's possible. I imagine there can be an implementation that's not adding too much maintenance cost, where we handle both dynamic and static distribution.
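
For concreteness, here is a hedged sketch of the two merge strategies (illustrative names, assuming accumulation buffers store per-pixel sample sums): with sample splitting, the two full-frame buffers merge by per-pixel addition, while with slicing, each device's band is only copied into place:

```
#include <algorithm>
#include <cstddef>
#include <vector>

// Sample splitting: both devices rendered the full frame; accumulation
// buffers hold sums of samples, so merging is a per-pixel addition.
void merge_by_samples(std::vector<float> &dst, const std::vector<float> &src)
{
  for (std::size_t i = 0; i < dst.size(); i++) {
    dst[i] += src[i];
  }
}

// Slicing: each device rendered a horizontal band; merging only copies
// that band, starting at row `row_begin`, into the full buffer
// (`row_stride` floats per row).
void merge_slice(std::vector<float> &full, const std::vector<float> &slice,
                 std::size_t row_begin, std::size_t row_stride)
{
  std::copy(slice.begin(), slice.end(), full.begin() + row_begin * row_stride);
}
```

The slice copy only moves each device's own pixels, which is the memory-usage and memory-traffic advantage mentioned above.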

However, I'm curious in which way this patch improves things specifically. It seems to assume GPUs are 8x faster than CPUs as the initial estimate, and for sure we could have better heuristics for this regardless of the distribution type. So I imagine with a typical CPU + GPU this already speeds up the first sample. Then for the following samples, I wonder if the dynamic distribution ends up assigning a very different amount of work to each device than the dynamic rebalancing we do now, because of a poor estimate (and lack of slicing)? Or if it's pretty similar and it's the overhead of rebalancing that makes the difference?

Author
Member

The first table below compares the rendering performance of GPU, GPU + CPU (original Blender implementation), and GPU + CPU (submitted patch). The performance-draining parts of the current implementation are mainly the constant rebalancing and the lack of convergence, which result in numerous device-host copies. So the higher the number of region-split updates, the worse the final performance. Initialization also plays a role: a bad initialization of the work distribution can cause poor performance. For example, in CPU + GPU workload distribution, a 50-50 initialization delays convergence, and GPU cycles are possibly wasted before equilibrium is reached. Additionally, the choice of initial sample count can delay convergence: when the initial number of samples is 1, the run doesn't really benefit from GPU processing, and the resulting execution time (in many cases not much better than the CPU time) informs the splitting/balancing decision even though it is not representative of the GPU's power.

![image.png](https://archive.blender.org/developer/F12863265/image.png)

The submitted patch doesn't employ any tweaking, but there are a number of areas that could be tweaked to improve performance, such as sample initialization, or assigning work only to the GPU after a certain percentage of samples has been rendered. For example, just a small change to ignore the CPU after 95% of samples are rendered results in the following:

![Tweak.JPG](https://archive.blender.org/developer/F12863278/Tweak.JPG)
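
As a sketch of the tweakable heuristics discussed in this thread (the 8:1 initial GPU:CPU estimate, throughput-based batch sizing per update interval, and the 95% CPU cutoff), with all names and constants being assumptions for illustration rather than the patch's actual code:

```
#include <algorithm>

// Per-device batch size; names and constants are illustrative assumptions.
struct DeviceSchedule {
  int samples_per_launch = 1;
};

// Start from an 8:1 GPU:CPU estimate, as discussed above.
DeviceSchedule cpu_sched{1};
DeviceSchedule gpu_sched{8};

// After each launch, resize the next batch to what the device can finish
// within one display-update interval: samples/second * interval.
int next_batch(int rendered, double seconds, double update_interval)
{
  return std::max(1, static_cast<int>(rendered / seconds * update_interval));
}

// Tweak from the table above: stop scheduling CPU work once 95% of the
// samples are done, so the final merge never waits on one last slow
// CPU batch.
bool schedule_cpu(int samples_done, int total_samples)
{
  return samples_done < static_cast<int>(0.95 * total_samples);
}
```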

Added subscriber: @Roggii-4
Contributor

Added subscriber: @Raimund58

This is a very interesting comparison table. The initial thought I had was: is it worth the code complexity and maintenance cost when the speedup is so small? From my understanding, bringing back adaptive sampling without losing performance again will be rather hard.

I am not entirely sure about the memory copies. Surely the rebalancing process is rather slow due to all the transfers, but with the proposed patch there are still transfers to merge the two full-frame render results. So the question is: why is the data transfer for merging more efficient than for rebalancing?

Another point I wonder about is: if we assume the GPU is 8 times faster than the CPU as an initial guess (from my understanding, this is the current assumption in the patch), will this help within the current code?

Added subscriber: @makizar

Added subscriber: @Garek

Added subscriber: @SteffenD
Member
Added subscribers: @B4rr3l, @OmarEmaraDev, @Sayak-Biswas, @Funnybob
Philipp Oeser removed the 'Interest: Render & Cycles' label 2023-02-09 14:03:20 +01:00