Cycles CPU + GPU hybrid path tracer #95687

Open
opened 2022-02-10 20:19:45 +01:00 by Sahar A. Kashi · 17 comments
Member

The main idea of the CPU + GPU path tracer is to have two full-size accumulation/render buffers. A unit of work is a single tile (the same size as the frame buffer) with a varying number of samples to render for each scheduled task.

Two accumulation threads, one for the CPU and one for the GPU, run almost independently until the sum of rendered samples reaches the required sample count (the threads only have to synchronize on the number of rendered samples and the offset for random number generation).

On every execution, each processor renders the number of samples best suited to its capabilities. The initial number of samples assigned to the GPU is higher than the number assigned to the CPU. As more rendering information becomes available, the number of samples per device is determined by how many samples it can process within an update interval. Only the GPU thread updates the display while rendering is still in progress, which allows more flexibility in choosing the launch frequency and number of samples on the CPU device. When all samples are rendered, the two buffers are merged to produce the final output (this merge is the main overhead of the method).
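
As a rough illustration (not the actual patch code), here is a minimal C++ sketch of the scheme, assuming a shared atomic sample counter as the single synchronization point between the two threads; `DeviceState`, `render_samples`, and all constants are hypothetical names:

```
// Minimal sketch of the two-thread accumulation scheme described above.
// DeviceState, render_samples and samples_per_launch are illustrative
// names, not the actual patch code or Cycles API.
#include <algorithm>
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

struct DeviceState {
  std::vector<float> accum;  // full-size accumulation buffer, one per device
  int samples_per_launch;    // adapted to the device's measured throughput
};

std::atomic<int> samples_started{0};  // shared between the two threads

// Stub for the per-device kernel launch: render `count` samples into
// dev.accum, starting the RNG at `rng_offset` so the two devices never
// duplicate a sample sequence.
static void render_samples(DeviceState &dev, int rng_offset, int count)
{
  (void)dev; (void)rng_offset; (void)count;  // device kernel launch goes here
}

static void accumulation_thread(DeviceState &dev, int total_samples)
{
  for (;;) {
    // Claim a contiguous sample range; this fetch_add is the only point
    // where the CPU and GPU threads synchronize while rendering.
    const int first = samples_started.fetch_add(dev.samples_per_launch);
    if (first >= total_samples) {
      break;
    }
    const int count = std::min(dev.samples_per_launch, total_samples - first);
    render_samples(dev, /*rng_offset=*/first, count);
  }
}

int main()
{
  DeviceState cpu{std::vector<float>(1920 * 1080 * 4), 1};
  DeviceState gpu{std::vector<float>(1920 * 1080 * 4), 8};
  std::thread tc(accumulation_thread, std::ref(cpu), 256);
  std::thread tg(accumulation_thread, std::ref(gpu), 256);
  tc.join();
  tg.join();
  // The two accumulation buffers would be merged here into the final output.
}
```

Each device repeatedly claims as many samples as it can finish in one launch, so a faster device naturally ends up rendering a larger share of the total.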


This is a quick, preliminary implementation, written without software-design considerations, as a proof of concept to show the benefit of distributing work across two processors with significantly different capabilities. The implementation has limited support: it is configured to work only with a single GPU + a single CPU, and does not support adaptive sampling, tiling, or offline rendering, to name a few.


This version does not introduce new methods for distributing samples among scheduled tasks and is not optimized to its full potential.

Author
Member

Added subscriber: @salipour

#97554 was marked as duplicate of this issue

Added subscriber: @BrianSavery
Member

Changed status from 'Needs Triage' to: 'Confirmed'

Added subscriber: @Michael-Jones

Added subscriber: @brecht

@brecht This is the patch I mentioned from AMD

Added subscribers: @fx, @Sergey

Thanks for the contribution! I briefly discussed this with @Sergey here.

Do you have any performance comparison numbers? Or analysis about where the current implementation is doing poorly in your tests?

From my understanding this does two things:

  • Split work along the samples instead of along the image dimensions. This gives a more even work distribution when parts of the image are more expensive to render than others.
  • Within one batch of samples (which in background render mode may be e.g. 64 samples), devices render 1 or more samples at a time, until all samples in the batch are done. For a sufficient number of samples, dynamically distributing like this is better than statically assigning the amount of work to each device in advance, since we don't have a good (initial) estimate of how fast each device is.

The first part is about solving the same problem as [D14014: Cycles multi-GPU distribution using slices [WIP]](https://archive.blender.org/developer/D14014) by @fx, but in a less fine-grained way. Splitting along samples can work for batch rendering with many samples at once, but not for interactive viewport rendering with just 1 or a few samples. If we have a good slicing implementation, I think that would be preferable over splitting along samples. It allows for more fine-grained distribution, as the number of pixels in a batch is much higher than the number of samples. Another advantage of splitting along pixels is that it lowers memory usage and memory traffic between devices.

For the dynamic distribution, we could imagine this working with slicing too. Merging buffers then is not adding samples but copying slices to their appropriate place in the full buffer. For offline rendering I think this approach makes a lot of sense. I guess one of the reasons we didn't do it was because we were looking for a single implementation to handle both interactive and offline rendering, but it's unclear if that's possible. I imagine there can be an implementation that's not adding too much maintenance cost, where we handle both dynamic and static distribution.
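
For concreteness, here is a hedged sketch of the two merge strategies (illustrative names, assuming accumulation buffers store per-pixel sample sums): with sample splitting, the two full-frame buffers merge by per-pixel addition, while with slicing, each device's band is only copied into place:

```
#include <algorithm>
#include <cstddef>
#include <vector>

// Sample splitting: both devices rendered the full frame; accumulation
// buffers hold sums of samples, so merging is a per-pixel addition.
void merge_by_samples(std::vector<float> &dst, const std::vector<float> &src)
{
  for (std::size_t i = 0; i < dst.size(); i++) {
    dst[i] += src[i];
  }
}

// Slicing: each device rendered a horizontal band; merging only copies
// that band, starting at row `row_begin`, into the full buffer
// (`row_stride` floats per row).
void merge_slice(std::vector<float> &full, const std::vector<float> &slice,
                 std::size_t row_begin, std::size_t row_stride)
{
  std::copy(slice.begin(), slice.end(), full.begin() + row_begin * row_stride);
}
```

The slice copy only moves each device's own pixels, which is the memory-usage and memory-traffic advantage mentioned above.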

However, I'm curious in which way this patch improves things specifically. It seems to assume GPUs are 8x faster than CPUs as the initial estimate, and for sure we could have better heuristics for this regardless of the distribution type. So I imagine with a typical CPU + GPU this already speeds up the first sample. Then for the following samples, I wonder if the dynamic distribution ends up assigning a very different amount of work to each device than the dynamic rebalancing we do now, because of a poor estimate (and lack of slicing)? Or if it's pretty similar and it's the overhead of rebalancing that makes the difference?

Author
Member

The first table below compares the rendering performance of GPU, GPU + CPU (original Blender implementation), and GPU + CPU (submitted patch). The performance-draining parts of the current implementation are mainly the constant rebalancing and the lack of convergence, which result in numerous device-host copies. So the higher the number of region-split updates, the worse the final performance. Initialization also plays a role: a bad initialization of the work distribution can cause poor performance. For example, in CPU + GPU workload distribution, a 50-50 initialization delays convergence, and GPU cycles are possibly wasted before equilibrium is reached. Additionally, the choice of initial sample count can delay convergence: when the initial number of samples is 1, the run doesn't really benefit from GPU processing, and the resulting execution time (in many cases not much better than the CPU time) informs the splitting/balancing decision even though it is not representative of the GPU's power.

![image.png](https://archive.blender.org/developer/F12863265/image.png)

The submitted patch doesn't employ any tweaking, but there are a number of areas that could be tweaked to improve performance, such as sample initialization, or assigning work only to the GPU after a certain percentage of samples has been rendered. For example, just a small change to ignore the CPU after 95% of samples are rendered results in the following:

![Tweak.JPG](https://archive.blender.org/developer/F12863278/Tweak.JPG)
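
As a sketch of the tweakable heuristics discussed in this thread (the 8:1 initial GPU:CPU estimate, throughput-based batch sizing per update interval, and the 95% CPU cutoff), with all names and constants being assumptions for illustration rather than the patch's actual code:

```
#include <algorithm>

// Per-device batch size; names and constants are illustrative assumptions.
struct DeviceSchedule {
  int samples_per_launch = 1;
};

// Start from an 8:1 GPU:CPU estimate, as discussed above.
DeviceSchedule cpu_sched{1};
DeviceSchedule gpu_sched{8};

// After each launch, resize the next batch to what the device can finish
// within one display-update interval: samples/second * interval.
int next_batch(int rendered, double seconds, double update_interval)
{
  return std::max(1, static_cast<int>(rendered / seconds * update_interval));
}

// Tweak from the table above: stop scheduling CPU work once 95% of the
// samples are done, so the final merge never waits on one last slow
// CPU batch.
bool schedule_cpu(int samples_done, int total_samples)
{
  return samples_done < static_cast<int>(0.95 * total_samples);
}
```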

Added subscriber: @Roggii-4
Contributor

Added subscriber: @Raimund58

This is a very interesting comparison table. The initial thought I had was: is it worth the code complexity and maintenance cost when the speedup is so small? From my understanding, bringing back adaptive sampling without losing performance again will be rather hard.

I am not entirely sure about the memory copies. Surely the rebalancing process is rather slow due to all the transfers, but with the proposed patch there are still transfers to merge the two full-frame render results. So the question is: why is the data transfer for merging more efficient than for rebalancing?

Another point I wonder about is: if we assume the GPU is 8 times faster than the CPU as an initial guess (from my understanding, this is the current assumption in the patch), will this help within the current code?

Added subscriber: @makizar

Added subscriber: @Garek

Added subscriber: @SteffenD
Member
Added subscribers: @B4rr3l, @OmarEmaraDev, @Sayak-Biswas, @Funnybob
Philipp Oeser removed the 'Interest: Render & Cycles' label 2023-02-09 14:03:20 +01:00