Cycles split kernel optimizations #82583


Brecht Van Lommel · 2020-11-10T14:45:56+01:00

Brecht Van Lommel commented

2020-11-10 14:45:56 +01:00

This task intends to gather ideas to optimize the Cycles split kernel.

There are a few major things to work on:

- Reorganize split kernels so that shader evaluation and ray-tracing are not duplicated in multiple kernels, or not duplicated as many times. This could reduce kernel compile times and register pressure, and improve coherence. Alternatively, shader evaluation for background, shadows, and volumes could be specialized so that it does not include nodes not used in any such shader (for example, volumes don't need BSDFs).
- This would likely involve some rethinking of how we handle transparent shadows, and maybe light and background shaders.
- Replace the use of queues and atomics for scheduling work with sorting between split kernels, as done by some other renderers.
- Reduce the size of the state that needs to remain in memory between split kernels. The number of rays that can be active is limited by this. It can result in low occupancy or not enough memory left for scene data.
  - Render passes could be written directly to memory to make `PathRadiance` much smaller (#72293).
  - `ShaderData`: there are ways to reduce the size of shader globals, such as computing some members on demand rather than storing them, compression (lossless or lossy), and simpler differentials.
  - It may be possible to structure the kernels so that only one closure needs to be stored at a time, or for a shorter time. However, this may come with some noise trade-offs.
- If shader evaluation is isolated to one kernel, or to fewer kernels, ray sorting by material ID can improve coherence. Similar sorting may help other split kernels.
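
The idea of sorting rays by material ID could be sketched as a counting sort over the active paths. This is a minimal CPU-side sketch under assumptions, not Cycles code: `sort_paths_by_shader`, `shader_ids`, and `num_shaders` are hypothetical names, and a GPU version would presumably replace the three loops with a parallel histogram, prefix sum, and scatter.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not a Cycles function.
// shader_ids[i] is the shader (material) ID hit by path i; every ID must be
// smaller than num_shaders. Returns the path indices ordered by shader ID.
std::vector<uint32_t> sort_paths_by_shader(const std::vector<uint32_t> &shader_ids,
                                           uint32_t num_shaders)
{
  // Pass 1: count how many paths use each shader.
  std::vector<uint32_t> offsets(num_shaders + 1, 0);
  for (uint32_t id : shader_ids) {
    offsets[id + 1]++;
  }

  // Pass 2: prefix sum, so offsets[s] becomes the first output slot for shader s.
  for (uint32_t s = 0; s < num_shaders; s++) {
    offsets[s + 1] += offsets[s];
  }

  // Pass 3: scatter path indices into their slots. Paths that share a shader
  // keep their original relative order, so the sort is stable.
  std::vector<uint32_t> sorted(shader_ids.size());
  for (std::size_t path = 0; path < shader_ids.size(); path++) {
    sorted[offsets[shader_ids[path]]++] = static_cast<uint32_t>(path);
  }
  return sorted;
}
```

For shader IDs `{2, 0, 1, 0, 2}` the returned order is `{1, 3, 2, 0, 4}`, so a shading kernel that walks this order evaluates all paths using shader 0 first, then shader 1, then shader 2.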

I would consider removing branched path tracing for GPU rendering entirely (#52725), since this is not particularly suitable for GPUs and complicates the code. It would be easier to refactor without this.


Brecht Van Lommel commented

2020-11-10 14:45:56 +01:00

Changed status from 'Needs Triage' to: 'Confirmed'

Brecht Van Lommel commented

2020-11-10 14:45:56 +01:00

Added subscribers: @brecht, @BrianSavery

Jeroen Bakker commented

2020-11-11 10:06:50 +01:00

Added subscriber: @Jeroen-Bakker

Jeroen Bakker commented

2020-11-11 10:19:37 +01:00

OpenCL 2.0 introduced pipes: a means of communicating between concurrently running kernels.
The benefit is that a pipe has a fixed size in global memory and is more likely to stay in hardware caches. Pipes could also be used to fine-tune performance by changing their size.

It would be worth researching this as an alternative to queueing and sorting.
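
To make the fixed-size behaviour concrete, here is a host-side analogy in C++. It is not the OpenCL pipe API and it is single-threaded; `FixedPipe` and its methods are hypothetical names. It only illustrates that capacity is chosen up front, and that a write to a full pipe or a read from an empty pipe fails rather than growing the buffer.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical single-threaded stand-in for a fixed-capacity pipe.
// Not the OpenCL API: it only models the bounded FIFO behaviour.
template<typename Packet> class FixedPipe {
 public:
  explicit FixedPipe(std::size_t capacity) : slots_(capacity), head_(0), count_(0) {}

  // Returns false when the pipe is full, so the producer has to retry or stall.
  bool write(const Packet &packet)
  {
    if (count_ == slots_.size()) {
      return false;
    }
    slots_[(head_ + count_) % slots_.size()] = packet;
    count_++;
    return true;
  }

  // Returns false when the pipe is empty, so the consumer has nothing to do yet.
  bool read(Packet &packet)
  {
    if (count_ == 0) {
      return false;
    }
    packet = slots_[head_];
    head_ = (head_ + 1) % slots_.size();
    count_--;
    return true;
  }

 private:
  std::vector<Packet> slots_;  // Storage allocated once; never resized.
  std::size_t head_;           // Index of the oldest packet.
  std::size_t count_;          // Number of packets currently stored.
};
```

With a small capacity the producer stalls more often but the storage is more likely to stay cache-resident; a larger capacity trades that the other way. Where the sweet spot lies would have to be measured.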

Shader globals could also be split into multiple smaller variants: one optimized for intersection, another for shading, another for integration, etc.
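
As a rough illustration of splitting the state into per-stage variants, the sketch below contrasts one monolithic per-path struct with three smaller ones. All struct and field names are hypothetical and much simpler than the real Cycles `ShaderData`; the point is only that each kernel would then read and write the fields it actually needs.

```cpp
#include <cstdint>

struct Float3 {
  float x, y, z;
};

// Hypothetical monolithic per-path state: every kernel sees every field.
struct MonolithicState {
  Float3 ray_origin, ray_direction;
  float ray_t;
  Float3 position, normal;
  float u, v;
  uint32_t shader_id;
  Float3 throughput, radiance;
};

// Hypothetical variant for the intersection kernel: ray data only.
struct IntersectState {
  Float3 ray_origin, ray_direction;
  float ray_t;
};

// Hypothetical variant for the shading kernel: hit point and shader lookup.
struct ShadeState {
  Float3 position, normal;
  float u, v;
  uint32_t shader_id;
};

// Hypothetical variant for the integration kernel: accumulated path results.
struct IntegrateState {
  Float3 throughput, radiance;
};

// Each variant is smaller than the monolithic state it was carved out of.
static_assert(sizeof(IntersectState) < sizeof(MonolithicState), "variant should be smaller");
static_assert(sizeof(ShadeState) < sizeof(MonolithicState), "variant should be smaller");
static_assert(sizeof(IntegrateState) < sizeof(MonolithicState), "variant should be smaller");
```

Whether this also lowers total memory use depends on which variants must stay resident between kernels, which this sketch does not decide.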


Alaska commented

2021-03-18 01:35:08 +01:00

Added subscriber: @Alaska

Brecht Van Lommel commented

2021-05-04 15:20:03 +02:00

Changed status from 'Confirmed' to: 'Archived'

Brecht Van Lommel closed this issue

2021-05-04 15:20:03 +02:00

Brecht Van Lommel commented

2021-05-04 15:20:03 +02:00

Task superseded by #87836 (Cycles: GPU Performance).
