Cycles: GPU Performance #87836

Open · opened 2021-04-26 17:14:32 +02:00 by Brecht Van Lommel · 13 comments

**Memory usage**

- [x] Auto detect a good integrator state size depending on the GPU (hardcoded to 1 million now)
- [ ] Reduce size of IntegratorState
  - [x] Don't store float3 data padded to float4
  - [x] Don't allocate memory for unused features (volumes, SSS, denoising, light passes)
  - [x] Dynamic volume stack size depending on scene contents ([D12925](https://archive.blender.org/developer/D12925))
  - [ ] Overlap SSS parameters with other memory
  - [ ] Compress some floats as half float
  - [ ] Reduce bits for integers where possible
  - [ ] Can diffuse/glossy/transmission bounces be limited to 255?
- [ ] SoA
  - [ ] Individual arrays for XYZ components of float3?
  - [ ] Pack 8-bit/16-bit values together into 32 bits?
- [ ] Reduce size of ShaderData
  - [ ] Compute some differentials on the fly?
  - [ ] Read some data directly from ray/intersection?
  - [x] Don't copy the matrix if motion blur is disabled
- [ ] Reduce kernel local memory
  - [ ] Dynamically allocate closures based on used shaders
  - [ ] Dynamically allocate the SVM stack based on used shaders
  - [ ] Check if the SVM stack is allocated multiple times
  - [ ] Check on deduplicating ShaderData instances
  - [ ] Check for other unknown sources of memory usage
  - [x] Use either `shade_surface` or `shade_surface_raytrace` for reserving memory

**Kernel Size**

- [x] Replace megakernel used for the last few samples? Only helps about 10% with viewport render, barely with batch render, but makes OptiX runtime compilation slow.
- [x] Make `svm_eval_nodes` a templated function and specialize it for
  - [x] Background
  - [x] Lights
  - [x] Shadows
  - [x] Shader ray-tracing
  - [x] Volumes
- [x] Avoid shader ray-tracing for the AO pass ([D12900](https://archive.blender.org/developer/D12900))
- [x] Verify if more specialization is possible in `svm_eval_nodes` (seems not)
- [ ] Deduplicate the shader evaluation call in `shade_background`

**Scheduling**

- [x] Make shadow paths fully independent so they do not block main path kernels ([D12889](https://archive.blender.org/developer/D12889))
- [x] Accurately count the number of available paths for scheduling additional work tiles
- [x] Compact shadow paths similar to main paths ([D12944](https://archive.blender.org/developer/D12944))
- [x] Consider adding scrambling distance in advanced options ([D12318](https://archive.blender.org/developer/D12318))
- [x] Compact path states for coherence
- [ ] Tweak tile size for better coherence
- [ ] Tweak work tiles or pixel order to improve coherence (many small tiles, Z-order, ..)
- [ ] Try other shader sorting techniques (but they will be more expensive than bucket sort)
  - [ ] Take into account object ID
  - [ ] Take into account the random number for BSDF/BSSRDF selection?
- [ ] Overlapped kernel execution
  - [ ] Use multiple independent GPU queues? (so far found to be 15% slower)
  - [ ] Use multiple GPU queues to schedule different kernels?
- [ ] Optimize active index, sorting and prefix sum kernels
  - [ ] Parallelize `prefix_sum` (see the scan sketch after this list)
  - [ ] Build the active/sorted index array for a specific kernel based on another array indicating active paths for all kernels, especially when the number of paths is small
- [ ] Try pushing (part of) the integrator state into queues rather than a persistent location, for more coherent memory access
  - [ ] Check the potential performance benefit by coalescing state
  - [ ] Shadow paths
  - [ ] Main path
- [ ] Cancelling renders and updates can be slow due to the occupancy heuristic that schedules more samples. Find a way to reduce this problem.
- [ ] Shader sorting and locality balancing ([D15331](https://archive.blender.org/developer/D15331))
  - [ ] Find better heuristics for scenes with many shaders
  - [x] Use for GPU devices other than Metal
- [ ] Shader sorting by node graph similarity, to find coherence if e.g. the graph only has different image textures
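On the `prefix_sum` item above: the standard way to parallelize a scan on the GPU is a shared-memory sweep. Below is a minimal single-block Hillis-Steele exclusive scan in CUDA, as an illustration of the approach only; the actual Cycles `prefix_sum` kernel differs and would also need a second pass for arrays larger than one block.

```cuda
// Illustrative single-block exclusive scan (Hillis-Steele).
// Assumes blockDim.x is a power of two and n <= blockDim.x.
__global__ void exclusive_scan_block(const int *in, int *out, int n)
{
  extern __shared__ int temp[];
  const int tid = threadIdx.x;

  // Shift the input right by one element so the inclusive scan below
  // yields an exclusive result for the original array.
  temp[tid] = (tid > 0 && tid <= n) ? in[tid - 1] : 0;
  __syncthreads();

  // At each step, every element adds the value `offset` positions back.
  for (int offset = 1; offset < blockDim.x; offset *= 2) {
    const int v = (tid >= offset) ? temp[tid - offset] : 0;
    __syncthreads();
    temp[tid] += v;
    __syncthreads();
  }

  if (tid < n)
    out[tid] = temp[tid];
}
```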

**Display**

- [x] For final render, let Cycles draw the image instead of copying pixels to Blender
  - [x] Render pass support

**Render Algorithms**

- [x] Use the native OptiX curve primitive for thick hair
- [x] Transparent shadows: can we terminate OptiX rays earlier when enough hits are found? ([D12524](https://archive.blender.org/developer/D12524))
- [x] Transparent shadows: tune max hits for performance / memory usage
- [ ] Detect constant transparent shadows for triangles and avoid recording the intersection and evaluating the shader entirely
- [ ] Detect transparent shadows that are purely an image texture lookup and perform it in the hit kernel
- [ ] For volume stack init, implement a volume all-intersection that writes directly into the stack

**Tuning**

- [ ] Automatically increase the integrator state size depending on available memory
- [ ] Tweak kernel compilation parameters (num threads per block, max registers)
  - [ ] Different parameters per kernel?
Comment by Brecht Van Lommel (Author, Owner):

In a simple queue-based system with a fixed queue size for each kernel, memory usage increases as the number of kernels goes up, and much of the queue memory remains unused. However, it does have the benefit that memory reads/writes may be faster due to coalescing.

Using more paths and the associated memory can significantly improve performance, though. The question is whether there is an implementation that gets us both coalescing and little memory waste. Some brainstorming follows.
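For reference, the fixed-queue scheme described above amounts to one atomic append per work item. A minimal sketch in CUDA, with illustrative names rather than the actual Cycles data structures:

```cuda
// One fixed-capacity queue per kernel: an array of path indices plus a
// device-side counter. Each kernel appends the paths that need the next
// kernel; the consuming kernel then reads its items in index order.
struct KernelQueue {
  int *indices;    // preallocated to the maximum number of paths
  int *num_items;  // device counter, reset to zero before each launch
};

__device__ void queue_push(KernelQueue &q, int path_index)
{
  // Threads that push together receive adjacent slots, so the
  // consumer's loads over `indices` coalesce well.
  const int slot = atomicAdd(q.num_items, 1);
  q.indices[slot] = path_index;
}
```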

**Filling Gaps**

The current initialization of main paths works by filling gaps in the state array. Before `init_from_camera` is called, an array is constructed with unused path indices, which is then filled in.

The same mechanism could be extended to the shadow queue, constructing an array of empty entries and using that in `shade_surface`, rather than forcing the shadow queue to be emptied before `shade_surface` can be executed.

This still leaves gaps until additional paths can be scheduled, and the fragmentation may cause incoherent paths to be grouped together.
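A minimal sketch of that gap-gathering step, with hypothetical kernel and flag names (the actual Cycles kernels differ):

```cuda
// Hypothetical names, sketching the gap-filling step described above:
// scan the state array for unused slots and record their indices, so
// the initialization kernel can write one new path per recorded slot.
__global__ void gather_unused_paths(const int *path_active,  // 1 = slot in use
                                    int num_states,
                                    int *unused_indices,
                                    int *num_unused)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= num_states || path_active[i])
    return;
  const int slot = atomicAdd(num_unused, 1);
  unused_indices[slot] = i;
}
```

A subsequent `init_from_camera` launch with one thread per new path would then use `unused_indices[thread_id]` as the state slot to fill.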

**Compaction**

Rather than trying to write densely packed arrays, we could compact the kernel state to remove any empty gaps, using a dedicated kernel. This kernel causes additional memory reads and writes, which are hoped to pay off when multiple subsequent kernels can read memory more efficiently. It's not obvious that this is a win. In many cases there may only be 1 or 2 kernel executions before the next path iteration, and the total cost of memory access may increase.

We can do an experiment with a slow coalescing kernel before every kernel execution, to see how much kernel execution is impacted by non-coalesced memory access.
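Concretely, such a compaction kernel would be standard stream compaction: an exclusive prefix sum over the active flags gives each surviving state its dense destination, followed by one scatter pass. A sketch under those assumptions, showing a single state array (the real IntegratorState is a struct of arrays, so the copy would repeat per array):

```cuda
// `scan` is the exclusive prefix sum of `active`, so every surviving
// state knows its dense destination slot; one scatter removes all gaps.
__global__ void compact_states(const int *active,      // 0/1 per state slot
                               const int *scan,        // exclusive prefix sum of `active`
                               const float *state_in,  // one state array (illustrative)
                               float *state_out,
                               int num_states)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= num_states || !active[i])
    return;
  state_out[scan[i]] = state_in[i];
}
```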

**Ring Buffers**

In simple cases where we ping-pong between two kernels like `intersect_shadow` and `shade_shadow`, one idea is to share a single queue and fill consecutive empty items. Treated as a ring buffer, this can be made to work for two kernels.

In the single-threaded case this is straightforward; with many threads, however, the synchronization needed to avoid overwriting items from the input queue is not so obvious. In practice we'd likely need to allocate additional memory based on the number of blocks that are executed in parallel, and that can be significant.
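To make the indexing concrete, a sketch of the shared ring buffer with hypothetical names; the synchronization problem just described (the producer must never overwrite entries the consumer has not read while many blocks are in flight) is deliberately left open:

```cuda
// One queue shared by two ping-ponging kernels. `head` and `tail` are
// monotonically increasing device counters taken modulo the capacity.
struct RingQueue {
  int *items;
  int capacity;      // must cover worst-case in-flight items, see text
  int *head, *tail;  // device counters
};

__device__ void ring_push(RingQueue &q, int item)
{
  const int slot = atomicAdd(q.tail, 1);
  q.items[slot % q.capacity] = item;
}

// The consumer is launched for the slot range [head, tail) captured
// between kernel launches, after which head is advanced past what was
// read. Reusing slots safely while producer blocks are still running
// is exactly the open synchronization problem noted above.
```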

**Chunks**

An idea to avoid that would be to use a chunked allocator for queues, with the size of the chunks being equal to the block size of kernels.

When a kernel is about to write out queue items, it would allocate an additional chunk for the queue if the current chunk does not have enough space for all items. Queue items would then be written into one or two chunks. When executing a kernel, memory reads would be coalesced, since each chunk matches the block size for that kernel and contains all the queue items in order. Sorting by shader would break coalescing for the `shade_surface` kernel though, and a separate queue per shader would likely waste too much memory.

Significant memory could still be unused, for two reasons:

* For input queue items to not overwrite output queue items, a simple implementation may need to double the memory. Chunks could become available as soon as a block finishes executing; however, many blocks execute in parallel, so the practical memory savings are not obvious.
* Up to block size × number of kernels would need to be allocated extra, to cover memory lost to incompletely filled chunks.
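A sketch of what the chunk allocation could look like, with all names hypothetical and the sizing caveats above still applying:

```cuda
// Chunks are block-sized, so a consuming block reads exactly one chunk
// with fully coalesced, in-order accesses. A producing block that runs
// out of space in its current chunk pops a fresh one from a global
// free stack; freeing a drained chunk is the reverse push.
#define CHUNK_SIZE 256  // chosen equal to the kernel block size

struct ChunkedQueue {
  int *chunk_data;   // num_chunks * CHUNK_SIZE queue items
  int *free_chunks;  // stack of available chunk indices
  int *free_top;     // number of entries on the free stack
};

__device__ int chunk_alloc(ChunkedQueue &q)
{
  // Pop one chunk off the free stack (assumes it is never empty, which
  // the extra per-kernel headroom discussed above would have to ensure).
  const int top = atomicSub(q.free_top, 1) - 1;
  return q.free_chunks[top];
}
```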
Comment by Brecht Van Lommel (Author, Owner):

The megakernel was a clear win with CUDA when we added it to speed up the last 2% of paths. But benchmarking this with OptiX now, it's a mixed bag:

```
                              no-megakernel    cycles-x
bmw27.blend                   10.0372          10.0984
classroom.blend               13.3341          13.3774
pabellon.blend                7.34786          7.3563
monster.blend                 8.51194          9.12812
barbershop_interior.blend     9.55334          9.77131
junkshop.blend                12.8547          13.2737
pvt_flat.blend                14.8002          13.0439
```

A possible explanation is that `pvt_flat.blend` uses adaptive sampling, where batch sizes are smaller and the megakernel helps more. Also, this does not benchmark viewport rendering, where the megakernel helps more.

There's still room for optimization when we have few paths active, without using a megakernel. Given these numbers, that seems worth looking into.

Comment by Brecht Van Lommel (Author, Owner):

Current progress on trying to eliminate the megakernel is in [P2111](https://archive.blender.org/developer/P2111.txt), still not where I want it to be.

Compacting the state array seems not all that helpful.

What I did notice while working on that is that in the `pvt_flat` scene, the number of active paths often drops to a low number but is not refilled quickly. Reducing the tile size to avoid that not only avoids the performance regression, but actually speeds up the rendering. However, this slows down other scenes.

```
                              compact+tilesize/4    tilesize/4    compact    no-megakernel    megakernel
bmw27.blend                   8.34171               8.07403       7.54844    7.5659           7.71559
classroom.blend               11.2431               10.9255       10.6584    10.6141          10.8755
pabellon.blend                5.91434               5.96131       5.70281    5.73752          5.87296
monster.blend                 6.65674               6.52214       6.94078    7.00113          8.08601
barbershop_interior.blend     8.43154               7.90963       8.20633    8.16851          8.75592
junkshop.blend                10.2858               10.4201       10.5334    10.5217          11.1836
pvt_flat.blend                10.116                10.2103       12.115     12.4971          11.0911
```

There must be something that can be done to get closer to the best of both.

Looking at the kernel execution times of `bmw27`, it's clear that optimizing `init_from_camera` for multiple work tiles would help, but it's only part of the performance gap. There's something else going on here that is harder to pin down.

```
compact+tilesize/4
6.71538s: integrator_shade_surface integrator_sorted_paths_array prefix_sum
1.47519s: integrator_intersect_closest integrator_queued_paths_array
0.53159s: integrator_intersect_shadow integrator_queued_shadow_paths_array
0.38923s: integrator_shade_shadow integrator_queued_shadow_paths_array
0.32022s: integrator_init_from_camera integrator_terminated_paths_array
0.16891s: integrator_shade_background integrator_queued_paths_array

no-megakernel
6.16877s: integrator_shade_surface integrator_sorted_paths_array prefix_sum
1.17981s: integrator_intersect_closest integrator_queued_paths_array
0.41735s: integrator_shade_shadow integrator_queued_shadow_paths_array
0.36235s: integrator_intersect_shadow integrator_queued_shadow_paths_array
0.21522s: integrator_shade_background integrator_queued_paths_array
0.17480s: integrator_init_from_camera integrator_terminated_paths_array
```
Comment by Brecht Van Lommel (Author, Owner):

Note about differentials: PBRT-v4 is not even passing differentials along with rays, but simply computing them using the camera information. This gives incorrect results through reflections and refractions, but may be close enough in practice.
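A hedged sketch of that camera-only approximation (illustrative names, not PBRT-v4's or Cycles' actual code): re-derive dP/dx and dP/dy at each hit by intersecting the one-pixel-offset camera rays with the tangent plane through the hit point.

```cuda
#include <cuda_runtime.h>

// Minimal float3 helpers (CUDA's float3 has no built-in operators).
__device__ float3 f3_sub(float3 a, float3 b) { return make_float3(a.x - b.x, a.y - b.y, a.z - b.z); }
__device__ float3 f3_add(float3 a, float3 b) { return make_float3(a.x + b.x, a.y + b.y, a.z + b.z); }
__device__ float3 f3_scale(float3 a, float s) { return make_float3(a.x * s, a.y * s, a.z * s); }
__device__ float  f3_dot(float3 a, float3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Intersect a ray with the plane through point p with normal n.
__device__ float3 plane_hit(float3 ray_o, float3 ray_d, float3 p, float3 n)
{
  const float t = f3_dot(f3_sub(p, ray_o), n) / f3_dot(ray_d, n);
  return f3_add(ray_o, f3_scale(ray_d, t));
}

// Approximate dP/dx and dP/dy at a hit point from camera data alone,
// ignoring reflections/refractions, as noted above.
__device__ void camera_differentials(float3 cam_p,         // camera position
                                     float3 ray_d,         // camera ray direction for this pixel
                                     float3 dddx,          // direction delta for +1 pixel in x
                                     float3 dddy,          // direction delta for +1 pixel in y
                                     float3 p, float3 n,   // hit point and shading normal
                                     float3 *dpdx, float3 *dpdy)
{
  *dpdx = f3_sub(plane_hit(cam_p, f3_add(ray_d, dddx), p, n), p);
  *dpdy = f3_sub(plane_hit(cam_p, f3_add(ray_d, dddy), p, n), p);
}
```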


This issue was referenced by blender/cycles@5db8d93df3800ed9e90651ca305611c0612e606d

This issue was referenced by 08031197250aeecbaca3803254e6f25b8c7b7b37

This issue was referenced by blender/cycles@579813f4f8ed6c067d14d02c1d69dd37b69f2c45

This issue was referenced by 4d4113adc2623c50888b63eaca3a055d8cdf3045

Comment by Brecht Van Lommel (Author, Owner):

Trying to figure out which parts of the `shade_surface_raytrace` kernel use the most local memory:

```
zero max closures:                -64%
zero SVM stack size:              -11%
remove svm_eval_nodes:            -32%
remove direct light + AO pass:    -8%
remove voronoi node:              -1%
remove all texture nodes:         -2%
remove all closure nodes:         -2%
remove bevel + AO nodes:          -14%
remove all nodes but one:         -20%
```

So roughly:

* 65% closures
* 15% bevel + AO nodes
* 10% SVM stack size
* 5% other nodes
* 5% other (including shader data)

This issue was referenced by blender/cycles@6f7dd81db8f0bd09e033f34d03366711575e5c44

This issue was referenced by 001f548227c413a4fdbee275744ea8bea886081a

This issue was referenced by blender/cycles@9026179d2e43d0f5f7942d3b28776fb249d84117

This issue was referenced by 0c52eed863522b4dbac2704e8c88b5c009e886e7
Brecht Van Lommel changed title from Cycles X - GPU Performance to Cycles: GPU Performance 2021-10-28 14:54:21 +02:00
Brecht Van Lommel added this to the Module: Render & Cycles project 2023-02-07 19:08:06 +01:00
Philipp Oeser removed the "Interest: Render & Cycles" label 2023-02-09 14:02:28 +01:00
Alaska removed the "Status: Confirmed" label 2024-06-21 04:25:41 +02:00