Cycles: GPU Performance #87836
Open
opened 2021-04-26 17:14:32 +02:00 by Brecht Van Lommel
·
56 comments
Reference: blender/blender#87836
Memory usage
Auto detect good integrator state size depending on GPU (hardcoded to 1 million now)
Reduce size of IntegratorState
SoA
Reduce size of ShaderData
Reduce kernel local memory
shade_surface or shade_surface_raytrace for reserving memory
Kernel Size
svm_eval_nodes a templated function and specialize it for
svm_eval_nodes (seems not)
shade_background
Scheduling
Display
Render Algorithms
Tuning
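As a rough illustration of the SoA (structure of arrays) item under Memory usage above, the sketch below compares byte offsets for one field in an array-of-structures layout and in a structure-of-arrays layout. The field names and sizes are hypothetical and are not the actual IntegratorState layout. The point is only that consecutive paths are one whole struct apart in AoS but one field apart in SoA, which is what lets neighbouring GPU threads read neighbouring addresses.

```python
# Hypothetical per-path fields and their sizes in bytes (not the real IntegratorState).
FIELD_SIZES = {"ray_origin": 12, "ray_direction": 12, "throughput": 12, "bounce": 4}


def aos_offset(path_index, field):
    """Byte offset of a field when each path stores all of its fields together."""
    struct_size = sum(FIELD_SIZES.values())
    field_offset = 0
    for name, size in FIELD_SIZES.items():
        if name == field:
            break
        field_offset += size
    return path_index * struct_size + field_offset


def soa_offset(path_index, field, num_paths):
    """Byte offset of a field when each field has its own array over all paths."""
    array_base = 0
    for name, size in FIELD_SIZES.items():
        if name == field:
            break
        array_base += size * num_paths
    return array_base + path_index * FIELD_SIZES[field]


# Distance in bytes between the "bounce" field of path 0 and path 1.
aos_stride = aos_offset(1, "bounce") - aos_offset(0, "bounce")
soa_stride = soa_offset(1, "bounce", 1024) - soa_offset(0, "bounce", 1024)
```

With these assumed sizes the AoS stride equals the full struct size, while the SoA stride equals the size of the single field.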
Changed status from 'Needs Triage' to: 'Confirmed'
In a simple queue-based system with a fixed queue size for each kernel, memory usage increases as the number of kernels goes up, and much of the queue memory remains unused. However, it does have the benefit that memory reads/writes may be faster due to coalescing.
Using more paths and associated memory can significantly improve performance though. The question is if there is an implementation that can get us both coalescing and little memory waste. Some brainstorming.
Filling Gaps
The current initialization of main paths works by filling gaps in the state array. Before init_from_camera is called, an array is constructed with unused path indices which is then filled in.
The same mechanism could be extended to the shadow queue, constructing an array of empty entries and using that in shade_surface, rather than forcing the shadow queue to be emptied before shade_surface can be executed.
This still leaves gaps until additional paths can be scheduled, and the fragmentation may cause incoherent paths to be grouped together.
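The gap-filling step described above can be sketched on the CPU as follows. This is a minimal model, not the Cycles implementation: the state array is reduced to a list of booleans, and the function names unused_path_indices and init_paths are made up for illustration.

```python
def unused_path_indices(active):
    """Collect the indices of state-array slots that hold no active path."""
    return [index for index, is_active in enumerate(active) if not is_active]


def init_paths(active, unused, num_new_paths):
    """Start up to num_new_paths new paths in the unused slots.

    Returns the slots that were filled, in the order they were assigned.
    """
    filled = unused[:num_new_paths]
    for index in filled:
        active[index] = True
    return filled


# Hypothetical state array with three free slots.
active = [True, False, True, False, False, True]
unused = unused_path_indices(active)
filled = init_paths(active, unused, 2)
```

Because new paths land wherever a slot happens to be free, neighbouring slots can end up holding unrelated paths, which is the fragmentation concern raised above.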
Compaction
Rather than trying to write densely packed arrays, we could compact the kernel state to remove any empty gaps, using a dedicated kernel. This kernel causes additional memory reads and writes, which are hoped to pay off when multiple subsequent kernels can read memory more efficiently. It's not obvious that this is a win. In many cases there may only be 1 or 2 kernel executions before the next path iteration, and the total cost of memory access may increase.
We can do an experiment with a slow coalescing kernel before every kernel execution, to see how much kernel execution is impacted by non-coalesced memory access.
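A minimal single-threaded sketch of such a compaction pass is shown below. It assumes empty slots are marked with None and that the order of states does not matter; a GPU version would need a parallel way to pair gaps with states, which is not shown here.

```python
def compact_states(states):
    """Move active states from the tail into gaps at the front, in place.

    Empty slots are None. Returns the number of active states, which end up
    in states[0:num_active]. The relative order of states is not preserved.
    """
    num_active = sum(1 for state in states if state is not None)
    # Gaps inside the dense region, and active states outside of it.
    gaps = [i for i in range(num_active) if states[i] is None]
    movers = [i for i in range(num_active, len(states)) if states[i] is not None]
    # Both lists have the same length: every gap below num_active is matched
    # by exactly one active state at or above num_active.
    for dst, src in zip(gaps, movers):
        states[dst] = states[src]
        states[src] = None
    return num_active


states = ["a", None, "b", None, None, "c", "d"]
num_active = compact_states(states)
```

Only the states beyond the dense region are moved, so the extra reads and writes scale with the number of gaps rather than with the size of the array.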
Ring Buffers
In simple cases where we ping-pong between two kernels like intersect_shadow and shade_shadow, an idea would be to share a single queue and fill consecutive empty items. With the idea of a ring buffer this can be made to work for two kernels.
In the single-threaded case this is straightforward; however, synchronization to avoid overwriting items from the input queue is not so obvious with many threads. In practice we'd likely need to allocate additional memory based on the number of blocks that are executed in parallel, and that can be significant.
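The single-threaded case can be sketched as below. The RingQueue class and run_kernel helper are hypothetical, and the two kernel functions are simple stand-ins rather than the real intersect_shadow and shade_shadow. Each input produces at most one output, so a slot is always freed by the read before the write needs it.

```python
class RingQueue:
    """Fixed-capacity ring buffer shared by a producer and a consumer kernel."""

    def __init__(self, capacity):
        self.items = [None] * capacity
        self.head = 0   # index of the next item to read
        self.count = 0  # number of items currently queued

    def push(self, item):
        if self.count == len(self.items):
            raise OverflowError("ring queue is full")
        tail = (self.head + self.count) % len(self.items)
        self.items[tail] = item
        self.count += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("ring queue is empty")
        item = self.items[self.head]
        self.items[self.head] = None
        self.head = (self.head + 1) % len(self.items)
        self.count -= 1
        return item


def run_kernel(queue, kernel):
    """Run kernel over every item queued at launch; outputs go to the same queue."""
    for _ in range(queue.count):
        result = kernel(queue.pop())
        if result is not None:
            queue.push(result)


shaded = []


def intersect_stand_in(path):
    # Stand-in: pretend odd-numbered shadow paths are occluded and dropped.
    return path if path % 2 == 0 else None


def shade_stand_in(path):
    # Stand-in: record the path and terminate it.
    shaded.append(path)
    return None


queue = RingQueue(4)
for path in range(4):
    queue.push(path)

run_kernel(queue, intersect_stand_in)
survivors = queue.count
run_kernel(queue, shade_stand_in)
```

With many threads, reads and writes no longer alternate one at a time, so a block could write an output over an input that another block has not yet read. That is the synchronization problem mentioned above.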
Chunks
An idea to avoid that would be to use a chunked allocator for queues, with the size of the chunks being equal to the block size of kernels.
When a kernel is about to write out queue items, it would allocate an additional chunk for the queue if the current chunk does not have enough space for all items. Queue items would then be written into 1 or 2 chunks. When executing a kernel, memory reads would be coalesced, since each chunk matches the block size for that kernel and contains all the queue items in order. Sorting by shader would break coalescing for the shade_surface kernel though, and a separate queue per shader would likely waste too much memory.
Significant memory could still be unused, for two reasons:
The megakernel was a clear win with CUDA when we added it to speed up the last 2% of paths. But benchmarking this with OptiX now it's a mixed bag:
A possible explanation is that pvt_flat.blend uses adaptive sampling, where batch sizes are smaller and the megakernel helps more. Also this does not benchmark viewport rendering, where the megakernel helps more.
There's still room for optimization when we have few paths active, without using a megakernel. Given these numbers, that seems worth looking into.
Current progress on trying to eliminate the megakernel is in P2111, still not where I want it to be.
Compacting the state array seems not all that helpful.
What I did notice while working on that is that in the pvt_flat scene, the number of active paths often drops to a low number but is not refilled quickly. Reducing the tile size to avoid that not only avoids the performance regression, but actually speeds up the rendering. However, this slows down other scenes.
There must be something that can be done to get closer to the best of both.
Looking at the kernel execution times of bmw27, it's clear that optimizing init_from_camera for multiple work tiles would help, but it's only a part of the performance gap. There's something else going on here that is harder to pin down.
Note about differentials: PBRT-v4 is not even passing differentials along with rays, but simply computing them using the camera information. This gives incorrect results through reflections and refractions, but may be close enough in practice.
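To make the idea concrete, here is a deliberately simplified sketch of estimating a pixel footprint from camera information alone, instead of carrying differentials along with each ray. It assumes a pinhole camera and a surface facing the camera, and the function name pixel_footprint is made up. It is not PBRT-v4's actual approximation.

```python
import math


def pixel_footprint(distance, vertical_fov_degrees, image_height):
    """Approximate world-space width of one pixel at a given distance.

    Assumes a pinhole camera and ignores surface orientation and any
    reflection or refraction along the path.
    """
    half_fov = math.radians(vertical_fov_degrees) / 2.0
    pixel_size_at_unit_distance = 2.0 * math.tan(half_fov) / image_height
    return distance * pixel_size_at_unit_distance


footprint = pixel_footprint(5.0, 90.0, 1000)
```

Since the estimate depends only on camera parameters and a distance, it cannot capture how a curved mirror or a refractive surface spreads or focuses the footprint, which is the inaccuracy the note refers to.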
This issue was referenced by blender/cycles@5db8d93df3
This issue was referenced by 0803119725
This issue was referenced by blender/cycles@579813f4f8
This issue was referenced by 4d4113adc2
Trying to figure out which parts of the shade_surface_raytrace kernel are using most local memory:
So roughly:
This issue was referenced by blender/cycles@6f7dd81db8
This issue was referenced by 001f548227
This issue was referenced by blender/cycles@9026179d2e
This issue was referenced by 0c52eed863
Changed title from 'Cycles X - GPU Performance' to: 'Cycles: GPU Performance'