EEVEE: Improved deferred lighting efficiency #129268

Open
opened 2024-10-20 12:01:58 +02:00 by Clément Foucault · 4 comments

Motivations

The current deferred light evaluation is implemented in a single shader that evaluates all closures for all lights affecting a pixel.
The shadowing is evaluated inline inside the light loop, which increases the complexity of the shader drastically (since we now use shadow map ray-tracing).
This makes the shader very heavy (+6000 ALU instructions, bad for occupancy) and very slow to compile (+2 sec each), which slows down startup time.

The entry point of the shader can be found in eevee_deferred_light_frag.glsl.
This task is about improving the efficiency of this shader and reducing its compile time.

Current state

Here is the pseudo code of the current shader:

```
// Shader invocation
for closure in gbuffer :
    closure.radiance = 0

for light in lights :
    if shadow(light) != 0:
        for closure in gbuffer :
            closure.radiance += ltc(light, closure)

for closure in gbuffer :
    write closure.radiance
```

Observations

  • We already split visibility and shading in a non-physical way. There is no dependency between shadow and the BSDFs.
  • Closures can be processed independently without much bandwidth increase, as the input and output are per closure (only the gbuffer header needs to be loaded again).
  • Only Rectangular lights use the special LTC path for polygonal lights.
  • The acceleration structure is quite complex (2 inner loops) and adds quite some register pressure.

An initial investigation showed that removing the shadow evaluation, bypassing the acceleration structure, loading only one closure, and removing the Rectangular light path reduced the ALU count to around 1000 and improved compilation time by an order of magnitude (need to check exact timings). (FPS perf?)

Moving lights data to UBOs might be beneficial too in terms of memory access speed, especially on low end hardware.

Proposal

Here is the proposed split evaluation approach:

```
for light in lights :
    // Shader invocation
    compute shadow
    write shadow at light.shadow_index

for closure in gbuffer :
    // Shader invocation
    closure.radiance = 0
    for light in lights :
        load shadow at light.shadow_index
        if shadow != 0:
            closure.radiance += ltc(light, closure)
    write closure.radiance
```

Splitting closure evaluation seems the easiest, but relies on the split shadow pass to avoid recomputing shadowing per closure.
Splitting shadow computation is quite involved and needs some design, as it has many requirements.
Splitting Rectangular light evaluation seems quite easy but needs adjustments in the optimization structure.
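The two-pass data flow above can be sketched on the CPU as follows. This is a minimal illustrative sketch, not the shader code: `shadow()` and `ltc()` are hypothetical placeholders for the real shadow tracing and LTC light integration, and lights/closures are plain dicts.

```python
def shadow(light, pixel):
    # Placeholder: fully lit unless this pixel is listed as occluded for the light.
    return 0.0 if pixel in light.get("occluders", ()) else 1.0

def ltc(light, closure):
    # Placeholder for the LTC / analytic light integration.
    return light["intensity"] * closure["weight"]

def deferred_shadow_pass(lights, pixel):
    """Pass 1: compute one shadow value per light, stored at light.shadow_index."""
    shadow_buf = [0.0] * len(lights)
    for light in lights:
        shadow_buf[light["shadow_index"]] = shadow(light, pixel)
    return shadow_buf

def deferred_light_pass(closures, lights, shadow_buf):
    """Pass 2: per-closure accumulation, reusing the stored shadow values."""
    out = []
    for closure in closures:
        radiance = 0.0
        for light in lights:
            s = shadow_buf[light["shadow_index"]]
            if s != 0.0:
                radiance += s * ltc(light, closure)
        out.append(radiance)
    return out
```

The key property is that pass 2 never calls `shadow()`, so adding more closures does not repeat the expensive shadow evaluation.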

Open Problems

How do we fall back when too many lights overflow the optimized pipeline?

  • Send over-budget lights to the current (slow) pipeline, or to a new (slow but simpler) pipeline. The old pipeline then needs to always be ready whenever there are more than 32 lights in the scene, which defeats the compilation optimization.
  • Dispatch the fast pipeline N times. N needs to be known on the CPU for appropriate acceleration structure allocation and/or consecutive dispatches. Empty dispatches have some cost.
  • Global switch on the CPU using the scene light count. This limits the efficiency of the new technique.
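The "dispatch the fast pipeline N times" option amounts to splitting the scene's lights into fixed-size batches on the CPU, one dispatch per batch. A minimal sketch, assuming the 32-light per-dispatch budget mentioned above:

```python
BATCH_SIZE = 32  # hypothetical per-dispatch light budget

def light_batches(light_indices, batch_size=BATCH_SIZE):
    """Split the light list into fixed-size batches, one GPU dispatch each."""
    return [light_indices[i:i + batch_size]
            for i in range(0, len(light_indices), batch_size)]
```

For example, 70 lights would yield 3 dispatches (32 + 32 + 6), and an empty scene yields no dispatch at all, which is where the cost of empty dispatches would otherwise show up.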

Dependencies

```mermaid
graph TD;
    ShadowDenoising-->DeferredShadow;
    DeferredShadow-->SplitClosure;
    SplitLightLoop;
```
Clément Foucault added the Type: Design, Module: Viewport & EEVEE, Interest: EEVEE labels 2024-10-20 12:01:59 +02:00
Member

One note about the 32 lights limit.
If instead of having one global index for each shadow we do something like:

```
shadow_index = 0
for l in lights:
    if light_attenuation_surface(l, P, N) > 0.0:
        store_or_load_shadow(shadow_index)
        shadow_index++
```

Then we wouldn't be limited to a fixed number of lights in the scene, but to a fixed number of lights overlapping in any single pixel, which sounds like a much more reasonable limitation.
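This per-pixel slot allocation can be sketched as follows. A minimal illustrative sketch, where `attenuation` stands in for `light_attenuation_surface(l, P, N)` evaluated at a fixed pixel:

```python
def per_pixel_shadow_indices(lights, attenuation):
    """Assign compact per-pixel shadow slots only to lights that
    actually affect the pixel (attenuation > 0)."""
    indices = {}
    slot = 0
    for light in lights:
        if attenuation(light) > 0.0:
            indices[light] = slot
            slot += 1
    return indices
```

The slot count is bounded by the number of lights overlapping the pixel, not by the scene's total light count, which is the point of the scheme.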

Author
Member

The goal is to reduce the complexity of individual shaders as much as possible. If all shadows are evaluated in the same pixel invocation, then we don't gain much compared to the current situation. Also, you get very bad data coherence if the set of lights differs per pixel.

Also, even with this method we need a fallback for more than 32 lights per pixel. 32 is not acceptable in a production environment. Leaving flickering (shadow? light?) when the limit is exceeded would be a major letdown compared to now.

So we need a solution that can be dispatched N times or doesn't have the limitation (stochastic light?).

Member

Regardless of the method used to render the shadows, I don't think we have to be limited to any fixed number.
At 1 bit per shadow we could store a lot more than 32 without too much memory consumption/bandwidth.
We could use a screen space virtual texture (or store the tiles per light cluster), with CPU readback to allocate/deallocate tiles as needed.

I would prefer a single method that can scale over a split between fast/fallback paths.

About evaluating different light types on separate shaders, we could store the shadow_index inside an image that can be read by the next shader invocation. Although I'm not sure if there's a way to synchronize that without killing performance.
Another alternative could be to use the light index inside the pixel's light cluster. But the less fine-grained the shadow selection is, the larger the memory cost of storing the masks.
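The 1-bit-per-shadow storage mentioned above can be sketched as a simple bitmask pack/unpack. This is illustrative only; the real storage would live in a screen-space texture or per-cluster buffer:

```python
def pack_shadow_mask(shadow_bits):
    """Pack a list of boolean shadow results into an integer bitmask,
    one bit per light (bit i set = light i is unoccluded)."""
    mask = 0
    for i, lit in enumerate(shadow_bits):
        if lit:
            mask |= 1 << i
    return mask

def is_lit(mask, light_index):
    """Read back the 1-bit shadow result for a single light."""
    return (mask >> light_index) & 1 == 1
```

At one bit per light, a single 32-bit word already covers the 32-light budget discussed above, and wider masks scale linearly.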

Author
Member

> I would prefer a single method that can scale over a split between fast/fallback paths.

I would also prefer that. I just listed it here as it is a viable option.

> At 1 bit per shadow we could store a lot more than 32 without too much memory consumption/bandwidth.

I am very worried about the memory consumption that EEVEE currently has. I would strive to avoid more per-pixel storage as this scales very badly. However, I do acknowledge that this is an option.

> with CPU readback to allocate/deallocate tiles as needed.

This is kind of similar to what I mentioned here: `Needs to be known on CPU for either acceleration structure appropriate allocation and/or consecutive dispatches. Empty dispatch have some costs.` But I believe the virtual texture approach would not need a different dispatch. However, the indirection also has a huge impact on performance (see Virtual Shadow Maps). Need to profile.

Reference: blender/blender#129268