EEVEE: Improved deferred lighting efficiency #129268

Open
opened 2024-10-20 12:01:58 +02:00 by Clément Foucault · 4 comments

Motivations

The current deferred light evaluation is implemented in a single shader that evaluates all closures for all lights affecting a pixel.
The shadowing is evaluated inline inside the light loop, which increases the complexity of the shader drastically (since we now use shadow map ray-tracing).
This makes the shader very heavy (+6000 ALU instructions, bad for occupancy) and very slow to compile (+2 sec each), which slows down startup time.

The entry point of the shader can be found in eevee_deferred_light_frag.glsl.
This task is about improving the efficiency of this shader and reducing its compile time.

Current state

Here is the pseudo code of the current shader:

```
// Shader invocation
for closure in gbuffer :
    closure.radiance = 0

for light in lights :
    if shadow(light) != 0:
        for closure in gbuffer :
            closure.radiance += ltc(light, closure)

for closure in gbuffer :
    write closure.radiance
```

Observations

  • We already split visibility and shading in a non-physical way. There is no dependency between shadow and the BSDFs.
  • Closures can be processed independently without much bandwidth increase, as the input and output are per closure (only the gbuffer header needs to be loaded again).
  • Only Rectangular lights use the special LTC path for polygonal lights.
  • The acceleration structure is quite complex (2 inner loops) and adds quite some register pressure.

An initial investigation showed that removing the shadow evaluation, bypassing the acceleration structure, loading only one closure, and removing the Rectangular light path reduced the ALU count to around 1000 and improved compilation time by an order of magnitude (need to check exact timings). (FPS perf?)

Moving lights data to UBOs might be beneficial too in terms of memory access speed, especially on low end hardware.

Proposal

Here is the proposed split evaluation approach:

```
for light in lights :
    // Shader invocation
    compute shadow
    write shadow at light.shadow_index

for closure in gbuffer :
    // Shader invocation
    closure.radiance = 0
    for light in lights :
        load shadow at light.shadow_index
        if shadow != 0:
            closure.radiance += ltc(light, closure)
    write closure.radiance
```

Splitting closure evaluation seems the easiest, but relies on the split shadow pass to avoid recomputing shadowing per closure.
Splitting shadow computation is quite involved and needs some design, as it has many requirements.
Splitting Rectangular light evaluation seems quite easy but needs adjustments in the optimization structure.
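The two-pass data flow above can be sketched on the CPU as follows. This is a minimal illustrative sketch, not the shader code: `shadow()` and `ltc()` are hypothetical placeholders for the real shadow tracing and LTC light integration, and lights/closures are plain dicts.

```python
def shadow(light, pixel):
    # Placeholder: fully lit unless this pixel is listed as occluded for the light.
    return 0.0 if pixel in light.get("occluders", ()) else 1.0

def ltc(light, closure):
    # Placeholder for the LTC / analytic light integration.
    return light["intensity"] * closure["weight"]

def deferred_shadow_pass(lights, pixel):
    """Pass 1: compute one shadow value per light, stored at light.shadow_index."""
    shadow_buf = [0.0] * len(lights)
    for light in lights:
        shadow_buf[light["shadow_index"]] = shadow(light, pixel)
    return shadow_buf

def deferred_light_pass(closures, lights, shadow_buf):
    """Pass 2: per-closure accumulation, reusing the stored shadow values."""
    out = []
    for closure in closures:
        radiance = 0.0
        for light in lights:
            s = shadow_buf[light["shadow_index"]]
            if s != 0.0:
                radiance += s * ltc(light, closure)
        out.append(radiance)
    return out
```

The key property is that pass 2 never calls `shadow()`, so adding more closures does not repeat the expensive shadow evaluation.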

Open Problems

How do we fall back when too many lights overflow the optimized pipeline?

  • Send over-budget lights to the current (slow) pipeline, or to a new (slow but simpler) pipeline. The old pipeline then needs to always be ready whenever there are more than 32 lights in the scene, which defeats the compilation optimization.
  • Dispatch the fast pipeline N times. N needs to be known on the CPU for appropriate acceleration structure allocation and/or consecutive dispatches. Empty dispatches have some cost.
  • Global switch on the CPU using the scene light count. This limits the efficiency of the new technique.
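The "dispatch the fast pipeline N times" option amounts to splitting the scene's lights into fixed-size batches on the CPU, one dispatch per batch. A minimal sketch, assuming the 32-light per-dispatch budget mentioned above:

```python
BATCH_SIZE = 32  # hypothetical per-dispatch light budget

def light_batches(light_indices, batch_size=BATCH_SIZE):
    """Split the light list into fixed-size batches, one GPU dispatch each."""
    return [light_indices[i:i + batch_size]
            for i in range(0, len(light_indices), batch_size)]
```

For example, 70 lights would yield 3 dispatches (32 + 32 + 6), and an empty scene yields no dispatch at all, which is where the cost of empty dispatches would otherwise show up.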

Dependencies

```mermaid
graph TD;
    ShadowDenoising-->DeferredShadow;
    DeferredShadow-->SplitClosure;
    SplitLightLoop;
```
Clément Foucault added the Type: Design, Module: Viewport & EEVEE, Interest: EEVEE labels 2024-10-20 12:01:59 +02:00
Member

One note about the 32 lights limit.
If instead of having one global index for each shadow we do something like:

```
shadow_index = 0
for l in lights:
    if light_attenuation_surface(l, P, N) > 0.0:
        store_or_load_shadow(shadow_index)
        shadow_index++
```

Then we wouldn't be limited to a fixed number of lights in the scene, but to a fixed number of lights overlapping in any single pixel, which sounds like a much more reasonable limitation.
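This per-pixel slot allocation can be sketched as follows. A minimal illustrative sketch, where `attenuation` stands in for `light_attenuation_surface(l, P, N)` evaluated at a fixed pixel:

```python
def per_pixel_shadow_indices(lights, attenuation):
    """Assign compact per-pixel shadow slots only to lights that
    actually affect the pixel (attenuation > 0)."""
    indices = {}
    slot = 0
    for light in lights:
        if attenuation(light) > 0.0:
            indices[light] = slot
            slot += 1
    return indices
```

The slot count is bounded by the number of lights overlapping the pixel, not by the scene's total light count, which is the point of the scheme.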

Author
Member

The goal is to reduce the complexity of individual shaders as much as possible. If all shadows are evaluated in the same pixel invocation, then we don't gain much compared to the current situation. Also, you get very bad data coherence if the set of lights differs per pixel.

Also, even with this method we need a fallback for more than 32 lights per pixel. 32 is not acceptable in a production environment. Leaving flickering (shadow? light?) when the limit is exceeded would be a major letdown compared to now.

So we need a solution that can be dispatched N times or doesn't have the limitation (stochastic light?).

Member

Regardless of the method used to render the shadows, I don't think we have to be limited to any fixed number.
At 1 bit per shadow we could store a lot more than 32 without too much memory consumption/bandwidth.
We could use a screen space virtual texture (or store the tiles per light cluster), with CPU readback to allocate/deallocate tiles as needed.

I would prefer a single method that can scale over a split between fast/fallback paths.

About evaluating different light types on separate shaders, we could store the shadow_index inside an image that can be read by the next shader invocation. Although I'm not sure if there's a way to synchronize that without killing performance.
Another alternative could be to use the light index inside the pixel's light cluster. But the less fine-grained the shadow selection is, the larger the memory cost of storing the masks.
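The 1-bit-per-shadow storage mentioned above can be sketched as a simple bitmask pack/unpack. This is illustrative only; the real storage would live in a screen-space texture or per-cluster buffer:

```python
def pack_shadow_mask(shadow_bits):
    """Pack a list of boolean shadow results into an integer bitmask,
    one bit per light (bit i set = light i is unoccluded)."""
    mask = 0
    for i, lit in enumerate(shadow_bits):
        if lit:
            mask |= 1 << i
    return mask

def is_lit(mask, light_index):
    """Read back the 1-bit shadow result for a single light."""
    return (mask >> light_index) & 1 == 1
```

At one bit per light, a single 32-bit word already covers the 32-light budget discussed above, and wider masks scale linearly.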

Author
Member

> I would prefer a single method that can scale over a split between fast/fallback paths.

I would also prefer that. I just listed it here as it is a viable option.

> At 1 bit per shadow we could store a lot more than 32 without too much memory consumption/bandwidth.

I am very worried about the memory consumption that EEVEE currently has. I would strive to avoid more per-pixel storage as this scales very badly. However, I do acknowledge that this is an option.

> with CPU readback to allocate/deallocate tiles as needed.

This is kind of similar to what I mentioned here: `Needs to be known on CPU for either acceleration structure appropriate allocation and/or consecutive dispatches. Empty dispatch have some costs.` But I believe the virtual texture approach would not need a different dispatch. However, the indirection also has a huge impact on performance (see Virtual Shadow Maps). Need to profile.

Reference: blender/blender#129268