EEVEE-Next: Reduce longer compilation time #120100

Closed
opened 2024-03-31 00:34:24 +01:00 by Clément Foucault · 13 comments

It seems that NVIDIA drivers have a harder time with our new shaders for some reason (~3x from what I read).
The first cold start of EEVEE-Next also takes quite a bit longer than I would like (a few seconds without any feedback). We could display a black frame with a waiting message instead, but ideally it should take less than 3 seconds.

So I think we have no choice but to optimize it at least a little for the first release.

The first idea that comes to mind is to look at which part of the code is causing the most slowdown on the affected drivers and fix it.
My guess is that it is likely caused by aggressive loop unrolling, like it was on Metal + M1 before the recent fix. However, working around it seems quite tricky, as there are no clear preprocessor directives for loop unrolling in GLSL and the extension that adds them looks largely unsupported.

The other approach would just be to reduce code size as much as possible. We could try to preprocess the GLSL string using our own obfuscator at compile time, but that looks unrealistic and the benefits are not quite clear.

Instead we should leverage SpirV. This can give several solutions:

  • Feed SpirV directly to the driver. Most of the hardware we target supports the ARB_gl_spirv extension. This saves the driver from doing the parsing and most of the conversion.
  • Compile to SpirV (shaderc), then convert back to GLSL (spirvcross) to feed the driver already optimized GLSL without dead code or comments. This can work on older implementations that do not support the ARB_gl_spirv extension.
  • Multithreaded compilation. If we use shaderc (and optionally spirvcross) to compile the shaders, then it becomes easier to precompile them in many threads without needing GL contexts.

The GLSL interface might need a bit of tweaking before it can be fed to shaderc, but I am convinced this is worth the cost.

Note that all of these options should be profiled beforehand on a set of typical EEVEE-Next shaders to check what is the best way forward.

Note that this task is not proposing to ship precompiled SpirV shader sources.

The Vulkan backend would give us all of these, but the timeline for it to become the default does not align with either the initial or the second release of EEVEE-Next.

Multithreaded compilation in OpenGL

We have to change our compilation model to accommodate that.
The goal is to use the parallel shader compile extension. It doesn't need a separate context to work, but we need to rework the interface with the GPU module first.

https://forums.developer.nvidia.com/t/bugs-with-gl-arb-parallel-shader-compile/43715/8
https://www.reddit.com/r/opengl/comments/121j3q1/seeking_clarifications_on_multithreaded_shader/

Clément Foucault added this to the 4.2 LTS milestone 2024-03-31 00:34:24 +01:00
Clément Foucault added the Interest: EEVEE and Type: To Do labels 2024-03-31 00:34:24 +01:00
Member

I'm leaving here my GPU_shader_create_from_info times for reference:

eevee_shadow_tag_usage_opaque : 0.113712s
eevee_shadow_tag_usage_transparent : 0.878801s
eevee_deferred_planar_eval : 2.325970s
eevee_reflection_probe_remap : 0.046838s
eevee_reflection_probe_convolve : 0.030921s
eevee_reflection_probe_irradiance : 0.036377s
eevee_reflection_probe_select : 0.042256s
eevee_depth_of_field_setup : 0.027596s
eevee_depth_of_field_stabilize : 0.051766s
eevee_depth_of_field_downsample : 0.021206s
eevee_depth_of_field_reduce : 0.040375s
eevee_depth_of_field_tiles_flatten : 0.014965s
eevee_depth_of_field_tiles_dilate_minmax : 0.022597s
eevee_depth_of_field_tiles_dilate_minabs : 0.023608s
eevee_depth_of_field_gather_foreground_no_lut : 0.096645s
eevee_depth_of_field_gather_background_no_lut : 0.096074s
eevee_depth_of_field_filter : 0.016083s
eevee_depth_of_field_scatter : 0.033563s
eevee_depth_of_field_hole_fill : 0.068203s
eevee_depth_of_field_resolve_no_lut : 0.083184s
eevee_ray_tile_classify : 0.088027s
eevee_ray_tile_compact : 0.022269s
eevee_ray_generate : 0.104781s
eevee_ray_trace_planar : 0.098985s
eevee_ray_trace_screen : 0.109001s
eevee_ray_trace_fallback : 0.073635s
eevee_ray_denoise_spatial : 0.101829s
eevee_ray_denoise_temporal : 0.054006s
eevee_ray_denoise_bilateral : 0.198941s
eevee_horizon_setup : 0.082341s
eevee_horizon_scan : 0.055821s
eevee_horizon_denoise : 0.031025s
eevee_horizon_resolve : 0.221292s
eevee_hiz_update : 0.000483s
eevee_hiz_update_layer : 0.000378s
eevee_film_frag : 0.067119s
eevee_volume_scatter_with_lights : 0.358564s
eevee_volume_integration : 0.043462s
eevee_volume_resolve : 0.046383s
eevee_shadow_clipmap_clear : 0.009547s
eevee_shadow_tilemap_bounds : 0.039633s
eevee_shadow_tilemap_init : 0.016634s
eevee_shadow_tag_update : 0.033414s
eevee_shadow_tag_usage_volume : 0.316138s
eevee_shadow_page_mask : 0.043424s
eevee_shadow_page_free : 0.021147s
eevee_shadow_page_defrag : 0.029916s
eevee_shadow_page_allocate : 0.019219s
eevee_shadow_tilemap_finalize : 0.118569s
eevee_shadow_tilemap_amend : 0.041987s
eevee_shadow_page_clear : 0.027404s
eevee_light_culling_select : 0.059443s
eevee_light_culling_sort : 0.025876s
eevee_light_culling_zbin : 0.015785s
eevee_light_culling_tile : 0.053508s
eevee_subsurface_setup : 0.072863s
eevee_subsurface_convolve : 0.093999s
eevee_deferred_tile_classify : 0.028246s
eevee_deferred_light_double : 2.886378s
eevee_deferred_light_single : 2.820391s
eevee_deferred_combine : 0.101404s
eevee_lightprobe_irradiance_world : 0.012422s

It looks like light_eval might be the worst offender.
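
To make the ranking explicit, here is a minimal Python sketch that parses lines in the `name : seconds` format used above and sorts them by compile time. Only a handful of entries from the list are included to keep it short, and `parse_timings` is a hypothetical helper, not something that exists in Blender.

```python
def parse_timings(text):
    """Parse 'shader_name : 1.234567s' lines into (name, seconds) pairs."""
    timings = []
    for line in text.strip().splitlines():
        name, _, value = line.partition(" : ")
        timings.append((name.strip(), float(value.strip().rstrip("s"))))
    return timings


# A subset of the measurements listed above.
SAMPLE = """
eevee_shadow_tag_usage_transparent : 0.878801s
eevee_deferred_planar_eval : 2.325970s
eevee_volume_scatter_with_lights : 0.358564s
eevee_deferred_light_double : 2.886378s
eevee_deferred_light_single : 2.820391s
eevee_hiz_update : 0.000483s
"""

# Slowest shaders first.
ranked = sorted(parse_timings(SAMPLE), key=lambda item: item[1], reverse=True)
for name, seconds in ranked[:3]:
    print(f"{name:40s} {seconds:.3f}s")
```

With this subset, the two `eevee_deferred_light_*` shaders and `eevee_deferred_planar_eval` come out on top.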

After some googling, I've found some Nvidia directives that help quite a bit:

#pragma optionNV(unroll none)
#pragma optionNV(inline none)

eevee_deferred_light_double : 0.879084s
eevee_deferred_light_single : 0.861953s

I haven't found any documentation related to these, though.
And I haven't checked runtime performance either.
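
For illustration, this is roughly how the two directives could be injected into a GLSL source string only on the affected vendor. It is a sketch under assumptions: `inject_nv_pragmas` and the `is_nvidia` flag are hypothetical (the flag could come from something like the `GL_VENDOR` string), and the only GLSL rule relied on is that `#version` has to stay the first directive.

```python
NV_COMPILE_PRAGMAS = (
    "#pragma optionNV(unroll none)\n"
    "#pragma optionNV(inline none)\n"
)


def inject_nv_pragmas(glsl_source, is_nvidia):
    """Insert the pragmas right after the #version line, which must stay first."""
    if not is_nvidia:
        return glsl_source
    lines = glsl_source.splitlines(keepends=True)
    for index, line in enumerate(lines):
        if line.lstrip().startswith("#version"):
            if not line.endswith("\n"):
                lines[index] = line + "\n"
            lines.insert(index + 1, NV_COMPILE_PRAGMAS)
            return "".join(lines)
    # No #version directive found: put the pragmas at the very top.
    return NV_COMPILE_PRAGMAS + glsl_source


source = "#version 430\nvoid main() {}\n"
print(inject_nv_pragmas(source, is_nvidia=True))
```

On other vendors the source is returned untouched, so the pragmas never reach drivers that do not know them.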

Regardless, I agree that using SPIRV may be the best option moving forward.

Miguel Pozo added the Module: Viewport & EEVEE label 2024-04-01 11:38:41 +02:00
Author
Member

@pragma37 This is only for the cold startup of the engine. Even if it is good to optimize (and needed), the main friction point is the cost of material compilation.

> @pragma37 This is only for cold startup of the engine. Even if this is good to optimize (and is needed) the main friction point is the cost of material compilation.

Add that to your notes!

Even with an RTX 4070 8 GB I have the same issue 🤷‍♂️🤷‍♂️ It took 4 to 8 seconds to run the first time! Less the second time, but it is still a long wait before anything shows on screen.

Member

> the main friction point is the cost of material compilation.

Doesn't seem to be the case for all materials:
https://devtalk.blender.org/t/blender-4-2-eevee-next-feedback/31813/416

My guess is that purely deferred materials are probably OK, but forward and ShaderToRGB materials must be getting hit by the `light_eval` overhead as well.

> Multithread compilation. If we use shaderc (and optionally spirvcross) to compile the shader, then it becomes easier to precompile the shaders in many threads without needing GL contexts.

I'm wondering if parts of that could be done without going to SPIR-V and friends. My impression (which I have not validated or checked myself) is that even in ye olde OpenGL it is possible to do "multi-threaded" shader creation/compilation without resorting to multiple OpenGL contexts. The pattern of function calls just has to be something along the lines of:

// pseudocode; possibly call glMaxShaderCompilerThreadsARB first

for (shader : many_shaders) {
    glCompileShader(shader.programs)
    glLinkProgram(shader.linkedresult)
    // crucially: do NOT check for errors or query uniforms
}
// now, once all shaders have started compiling,
for (shader : many_shaders) {
    check_result_of_linking(shader.linkedresult)
}

instead of the current code flow, which is:

for (shader : many_shaders) {
    glCompileShader(shader.programs)
    glLinkProgram(shader.linkedresult)
    check_result_of_linking(shader.linkedresult)
}

The current way of doing things (which is "for each shader: fully compile said shader") does not allow multi-threading even for other APIs (like Metal or Vulkan) that could do it. So maybe something like a "create many shaders" function would need to be added to the GPU backend, and the backend could decide how best to deal with it.
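
The two-phase pattern can be modelled outside of OpenGL. The Python sketch below is only an analogy: `fake_driver_compile` stands in for the driver's compile and link work, `create_many_shaders` is a hypothetical batch entry point, and a thread pool plays the role of the driver's compiler threads. The point it illustrates is the ordering: every shader is submitted before any result is queried.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_driver_compile(source):
    """Stand-in for the driver's compile + link work (hypothetical)."""
    time.sleep(0.05)
    return "error" not in source  # pretend sources containing 'error' fail to link


def create_many_shaders(sources, max_threads=4):
    """Phase 1: submit every shader. Phase 2: only then query the results."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # Phase 1: start all compilations without waiting on any of them.
        pending = {name: pool.submit(fake_driver_compile, src)
                   for name, src in sources.items()}
        # Phase 2: querying a result blocks, so it happens after everything is queued.
        return {name: future.result() for name, future in pending.items()}


sources = {"a": "void main() {}", "b": "error", "c": "void main() {}"}
results = create_many_shaders(sources)
print(results)
```

Querying each result inside the first loop would serialize the work again, which is the behavior of the current code flow.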

Member

I think it's worth giving @aras_p's suggestion a try, especially since we will have to do something along those lines regardless of the compilation method.

That said, that's not going to solve the issue with single shaders taking seconds to compile, which is pretty bad for the material editing UX.
For Nvidia, we could use the GPU_material_optimization system for disabling loop unrolling and inlining in the first compilation, but I'm not sure about the other GPU vendors (or if there's a problem with those in the first place).

I've just checked, and it looks like compile times took a big hit after #119713:

| Shader | Before | After |
| --- | --- | --- |
| eevee_deferred_light_double | 1.39s | 2.53s |
| eevee_deferred_light_single | 1.37s | 2.47s |

With loop unrolling and inlining disabled, the difference is not that bad, though:

| Shader | Before | After |
| --- | --- | --- |
| eevee_deferred_light_double | 0.71s | 0.87s |
| eevee_deferred_light_single | 0.69s | 0.86s |
Member

That compile times regression has been fixed by 2d3368f5bf.

Member

There's been a pretty bad regression in compile times recently:

eevee_deferred_light_double 5.80s
eevee_deferred_light_single 4.96s

We may need to start testing compile times regularly, or test automatically if possible.
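
As a rough idea of what an automated check could look like, the sketch below compares a stored baseline against new timings and reports shaders that got slower by more than a tolerance. The `find_regressions` helper and the 1.2x tolerance are assumptions for illustration; the numbers are the before/after values measured around #119713 earlier in this thread.

```python
def find_regressions(baseline, current, tolerance=1.2):
    """Return {shader: ratio} for shaders whose time grew past tolerance x baseline."""
    regressions = {}
    for name, old_time in baseline.items():
        new_time = current.get(name)
        if new_time is not None and new_time > old_time * tolerance:
            regressions[name] = new_time / old_time
    return regressions


# Compile times in seconds, before and after #119713.
baseline = {"eevee_deferred_light_double": 1.39, "eevee_deferred_light_single": 1.37}
current = {"eevee_deferred_light_double": 2.53, "eevee_deferred_light_single": 2.47}

regressions = find_regressions(baseline, current)
for name, ratio in regressions.items():
    print(f"{name}: {ratio:.2f}x slower")
```

Compile times depend heavily on the driver and hardware, so a check like this is only meaningful when baseline and current runs come from the same machine.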

Member

We can add --debug-gpu-compile-shader as a benchmark task so it will be tracked.

Member

I’ve been taking a look at ways to improve shader compilation times.

As a recap, there are 3 main issues on the user side (mainly on Nvidia):

  1. Engine startup. The first time EEVEE-Next is enabled for any given Blender version, it can make Blender freeze for over 30s even on high-end CPUs. This is caused by static shader compilation.
  2. Scene startup. The first time EEVEE-Next loads a scene for a given Blender version, it needs to compile all its materials. The times here depend on the number and complexity of the materials, but it’s not rare for it to take up to several minutes in moderately complex scenes.
  3. Material editing. Any time a material is modified, it may require recompilation. The time varies per material, but it can take several seconds for materials that have inline lighting evaluation.

I don't think there's a silver bullet to fix all of these at once, and we may require multiple strategies with different levels of complexity:

Find the cause of the compile times slow-downs

AFAIK, there's no tooling for figuring out where the compiler is taking more time. We can infer by checking differences between shaders and sometimes we can find regressions, but fully fixing compile times this way would be extremely time consuming, and maybe not even fully possible.

The currently known main cause of compile times slow-downs is light and shadow evaluation:

  • The heavy regressions from #120329 should be fixed by implementing the TODOs from eevee_deferred_light_frag.glsl.
    (This only affects the internal static shader).
  • #119713 caused a regression in compile times (~1.4x).
  • Using push constants for shadow ray count/steps lowers compile times, but might affect render performance.

Test results done after the last detected regression (#120329) (Cumulative):

# remove refraction & subsurf (#120329 partial revert)
eevee_deferred_light_double : 1.962147s
eevee_deferred_light_single : 1.457944s

# shadow ray count/steps push constants
eevee_deferred_light_double : 1.521318s
eevee_deferred_light_single : 1.476455s

# closure count push constants
eevee_deferred_light_double : 1.489501s
eevee_deferred_light_single : 0.004543s (Same shader, so it's cached)

2-step compilation

As mentioned before, disabling loop unrolling can heavily improve compile times, but it can also heavily degrade render performance.
A possible solution could be to use the Material optimization system to compile materials without loop unrolling first and deferring the optimized (unrolled) version compilation.
The main issue with this approach is that even shader compilation on a separate OpenGL context freezes the Blender UI, so it makes things even worse in practice.
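
The bookkeeping for such a two-step scheme might look like the sketch below. Everything in it is hypothetical (`compile_shader`, `TwoStepCompiler`, the variant names), and it deliberately does not address the UI freeze described above; it only shows the "fast variant now, optimized variant later" swap.

```python
from collections import deque


def compile_shader(source, unroll):
    """Stand-in for a real compile; returns a record describing the variant (hypothetical)."""
    return {"source": source, "variant": "optimized" if unroll else "no_unroll"}


class TwoStepCompiler:
    def __init__(self):
        self.active = {}         # material name -> shader currently used for drawing
        self.deferred = deque()  # materials still waiting for the optimized variant

    def add_material(self, name, source):
        # Step 1: quick compile without loop unrolling so something can be drawn soon.
        self.active[name] = compile_shader(source, unroll=False)
        self.deferred.append((name, source))

    def process_deferred(self, budget=1):
        # Step 2: later (for example when idle), swap in the optimized variant.
        for _ in range(min(budget, len(self.deferred))):
            name, source = self.deferred.popleft()
            self.active[name] = compile_shader(source, unroll=True)


compiler = TwoStepCompiler()
compiler.add_material("metal", "void main() {}")
first = compiler.active["metal"]["variant"]
compiler.process_deferred()
second = compiler.active["metal"]["variant"]
print(first, "->", second)
```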

SPIRV

AFAIK, SPIRV compilation would be the equivalent to glCompileShader, while the main bottleneck comes from glLinkProgram, so optimizing shader compilation instead of program linking could yield minor improvements at best.
It's also worth noting that regular GLSL and SPIRV GLSL aren't fully compatible, so getting this to work even if just for testing is far from trivial.

Multithreaded Compilation

I've made a branch with GL_ARB/KHR_parallel_shader_compile support (#121093), and while compile times become around twice as fast, I wouldn't consider this good enough to mark the problem as solved.
It also has the problem of making the viewport even more unresponsive during material compilation.

I've made a standalone test (https://projects.blender.org/pragma37/test-parallel-shader-compilation) to verify I wasn't doing anything wrong on the Blender side and the performance difference was mostly on par.
I've managed to make it faster (from 2x up to 4x) after some tweaks, but I wasn't able to get the same results on Blender.
So it may be possible that the Blender implementation could be improved, but it may also be just a case of the standalone app being a much simpler context that the driver can manage more easily.

In any case, even a 4x improvement on a 24-thread CPU is far from optimal.
So far I got the best results (10x) by spawning a new process for each program compilation (the spawner.py script in the repo).
This actually manages to put my CPU at 100% usage, and it would also have the advantage of not blocking the Blender UI.

However, this would require:

  • Spawning new processes from Blender.
  • Storing the compiled binary cache to disk.
  • Detecting when the binary has been written and reading it from Blender.

Or maybe just use subprocesses and pipes?

While I think this might be the best way to go, it opens several cans of worms and it seems too risky to implement at the end of BCON1.
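
To give an idea of the "subprocesses and pipes" variant, here is a minimal Python sketch. It is not the `spawner.py` script from the repo: the worker is a hypothetical stand-in that just hashes its input, where a real one would presumably need its own GL context and would send back the compiled program binary. What it shows is the plumbing: start every worker first, then read each result back through its pipe, with no files on disk.

```python
import hashlib
import subprocess
import sys

# Hypothetical worker: reads shader source on stdin and writes a fake "binary" to stdout.
WORKER = (
    "import sys, hashlib;"
    "data = sys.stdin.buffer.read();"
    "sys.stdout.buffer.write(hashlib.sha256(data).digest())"
)


def compile_in_subprocesses(sources):
    """Start one worker per shader, then collect every result through its pipe."""
    workers = []
    for name, source in sources.items():
        proc = subprocess.Popen([sys.executable, "-c", WORKER],
                                stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        proc.stdin.write(source.encode("utf-8"))
        proc.stdin.close()  # end of input, so the worker can start right away
        workers.append((name, proc))
    binaries = {}
    for name, proc in workers:
        binaries[name] = proc.stdout.read()
        proc.stdout.close()
        if proc.wait() != 0:
            raise RuntimeError(f"worker for {name} failed")
    return binaries


sources = {"light_single": "void main() {}", "light_double": "void main() { }"}
binaries = compile_in_subprocesses(sources)
print({name: len(blob) for name, blob in binaries.items()})
```

One process per shader is the simplest form. A fixed pool of long-lived workers would cap memory usage, which matters given that each process carries its own driver state.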

Member

Your conclusion for SPIR-V is in line with our previous conclusion. So yeah.
About process spawning: we used to do that for Cycles OpenCL in the past. It is far from ideal, as it is also bound by the amount of RAM the user has, and error logging is tricky and very error-prone.

Just to mention: a thing I am considering for Vulkan is adding support for VK_EXT_graphics_pipeline_library. This allows compiling smaller shader parts (vertex input, vertex, fragment, attachment out) and picking and choosing among them when creating a pipeline.

Member

I agree that relying on subprocesses is far from ideal, but I don't see any other alternative for significantly improving shader compile times.


I've left this out of my previous report since I was getting inconsistent readings, but today I've measured again and I'm getting pretty consistent results.

Skip unnecessary material passes

EEVEE-Next requires more pass types per material, which results in compiling up to 3x more shaders than EEVEE Legacy.
However, many of these shaders are functionally equivalent to the default one (when vertex displacement and transparency are disabled), so we can detect these cases and skip the compilation. (#121137)
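
A toy version of that detection, to show where the savings come from. This is not the code from #121137: the pass names, the material dictionaries and `get_pass_shader` are all made up for the example. Materials that use neither displacement nor transparency share one default shader for their non-surface passes instead of compiling their own.

```python
compiled = {}  # cache key -> shader object (a string here, as a stand-in)


def get_pass_shader(material, pass_type):
    """Return a shader for the pass, reusing the default one when the material can't affect it."""
    customizes_pass = material["displacement"] or material["transparency"]
    if pass_type != "surface" and not customizes_pass:
        key = ("default", pass_type)         # shared by all such materials
    else:
        key = (material["name"], pass_type)  # needs its own compilation
    if key not in compiled:
        compiled[key] = f"shader<{key[0]}:{pass_type}>"  # stand-in for a real compile
    return compiled[key]


materials = [
    {"name": "wood", "displacement": False, "transparency": False},
    {"name": "glass", "displacement": False, "transparency": True},
    {"name": "stone", "displacement": False, "transparency": False},
]
for material in materials:
    for pass_type in ("surface", "depth", "shadow"):
        get_pass_shader(material, pass_type)
print(len(compiled), "shaders compiled for", len(materials) * 3, "material passes")
```

In this toy setup, 7 shaders are compiled for 9 material passes, and every additional opaque, non-displaced material only adds its surface shader.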

Author
Member

With the new parallel and non-blocking compilation (optional) I consider this issue solved.

Blender Bot added the Status: Archived label 2024-07-02 17:54:58 +02:00
Reference: blender/blender#120100