Metal: EEVEE Next: Optimize Virtual shadow maps for Apple Silicon #111283
WIP PR for optimization of EEVEE Next's Virtual Shadow Maps for
TBDRs. The core of these optimizations lies in eliminating use of
atomic shadow atlas writes, instead utilising tile memory to
perform depth accumulation as a secondary pass once all
geometry updates for a given shadow view have completed.
This also allows use of fast on-tile depth testing/sorting, reducing
overdraw and redundant fragment operations, while also allowing
tile indirection calculations to be offloaded into the vertex
shader to increase fragment storage efficiency and throughput.
Authored by Apple: Michael Parkin-White
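As a rough illustration of the approach described above (a toy Python model of my own, not the actual Metal/GLSL implementation): instead of each fragment atomically min-ing its depth into the atlas, depth is accumulated in fast tile-local memory during rasterization, and a final pass writes the finished tile to the atlas in bulk.

```python
import math

TILE_RES = 4  # illustrative tile resolution, not the real constant


def update_shadow_tile(fragments):
    """Toy model of the tile-memory update: depth-test against on-tile
    memory during raster, then store the whole tile to the atlas once."""
    # Pass 1: clear tile-local depth (cheap, stays in tile memory).
    tile = [[math.inf] * TILE_RES for _ in range(TILE_RES)]
    # Pass 2: rasterize; the on-tile depth test replaces atomic atlas writes.
    for x, y, depth in fragments:
        if depth < tile[y][x]:
            tile[y][x] = depth
    # Pass 3: single bulk write of the finished tile to the atlas region.
    return [row[:] for row in tile]
```

The win on TBDR hardware comes from passes 1 and 2 never leaving tile memory; only the final store touches the atlas.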
Further work required:
- Optimization: Consider replacing the U32 shadow atlas with F32.
- Add a pass before shadow geometry raster to ensure only tiles with active geometry can raster.
  - Initially clear all other tiles to zero instead.
  - Use the same quad generation shader to clear active update regions to 1.0, allowing depth contents to be rendered.
- Determine if there is a way to reduce layer usage in the framebuffer or the accumulation quad geometry dispatch, in order to reduce the intermediate memory pressure of the large virtual tile bin. (To be resolved in a follow-up patch.)
- Benchmark other phases in shadow gen. Optimize workgroup sizes for each sub-phase where appropriate.
- Remove redundant bindings for the surface shadow shader (e.g. texture atlas, tile info bufs, etc.), as these carry a subtle static resource cost.
- Remove the SHADOW_PAGE_CLEAR pass in favour of storing cleared values within the accumulation pass.
Other potential work:
I would say it is better to split this patch into :
This way it is easier to discuss each separate topic. Some are easier to accept than others. But I already left a few comments about code organization.
There seem to be some formatting differences in mtl_shader_defines.msl and some other files. You can format using make format in the blender source folder (the git repo root folder) so that it uses the clang-format from the lib folder.
@ -573,6 +573,8 @@ set(GLSL_SRC
engines/eevee_next/shaders/eevee_surf_forward_frag.glsl
engines/eevee_next/shaders/eevee_surf_lib.glsl
engines/eevee_next/shaders/eevee_surf_shadow_frag.glsl
engines/eevee_next/shaders/eevee_surf_shadow_accum_vert.glsl
This should be eevee_shadow_page_write_frag/vert.glsl. The eevee_surf_* are fragment shaders that can match the eevee_geom_* vertex shaders.
@ -160,0 +165,4 @@
#ifdef WITH_METAL_BACKEND
/* Perform accumulation step for storing final tile depth results back to shadow atlas. */
accum_ps_.init();
This is not an accumulation pass; it just copies the result to the atlas. It should be renamed page_write_ps and moved to the shadow module instead of being in the pipeline. Also, there is no need for the Camera UBO, the utility texture, etc... verify all resources are needed.
@ -102,6 +102,9 @@ class ShadowPipeline {
Instance &inst_;
PassMain surface_ps_ = {"Shadow.Surface"};
#ifdef WITH_METAL_BACKEND
I wouldn't guard any of this behind WITH_METAL_BACKEND, as it would be easy to break. I would even go to the extent of supporting an emulated path on other platforms for checking that this feature works. Maybe the only thing that needs an ifdef for now is the shader. Making it clear that it is a separate technique in a subsection of the ShadowModule class would be good.
@ -1175,2 +1201,4 @@
shadow_multi_view_.compute_procedural_bounds();
#ifdef WITH_METAL_BACKEND
/* Specify load-store config right before render to ensure flags are retained. */
From an API perspective, wouldn't it be better to have this stored as a state of the framebuffer?
@ -259,2 +259,3 @@
/** Framebuffer with the atlas_tx attached. */
Framebuffer render_fb_;
#ifdef WITH_METAL_BACKEND
Framebuffer render_fb_ = {"shadow_write_framebuffer"};
This shouldn't be in the #ifdef block.
@ -12,3 +12,3 @@
#pragma BLENDER_REQUIRE(eevee_shadow_tilemap_lib.glsl)
shared bool directional_range_changed;
shared uint directional_range_changed;
I made this change a few days ago; I did not remember this was forbidden. I have now amended our GLSL style guide to include this.
@ -69,3 +69,3 @@
ivec2 tile_shifted = tile_co + tilemap.grid_shift;
/* Ensure value is shifted into positive range to avoid modulo on negative. */
ivec2 tile_wrapped = ivec2((ivec2(SHADOW_TILEMAP_RES) + tile_shifted) % SHADOW_TILEMAP_RES);
ivec2 tile_wrapped = ivec2((ivec2(SHADOW_TILEMAP_RES * SHADOW_TILEMAP_PER_ROW) + tile_shifted) %
This is unlikely to be correct, and is misleading. It might be preferable to clamp tilemap.grid_shift to the [-SHADOW_TILEMAP_RES, SHADOW_TILEMAP_RES] range to compute tile_shifted. Higher values don't matter.
@ -0,0 +74,4 @@
/** Interpolate output texel */
/* Using bitwise ops is way faster than integer ops. */
const int page_shift = SHADOW_PAGE_LOD;
Unused
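The clamp suggested above for tilemap.grid_shift can be sketched in Python (illustrative constant value, not the real one):

```python
SHADOW_TILEMAP_RES = 32  # illustrative value


def wrapped_tile(tile_co, grid_shift):
    """Wrap a shifted tile coordinate into [0, SHADOW_TILEMAP_RES).
    Clamping the shift first guarantees that a single +RES bias is
    enough to make the modulo operand non-negative."""
    shift = max(-SHADOW_TILEMAP_RES, min(SHADOW_TILEMAP_RES, grid_shift))
    tile_shifted = tile_co + shift
    # Bias into positive range before the modulo: in GLSL/C, % on a
    # negative left operand does not wrap the way Python's does.
    return (SHADOW_TILEMAP_RES + tile_shifted) % SHADOW_TILEMAP_RES
```

Shift magnitudes beyond one full tilemap don't matter because the wrap is periodic, which is why clamping loses nothing.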
@ -0,0 +94,4 @@
/* NOTE: To avoid redundant writes, we still enable the depth test and configure the accumulation
* pass quad depth such that only the fragments updated during the surface depth pass will run.
*
* We use `DRW_STATE_DEPTH_GREATER_EQUAL` such that we only run the fragment shader if the
I guess you can do that because you kept the clear dispatch. I'm pretty sure it's more efficient to just copy the whole tile and not clear the atlas.
Yeah, following the later update, clearing the full tile, writing it out in full, and removing the compute-based clear pass is indeed quite a bit faster.
In the updated version, I've added a comment distinguishing the two options, as it may be important to have contextual information if anything ever needs changing; i.e. knowing that if the compute clear pass ever needs to be used again, the depth flag can be changed back to skip writes that are not needed.
@ -159,0 +214,4 @@
inst_.sampling.bind_resources(&tbdr_page_store_ps_);
inst_.bind_uniform_data(&tbdr_page_store_ps_);
tbdr_page_store_ps_.draw_procedural(
GPU_PRIM_TRIS, (SHADOW_TILEMAP_RES) * (SHADOW_TILEMAP_RES)*SHADOW_VIEW_MAX, 6);
SQUARE(SHADOW_TILEMAP_RES). That should avoid confusing clang-format.
@ -53,2 +55,4 @@
{0, 0, 0.5, 1}};
/* Technique used for updating the virtual shadow map contents. */
enum eShadowUpdateTechnique {
Remove the e prefix: ShadowUpdateTechnique
@ -148,2 +158,4 @@
}
#ifdef GPU_METAL
/* NOTE: Metal can support multiple shadow atlas formats. E.g. U32, F32. */
Create a define for the type, ShadowAtlasSamplerType, which changes type. This avoids duplicating all the function declarations.
@ -0,0 +1,64 @@
Rename the file to eevee_shadow_page_tile_frag/vert. This path will eventually work on all hardware, even if not efficiently.
@ -0,0 +3,4 @@
* Virtual Shadow map accumulation storage.
*
* On Apple silicon, we can use a three-pass method to perform virtual shadow map updates,
* leveraging efficient use of tile-based GPUs.Shadow updates rasterize geometry for each view in
Typo
GPUs.Shadow
@ -0,0 +57,4 @@
shadow_atlas_img.texture->write(u_depth, ushort2(out_texel_xy), out_page_z);
# endif
# ifdef SHADOW_ATLAS_F32
About this F32 option, I'm wondering if it wouldn't be good to have a U32 view and a F32 view of the atlas and always use the float one for sampling the shadow map. For rendering we could use the best format depending on the update technique.
Would that have any impact on performance because of texture layout?
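For context on why a U32 view of float depth can work at all (my illustration, not part of the patch): non-negative IEEE-754 floats keep their ordering when their bit patterns are compared as unsigned integers, which is what makes an integer atomicMin over reinterpreted depth values valid.

```python
import struct


def f32_bits(x):
    """Reinterpret a 32-bit float as its raw unsigned integer bits."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(u):
    """Reinterpret unsigned integer bits back into a 32-bit float."""
    return struct.unpack("<f", struct.pack("<I", u))[0]
```

For any depths 0.0 <= a < b, f32_bits(a) < f32_bits(b) holds, so a min over the U32 view selects the same texel a float min would; the round trip through bits is lossless.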
@ -21,6 +21,12 @@ shared int global_max;
void main()
{
uint index = gl_GlobalInvocationID.x;
The same block of code is inside the resource_len > 0 block.
@ -99,3 +99,3 @@
/* Issue one view if there is an update in the LOD. */
if (all(equal(gl_LocalInvocationID, uvec3(0)))) {
bool lod_has_update = rect_min.x < rect_max.x;
bool lod_has_update = rect_min_x < rect_max_x;
Why is this needed? rect_min and rect_max are defined above.
@ -665,6 +665,64 @@ void DrawMultiBuf::bind(RecordingState &state,
GPU_debug_group_end();
}
void DrawMultiBuf::bind_no_comp(RecordingState &state,
@fclem I had added this function (and its use by the shadows) as a placeholder, but wanted to get your thoughts on how best to handle this scenario.
The intent of bind_no_comp is to set up the resource bindings for the subsequent draws without re-invoking the visibility or draw_command_generate_* compute phase, such that we can execute the three separate passes of the TBDR optimization separately. The same goes for the other additions in this sub-folder:
and the actual implementation in the shadow engine:
It feels like there is probably a cleaner way of doing this without introducing any risks from code duplication, although I had not invested time into this before wanting to get an opinion. One option could be to add a function parameter or template to DrawMultiBuf::bind with a boolean switch on generating draw commands. For the other functions, it could be reasonable to refactor to eliminate the duplication. The one issue I was not super happy with was that the split functions required passing the command::RecordingState outside of the function, as the latter submission required visibility of this. Thanks for any notes!
For some reason I thought I had written about this in my previous review, but my comment got lost somehow.
So I think this is the wrong approach. All of these 3 passes should be inside the same PassMain, and you should just use a different draw::PassMain::Sub for each of the clear/render/store passes. That way there shouldn't be a need for any of this duplicated logic.
Changed title from WIP: Metal: EEVEE Next: Optimize Virtual shadow maps for Apple Silicon to Metal: EEVEE Next: Optimize Virtual shadow maps for Apple Silicon
We are getting closer! 😄
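The single-PassMain structure suggested above can be sketched with hypothetical stand-in classes (my toy model, not the actual draw module API): one main pass owns resource binding and command generation, and each stage is just a sub-pass recorded within it.

```python
class SubPass:
    """Hypothetical stand-in for draw::PassMain::Sub."""

    def __init__(self, name):
        self.name, self.cmds = name, []

    def draw(self, what):
        self.cmds.append(what)


class MainPass:
    """Hypothetical stand-in for draw::PassMain: resource binding and
    draw-command generation happen once, shared by all sub-passes."""

    def __init__(self, name):
        self.name, self.subs = name, []

    def sub(self, name):
        s = SubPass(name)
        self.subs.append(s)
        return s

    def submit(self):
        # Visibility + draw command generation run once, not per stage.
        log = [f"{self.name}: bind + generate commands"]
        for s in self.subs:  # sub-passes replay in recorded order
            log.extend(f"{s.name}: draw {c}" for c in s.cmds)
        return log


shadow = MainPass("Shadow.Surface")
shadow.sub("tile_clear").draw("clear quads")
shadow.sub("render").draw("shadow geometry")
shadow.sub("page_write").draw("store quads")
```

Since the sub-passes share the parent's bindings and generated commands, none of the bind_no_comp-style duplication is needed.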
I just merged the GPU_ARCHITECTURE changes to main. Please merge these changes first before we can merge this patch.
@ -18,2 +18,3 @@
/* Clear to FLT_MAX instead of 1 so the far plane doesn't cast shadows onto farther objects. */
/* Clear to FLT_MAX instead of 1 so the far plane doesn't cast shadows onto farther objects. */
#ifdef GPU_METAL
No need for the define since it is aliased for GLSL.
@ -479,2 +479,4 @@
Vector<FragOut> fragment_outputs_;
using FragTileIn = FragOut;
Vector<FragTileIn> fragment_tile_inputs_;
These have been replaced by SubpassIn, which is the same but with the Vulkan denomination. They should just be implemented as ROG on Apple silicon. So remove fragment_tile_inputs_.
@ -130,6 +130,8 @@ void MTLFrameBuffer::bind(bool enabled_srgb)
Shader::set_framebuffer_srgb_target(enabled_srgb && srgb_);
}
this->mark_dirty();
I guess this was only for testing. Shouldn't it be removed?
@ -22,3 +22,3 @@
#include "gpu_py_shader.h" /* own include */
//#define USE_PYGPU_SHADER_INFO_IMAGE_METHOD
// #define USE_PYGPU_SHADER_INFO_IMAGE_METHOD
Unrelated change
I went the extra mile to clean up and implement all the TODOs.
Now both implementations are supported on both OpenGL and Metal. To avoid too much confusion, I tried to make the names more platform agnostic.
The last quirk that I'm not sure how to deal with is the static ShadowModule::shadow_technique, which prevents runtime switching between techniques (which could be beneficial in some cases). Also, reducing the number of layers in the framebuffer doesn't seem to work correctly: some shadow tiles will fail to update and never update. This happens with both methods and on all platforms, and is not related to this patch.
I'm really happy with the result, both in terms of performance and in code quality.
I would just like for Michael to glance over my changes and to see if they all make sense.
Otherwise I think it's good to merge.
@blender-bot build
Have run the latest changes through a number of scenes and both paths appear to be functioning correctly. Performance is about where it should be, with some improvements due to the cleanup for indirect tile dispatch.
I think this is now in a good state for the initial merge. The implementation features discussed such as compute perf tuning and floating point shadow atlas are things we can address in later patches.
LGTM.