Compositor improvement plan #74491

Closed
opened 2020-03-06 12:46:11 +01:00 by Sergey Sharybin · 48 comments

Overview

This is an initial pass at formalizing a proposal for improving the performance and user experience of the Compositor.

There are three aspects to address:

  • Ease of use
  • Performance
  • Memory consumption

Ease of use

Currently there are many settings which need to be set and tweaked to get the best performance: tile size, OpenCL, buffer use and so on. The settings also have some implicit dependencies under the hood: for example, OpenCL needs a big tile size, but a big tile size might make some nodes measurably slower, and a bigger tile size "breaks" the initial intention of the compositor design, which was to show tiles appearing as quickly as possible.

For OpenCL it's also not clear when it is actually being engaged: it will just fail silently, falling back to the CPU and giving a false impression of GPU-accelerated compute.
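One way to avoid the silent fallback described above is to have device selection return the reason together with the choice, so the UI can show it. The following is only a minimal sketch: `DeviceChoice`, `select_device` and the `gpu_probe` callback are hypothetical names made up for illustration, not Blender API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DeviceChoice:
    device: str  # "GPU" or "CPU"
    fallback_reason: Optional[str] = None  # set only when a GPU request was not honoured


def select_device(want_gpu: bool, gpu_probe: Callable[[], None]) -> DeviceChoice:
    """Pick a compute device and record why a GPU request was not honoured.

    gpu_probe is a hypothetical callback that raises RuntimeError when the
    GPU backend cannot be used.
    """
    if not want_gpu:
        return DeviceChoice("CPU")
    try:
        gpu_probe()
    except RuntimeError as err:
        # Fall back, but keep the reason so it can be reported to the user.
        return DeviceChoice("CPU", fallback_reason=str(err))
    return DeviceChoice("GPU")


def broken_probe() -> None:
    # Stand-in for a backend that fails to initialize.
    raise RuntimeError("kernel compilation failed")


choice = select_device(want_gpu=True, gpu_probe=broken_probe)
print(choice.device, "-", choice.fallback_reason)
```

The point of the design is that the caller cannot get a CPU result for a GPU request without also receiving the reason.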

Performance

Performance of the compositor is not up to date. This is partially due to its scheduler design, and partially due to the technical implementation, which uses a per-pixel virtual call (which ruins all sorts of coherency).
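To illustrate the difference in call structure, here is a small Python sketch. The compositor itself is not written in Python, and `Brighten`, `execute_pixel` and `execute_buffer` are hypothetical names; the sketch only contrasts one dispatched call per pixel with one call per buffer.

```python
from typing import List


class Brighten:
    """Hypothetical operation that adds a constant to every value."""

    def __init__(self, amount: float) -> None:
        self.amount = amount

    def execute_pixel(self, value: float) -> float:
        # Per-pixel style: one dynamically dispatched call for every pixel.
        return value + self.amount

    def execute_buffer(self, values: List[float]) -> List[float]:
        # Whole-buffer style: one call, then a tight loop over contiguous data.
        amount = self.amount
        return [v + amount for v in values]


def run_per_pixel(op: Brighten, values: List[float]) -> List[float]:
    # The dispatch happens len(values) times.
    return [op.execute_pixel(v) for v in values]


def run_whole_buffer(op: Brighten, values: List[float]) -> List[float]:
    # The dispatch happens once per buffer.
    return op.execute_buffer(values)


op = Brighten(0.5)
print(run_per_pixel(op, [0.0, 0.25, 0.5]))
print(run_whole_buffer(op, [0.0, 0.25, 0.5]))
```

Both functions compute the same result; only the number of dispatched calls differs, which is also what makes the whole-buffer form easier to vectorize.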

Memory usage

This is something that is absolutely out of the artist's control: some operations require memory buffers before/after the node, and those are created automatically. This makes it hard to predict how much memory a node setup requires, and how much extra memory is needed when increasing the final resolution.
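As a rough way to see how the cost of one such buffer scales with resolution, the arithmetic can be sketched as below. This assumes a buffer of four 32-bit float channels per pixel; that layout is an assumption made for the example, and `buffer_bytes` is a hypothetical helper.

```python
def buffer_bytes(width: int, height: int, channels: int = 4, bytes_per_channel: int = 4) -> int:
    """Size of one full-frame buffer in bytes (assumed RGBA, 32-bit float)."""
    return width * height * channels * bytes_per_channel


MIB = 1024 * 1024

full_hd = buffer_bytes(1920, 1080)
uhd_8k = buffer_bytes(7680, 4320)

print(f"Full HD: {full_hd / MIB:.2f} MiB per buffer")
print(f"8K:      {uhd_8k / MIB:.2f} MiB per buffer")
print(f"Ratio:   {uhd_8k // full_hd}x")
```

Under these assumptions, every implicit buffer costs 16 times more at 8K than at Full HD, so the total grows with both the resolution and the number of buffers kept alive.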

Solution

The proposed end goal is: deliver the final image as fast as possible.

This is different from being tile-based, where the goal was to have the first tiles appear as quickly as possible, giving gradual updates. The downside of the tile-based approach is that the overall frame time is higher than when delivering an entire frame at once. Additionally, the tile-based nature complicates the task scheduler a lot, and makes it necessary to keep track of more memory at a time.

It should be possible to transform the current design into the proposed one in incremental steps:

  • [Temporarily] Remove the code which unnecessarily complicates the scheduler and memory manager, namely GPU support.
  • Convert all operations to operate in a relative space rather than pixel space (basically, make it possible to change the final render resolution without changing the compositor network setup).
  • Make the compositor operate at a final resolution which closely matches the resolution of the current "viewer": there is no need to do full 8K compositing when the final result is viewed as a tiny backdrop on a Full HD monitor.
  • Modify operations to operate on an entire frame (or on a given area).
  • Modify the scheduler to do bottom-to-top scheduling, operating on the entire image.
  • Modify the memory manager to allocate buffers once they are needed and discard them as soon as possible.
  • Vectorize (SIMD) all operations where possible.
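The scheduler and memory manager steps above can be sketched together. The following is one possible reading, not the planned implementation: every node produces a full-frame buffer, nodes run in dependency order starting from what the output needs, and a buffer is released as soon as its last consumer has run. The `Graph` layout and the `evaluate` function are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# node name -> (input node names, function computing a full-frame buffer)
Graph = Dict[str, Tuple[List[str], Callable[..., List[float]]]]


def evaluate(graph: Graph, output: str) -> Tuple[List[float], int]:
    """Evaluate `output`; return its buffer and the peak number of live buffers."""
    # Depth-first post-order from the output: every node comes after its inputs,
    # and nodes the output does not depend on are never visited.
    order: List[str] = []
    seen = set()

    def visit(name: str) -> None:
        if name in seen:
            return
        seen.add(name)
        for dep in graph[name][0]:
            visit(dep)
        order.append(name)

    visit(output)

    # Count how many consumers still need each buffer.
    pending = {name: 0 for name in order}
    for name in order:
        for dep in graph[name][0]:
            pending[dep] += 1

    buffers: Dict[str, List[float]] = {}
    peak = 0
    for name in order:
        inputs, func = graph[name]
        buffers[name] = func(*(buffers[dep] for dep in inputs))  # allocate on demand
        peak = max(peak, len(buffers))
        for dep in inputs:
            pending[dep] -= 1
            if pending[dep] == 0:
                del buffers[dep]  # last consumer has run: release immediately
    return buffers[output], peak


graph: Graph = {
    "image": ([], lambda: [1.0, 2.0, 3.0]),
    "gain": (["image"], lambda a: [2 * v for v in a]),
    "offset": (["gain"], lambda a: [v + 1 for v in a]),
    "mix": (["image", "offset"], lambda a, b: [(x + y) / 2 for x, y in zip(a, b)]),
}

result, peak = evaluate(graph, "mix")
print(result, "peak live buffers:", peak)
```

In this four-node example at most three buffers are alive at once, because the "gain" buffer is dropped as soon as "offset" has consumed it.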

Look into GPU support with the following requirements:

  • Minimize memory throughput, which implies the following point.
  • Have all operations implemented on the GPU, which again implies the following point.
  • Share the implementation between CPU and GPU as much as possible.

The steps can be gradual and formulated well enough to happen as part of the code quality days in #73586.

Author
Owner

Changed status from 'Needs Triage' to: 'Confirmed'

Author
Owner

Added subscribers: @Sergey, @Jeroen-Bakker, @brecht

Member

Added subscriber: @LazyDodo

Member

This is going to be a controversial suggestion given that it's a chunky new dependency, however... Halide (https://halide-lang.org/) is a DSL designed for exactly this kind of problem (it takes into account locality, parallelism, vectorization, etc.). It's mature (8+ years old, still actively maintained, and in use by both Google and Adobe), and it supports CPU/GPU/SIMD/Threading/Metal/OpenGL out of the box. It is based on LLVM, which generally has a high startup cost, but they mitigated that by allowing the kernels to be pre-built at compile time (rather than at run time). It also has a Python API, so you can write kernels in Python (still with all the hardware support mentioned earlier), which could be good for us add-on-wise (but you're back to a small run-time cost at that point).

If anything, watch the video at the bottom of their home page to see what it is about. I feel this is a really good match for the compositor and should be considered.
Author
Owner

@LazyDodo Thanks for pointing it out. I am aware of Halide and was considering it for the tracker/compositor for a while. Nowadays I'm not that sold on it. With the function nodes it feels like we can fill in the missing gaps and achieve the same level of functionality with building blocks which are native to Blender. But we'll see.

Member

I have experimented with it in the past. Where it really shines is how easily you can change a schedule (add or change data layout/threading/caching/vectorization) or go, "now run it on the GPU! now do CUDA! no, OpenCL! use DX12!" by changing a line or two of code. It is by far the best performing code I have written with the least amount of effort.

However, not everything they offer is a good match for us. Fusion across the whole node graph, while theoretically awesome, will just never work for us: the run-time cost to schedule and JIT them is just too high (once done, the perf is great, however spending several seconds in schedule and JIT so you can run a graph in under 3 ms just ruins any gains to be had). If we use it, it would be best to treat every node as its own block and optimize that at build time, rather than trying to run-time optimize the graph the user puts together.

Function nodes are still too 'unknown' for me to make a call on. I think there are pros and cons for both solutions, however gaining a dep as large as Halide is definitely hanging out on the con side of things.

Added subscriber: @frameshift

Added subscriber: @BintangPratama

Member

Added subscriber: @EAW

Added subscriber: @DarkKnight

Added subscriber: @MD.FahadHassan

If possible:
Update only the nodes that have been changed and the nodes connected later in the chain, avoiding the upstream.
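The set of nodes such an update would touch can be found with a simple downstream traversal. This is only a sketch: `nodes_to_update` and the `links` mapping are hypothetical, and skipping the upstream nodes only works if their outputs are still available somewhere.

```python
from collections import deque
from typing import Dict, List, Set


def nodes_to_update(links: Dict[str, List[str]], changed: str) -> Set[str]:
    """Return `changed` plus every node reachable downstream of it.

    `links` maps each node to the nodes that consume its output.
    """
    dirty = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for consumer in links.get(node, []):
            if consumer not in dirty:
                dirty.add(consumer)
                queue.append(consumer)
    return dirty


links = {
    "render_layers": ["denoise"],
    "denoise": ["blur", "mix"],
    "blur": ["mix"],
    "mix": ["composite"],
}

print(sorted(nodes_to_update(links, "blur")))
```

Changing "blur" marks only "blur", "mix" and "composite"; "render_layers" and "denoise" are left alone.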

Member

In #74491#888273, @MD.FahadHassan wrote:
If Possible:
Updating only the nodes that has been changed and later connected nodes in the chain, avoiding the upstream.

That assumes you keep the output buffers of each of the nodes, which, if you are compositing at high resolution or have a massive number of nodes, gets expensive memory-wise real fast.

In #74491#889620, @LazyDodo wrote:
That assumes you keep the output buffers of each of the nodes, which if you are compositing at high resolution or you have a massive amount of nodes gets expensive memory wise real fast.

That's true as well.

Added subscriber: @Pipeliner

Added subscriber: @laurelkeys

Added subscriber: @3di

To avoid keeping the output buffers of each node, perhaps a cache node would be a good first step, so the user could decide which points in the tree to cache (after a denoise node, for example). This would be similar to the Houdini approach. Additionally, there could be a freeze toggle on each node, in case the user doesn't want to use the output of a file cache node in a different branch (or in a different compositor, if multiple compositors become possible with a compositor node). This way anything below the file cache node would not need to be recalculated each time an upstream parameter changes.

{F8485301}

Added subscriber: @BartekMoniewski

Added subscriber: @JasonClarke

Member

Added subscriber: @SeanKennedy

Added subscriber: @MrJoMo

Added subscriber: @MichaelHermann

Added subscriber: @Mantissa

Added subscriber: @bent

Added subscriber: @semimetallic

Added subscriber: @monique

Regarding the first bullet point in this list of improvements, relative space, I've come up with a possible solution: https://devtalk.blender.org/t/compositor-improvement-plan-relative-space/14874

Your input would be much appreciated.
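For readers unfamiliar with the term, the general idea of relative space can be sketched as follows. This is not taken from the linked proposal: it assumes sizes are stored as a fraction of the image width, and `to_relative` and `to_pixels` are hypothetical helpers.

```python
def to_relative(pixels: float, reference_width: int) -> float:
    """Express a pixel size as a fraction of the image width it was authored for."""
    return pixels / reference_width


def to_pixels(relative: float, width: int) -> float:
    """Resolve a relative size back to pixels for the resolution being rendered."""
    return relative * width


# A blur radius of 15 px authored on a 1920 px wide frame.
radius = to_relative(15, 1920)

for width in (960, 1920, 3840):
    print(width, "->", to_pixels(radius, width), "px")
```

With the size stored this way, the same node setup gives a visually equivalent blur whether it is evaluated for a small preview or for the final render.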

It'd be nice if nodes could have gizmos on the preview when they are active. I'm mostly thinking about a transform gizmo for the mask and transform nodes, but a point gizmo would be nice too, for example for the sun glare node.
If OFX plugins were to be supported in the VSE and/or in the Compositor, gizmos would be needed too.

My suggestion would be:

For Viewer:

  • Discard the backdrop first. It's a cluttered feature and probably will not work with relative space.
  • Instead, create a new dedicated viewer, or make the Movie Editor the standard viewer for compositing. It should have:
    • Masking system (taken from the existing Movie Editor)
    • A timeline and cache system (possibly taken from the Movie Editor)
    • Trackers (taken from the Movie Editor)
    • Relative space render buffer (which is used by Natron as well, and is efficient for the playback memory cache too)
    • A switch to render at full pixel resolution as well, because of anti-aliasing on things like hair strands or text, and GPU-based nodes
    • A color-space switcher (Linear/sRGB/REC709/Filmic)

For Nodes:

  • Make all nodes relative-space aware
  • Focus on the Properties panel more than on node channels
  • Bring procedural textures into the compositor from the texture editor
  • Provide a GLSL authoring node like this one (https://github.com/bitsawer/blender-custom-nodes). This is the fastest possible node in the entire Blender compositor after input nodes. And since GLSL is just like a scripting language, the growth of scripts from the community will be massive. If anything is not in Blender, there will be.
  • A Python-script-based node would also be cool for a future AI-based compositing industry built with PyTorch, NumPy and OpenCV. (https://github.com/bitsawer/blender-custom-nodes)
  • A Gizmo API is needed for nodes, especially Transform and screen-dependent nodes. A simple 2D transform gizmo and a single point gizmo are probably more than enough.

For Node Cache system:

  • After doing some pipeline dev, I emulated a cache system with a "File node": I plugged it in after denoising and used the denoised image files for later operations. Similarly, after doing some heavy blur I plugged in another file cache.
  • Instead of a RAM cache for nodes, can we use a disk cache? This way the viewer will get all the RAM for playback. If so, a Natron-like "value change aware cache" system can be put in. (Updating only the nodes that have been changed and the nodes connected later in the chain, avoiding the upstream.)

Hope it helps. Thanks.

Author
Owner

@monique, do you mind moving the design/code proposal from the devtalk.b.o to the developer.b.o as a subtask of this one? Will make it easier to keep track of it.

Added subscriber: @AndreaMonzini

Hello, just to let you know (in case you don't already) that there is an experimental and unofficial Blender branch by Manuel Castilla that aims to improve the performance of the compositor.

Repository:
https://github.com/m-castilla/blender/tree/compositor-up

Blender Chat channel:
https://blender.chat/channel/compositor-up

Added subscriber: @DanielVesterbaekJensen

Added subscriber: @SirPigeonz

Member

Added subscriber: @Wahooney

Member

I don't know if this has been proposed, but moving Compositor data to a data-block structure would have benefits, including moving comps easily between files/projects, switching out comps in batch renders (my current use case), allowing scratch pads for comps, and probably more that I can't think of.

Added subscriber: @ChristophWerner

Added subscriber: @Aeraglyx

Hi @Jeroen-Bakker,

Looks like some work might be happening on the compositor (exciting!). I started collecting some ideas in a DevTalk thread a while ago, but unfortunately got busy and mostly abandoned it. It might be worth checking it out again.

https://devtalk.blender.org/t/compositor-improvements/13264

Good luck with the improvements and, as always, thank you for the hard work!

Added subscriber: @Tasch

Added subscriber: @Robonnet

Added subscriber: @ParallelMayhem

Added subscriber: @Low_Polygon42

Added subscriber: @Nurb2Kea

Added subscriber: @Emi_Martinez

Contributor

Added subscriber: @ok_what

Philipp Oeser removed the Interest: VFX & Video label 2023-02-10 09:32:03 +01:00
Member

I'm going to take the liberty of assuming this task is redundant after the work on the realtime/GPU compositor and the new CPU compositor work #125968.

Blender Bot added the Status: Archived label and removed the Status: Confirmed label 2024-11-07 22:42:53 +01:00
Reference: blender/blender#74491