New compositor system proposal #81650
Open
opened 2020-10-12 14:55:20 +02:00 by Manuel Castilla
41 comments
Reference: blender/blender#81650
This is an idea/proposal for a new compositor system implementation. I've been implementing it in a branch called compositor-up. I still consider it experimental. My intention is to keep working on it; I'm not trying to push this into Blender now, but rather with the perspective of it being integrated in the future. I'm aware it's hard for this to be accepted, as I've done it on a whim without consulting Blender developers, and if accepted a lot of requirements are going to come. I'm willing to meet any requirements once I've explained the reasoning behind what I did, so everyone can decide better.
I'd like to clarify that I haven't taken this idea or solution from anywhere. For better or worse, this solution comes from my own experience developing software in general and Blender itself, and I'm still using some ideas and code from the current compositor and Cycles.
Main Goals
Improve overall performance while being efficient in memory usage (for caching, the user can choose how much RAM to use).
Be able to cache at any point of the node tree and uniquely identify the cached image by the node tree state it was saved under, so that if the node tree changes and later returns to the state it had when the cache was saved, the cache can be retrieved.
Make it easier to implement image algorithms, and provide a way to write code that is compatible with both CPU and GPU execution, abstracting the computing system used for the GPU.
General Overview
This is a class diagram of the main classes

Buffering, writing and reading pixels
Execution phases
The compositor tree execution is done in two phases:
Using hashes as part of operations tree state IDs (OpKeys)
Using hashes as IDs is generally a bad idea because of hash collisions. Still, this whole system is based on that idea, because the benefits it brings in this scenario far outweigh the downside of accepting a minimal chance of an "OpKey collision" (OpKeys contain hashes) that would result in a bad render. An OpKey contains a representation of an operation tree state, taking into account all the parameters and operation types involved in producing the operation's image result. All operations generate their own OpKey on compositor execution initialization.
Benefits of using hashes for building OpKeys:
They're regenerated automatically on every compositor execution, and if nothing has changed with respect to the previous execution they will be the same. There is no need to save them between executions.
Being able to cache at any point of the tree without having to implement UI logic that tells the compositor system "I have modified this part of the tree, so you should invalidate the caches ahead of it". Nor do we have to save the previous node tree graph with all its parameters and check it against the new graph to detect whether and where something changed.
If two or more operations have exactly the same tree with the same parameters from their point back to the root input (the same node tree state), they will be written only once, no matter that they are in different branches.
It's an easy way to uniquely identify operation results within and between compositor executions. For example, to save disk cache files I can just base-encode all the OpKey field values into the filename; on a cache request, an OpKey is given, and the cache filename can be derived from it without having to maintain an index.
As caches are uniquely identified by values that represent the operation's node tree state at the moment of saving, they never need to be invalidated (deleted) because of a change in the tree. If the tree ever returns to the same state it had when the cache was saved, the cache will be retrieved, regardless of the changes in between. The following video shows the working implementation; keep an eye on the progress bar:
cache_showcase.mp4
These are the fields of an OpKey:

Every operation has an OpKey which uniquely identifies the operation's node tree state within and between compositor executions. For an OpKey collision to happen, all OpKey fields must be the same for two operations that should render different image results (because they have different node tree states). This is different from two operations in different node tree branches that have the same node tree state and produce the same image result; in that case, the two operations having the same key is exactly what should happen, and the image result for both operations will be written only once. The following video demonstrates this last case in the working implementation (when duplicating the nodes with the same parameters, rendering would take more time if they were written twice):
hashes_write_once.mp4
For the system to update correctly on any change, every operation must implement the "hashParams" method and call "hashParam(param)" in it for every parameter that, if changed, would change the operation's image result. Otherwise the system won't update when that unhashed parameter changes, because output views are cached and would have the same OpKey as in the previous execution. So you'd easily notice it sooner or later.
OpKey collisions probabilities
I've found a website that shows a simple table of hash collision probabilities, assuming the hash method is very good:
https://preshing.com/20110504/hash-collision-probabilities/#small-collision-probabilities
Let's say OpKey had only the "op_tree_state_hash" field, which is what really represents the operation tree state (the other hashes are added to make an OpKey collision much harder):
Example 1: No cache nodes. A compositor tree with 950 nodes and an average of 2 operations per node = 1900 OpKeys. The closest value in the table for 64 bits is 1921 hashes -> 1 bad render per 10 trillion renders (short scale; 10 billion on the long scale).
Example 2: With cache nodes and disk cache. A compositor tree with 950 nodes and an average of 2 operations per node, plus 600 thousand disk cache files = 601900 OpKeys. The closest value in the table for 64 bits is 607401 hashes -> 1 bad render per 100 million renders.
But for an OpKey collision to actually happen, the other hashes would have to be the same too, so the chances of a collision are much, much lower.
Let's consider the worst-case scenario: the user has gotten a bad render because of an OpKey collision. If they notice the bad render, they would probably select the frame where it occurred and try to re-render it. Some inputs automatically produce different hashes, as the RenderLayers node does because it re-renders the scene, and that alone would fix the frame. If it doesn't, tweaking anything in the compositor tree would fix it. Even without knowing why it happened, it's very likely the user would try these things.
Implementing image algorithms in operations
Implementing compatible CPU/GPU code for computing systems (OpenCL right now)
As you may have seen in the operations' kernel code, it uses some macros. They are needed to abstract between CPU (C++) code and the computing system code (OpenCL). They also help simplify the implementation, as you always work with image coordinates and don't need to know about buffer offsets or paddings. Another reason is that the image being read may come from a SingleElemOperation and contain a single pixel; the macros are aware of this and always just read that pixel. Thanks to these abstractions, so far the code is 100% shared between C++ and OpenCL for operations that support computing. Taking into account the number of operations being implemented, this helps maintainability a lot.
For the kernel abstractions I took code from Cycles. I've been modifying a lot of it as I saw what the compositor needed, so even if it seems I'm duplicating a lot of code, it's not that much.
As everything is heavily abstracted, it should be possible to add other computing systems such as CUDA in the future without much work, but I don't think that's a priority.
I wrote a tool called "defmerger", similar to "datatoc", so that I can write each kernel in the same .cpp file as its operation. I just surround the kernel code with "#define OPENCL_CODE" and "#undef OPENCL_CODE". Later, defmerger looks through all the operation files for the code between OPENCL_CODE tags and merges it into a single file or string. The header "#include "COM_kernel_opencl.h"" is added, and a method I took from Cycles resolves all the includes and preprocessing so the result is ready to be read by OpenCL. The tool also allows writing code specific to either C++ or OpenCL inside kernel code, but I never needed that and it should be avoided. It would be like this:
Vectorizing
There is now the possibility of using vectors. The vector type implementation is taken from Cycles, with modifications and additions where needed. I've been vectorizing operation code where possible; besides improving performance, it often makes the code simpler and more readable.
Sampling
Now any kind of sampling is always done over the result of the operation being read. In the current Blender implementation, because not all operations are buffered, sampling is done over the last buffered operation, which may or may not be the operation being read. This affects the output very little, but it probably does slightly. Sampling an operation behind the one you actually want to sample, and then executing the read operation's algorithm over that, is not desirable in my opinion. And the behavior wasn't consistent; it depended on which operations were buffered. Now everything behaves consistently, as all operations are buffered.
But because all operations are now buffered, simple distort operations that in the current Blender implementation need sampling but are not buffered (scale and rotate only) cannot be chained one after another without sampling being applied for each one. So every time you use a scale or rotate node, the operation takes effect immediately and is sampled. Depending on your point of view this may be a disadvantage or an advantage, but I think a new or non-expert user would always expect the operation to take effect immediately. It can be very confusing to scale down and then scale up and get exactly the same image quality; if you insert non-buffered nodes in between, it stays the same, but if you insert a buffered node the image suddenly appears pixelated (and of course the user doesn't know which nodes are buffered, nor what a buffer is). Some users may understand the behaviour, but most of the time it's confusing for the average user.
I implemented the Transform node as a single operation, so that rotate and scale can be combined and sampling is done only once -> TransformOperation.
Changes / New features
Removed options from UI
Performance tests
The tests are done with a 1920 x 1080 image.
Operating system: Windows-10-10.0.18362-SP0 64 Bits
CPU: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Graphics card: GeForce GTX 1060 6GB/PCIe/SSE2 NVIDIA Corporation 4.5.0 NVIDIA 456.38
Test 1 - A bunch of nodes that are not currently buffered in the Blender compositor, but all of which are buffered in compositor-up. With CPU:
test1.mp4
test1.blend
Blender 2.91 master: 1.15 seconds / 221.62 MB (peak memory)
Blender 2.91 compositor-up: 0.36 seconds / 309.31 MB (peak memory)
Even though the current Blender implementation buffers no operations other than inputs and outputs, while compositor-up buffers all operations, there is no big difference in memory. That's because buffer recycling is actually working; otherwise peak memory would be much worse. No matter how many nodes you add linearly, the buffers are recycled all the time. Peak memory depends more on how much branching there is from operation outputs, because the BufferManager has to keep a buffer until all branches have finished reading the operation result. It has to keep it longer, so it may need more memory.
Test 2 - A few nodes, most of which are buffered in the current Blender compositor; these nodes have more complex operations. With CPU:
test2.mp4
test2.blend
Result:
Blender 2.91 master: 1.66 seconds / 724.65 MB (peak memory)
Blender 2.91 compositor-up: 0.72 seconds / 305.82 MB (peak memory)
In compositor-up, memory usage is about the same as in test 1 even for complex operations, as buffers are recycled in the same way. The current Blender implementation keeps the buffers from the start to the end of the execution, so the more nodes with complex operations you add, the more memory it consumes.
Test 3 - Same as test 2 but with OpenCL:
test3.mp4
test3.blend
Result:
Blender 2.91 master: 16.45 seconds / 763.45 MB (peak memory)
Blender 2.91 compositor-up: 0.21 seconds / 153.97 MB (peak memory)
About the time: I know very few nodes are implemented with OpenCL in the current Blender implementation, but there is clearly something wrong. It may happen when mixing nodes that can be executed with OpenCL and others that can't. Bokeh blur I'm sure is implemented with OpenCL, and I think it works fine if you use only that node.
About the peak RAM in compositor-up: it's low because the system now mostly uses and recycles GPU memory, in the same way it previously did for RAM, so it needs much less RAM.
In compositor-up, every time you enable OpenCL the first render takes some time because the OpenCL program has to be loaded, but from there it should go smoothly. It's on my TODO list to compile it and save it in a cache.
As you may have noticed, the image results are slightly different; I'm pretty sure this is related to sampling. In test 2 too, but even less noticeably. As I've already explained, in the current Blender implementation sampling is done over the last buffered operation, which may or may not be the last operation, while in compositor-up it's always the last operation. That could be part of it, but in test 3 there's another factor: I'm using native OpenCL sampling, and the algorithm or precision may be slightly different. I may just always use Blender's sampling implementation for both CPU and GPU in the future to avoid differences. My intention is to do proper render tests per node, compare with the current Blender implementation, and fix differences where possible.
Important TODOs:
GitHub repository: https://github.com/m-castilla/blender-compositor-up
Binaries: https://github.com/m-castilla/blender-compositor-up/releases
Linking as it may be relevant:
https://developer.blender.org/T74491
Thanks for such a detailed presentation of your proposal.
There will be some points to discuss/agree on :)
I think it does make sense to go back to the entire-image-as-fast-as-possible approach. It simplifies a lot of things for both users and developers. On a seemingly related topic, one thing we have always been looking forward to is becoming more canvas-based. One example of that is being able to undistort an image, perform modifications, distort it back, and not have any black areas around the "barrel". Is this something that is still possible with the buffer design you've described?
On the optimization step, you're talking about the number of reads. Is it the number of read images, or the number of read pixels? I'm actually a bit lost in this step. Do you have some simple example to visualize how it works?
The cache keys, I'm not sure are the right direction here. I think we already have all the bits needed for a fully reliable system: we have the SessionUUID concept, we have update tagging and the dependency graph. I think a reliable cache system can work the following way:
The GPU support sounds interesting, but in practice it might easily become more of a maintenance burden. I'm not sure it should be a priority from day 0, but in the longer term I think GPU acceleration is interesting.
One thing that catches my attention in the design is the need for the defmerger tool, and the fact that host and device code are so interleaved. It should be possible to have a separation level between host and device code, avoid the need for extra tools, and keep the code nice and readable.
Another thing I'm not sure about is how the operation buffers are managed for the GPU. You wouldn't want data transfers back and forth for every node; you'd want to keep everything on the GPU as much as possible, ideally with the entire process happening on the GPU, without pushing it out of VRAM somehow. How do you manage this?
The performance tests show a very nice speedup! Just to confirm, it is not caused by the caching, right?
Thanks for looking into it :)
As GPU support is not a priority, we could consider it not part of the proposal. The same goes for caching/OpKeys and the new features. That would simplify things a lot: just improving CPU performance and memory consumption for the time being. If it's ever decided to support the things I've talked about in the proposal, we can always come back to it.
About canvas-based:
Yes, the buffer recycler will create or reuse a buffer that has enough memory for the resolution of the operation. The resolution is set in NodeOperation::determineResolution as always; it can be anything. If the operation needs more buffers for intermediate modifications, its code can request more buffers of any resolution from the recycler and give them back once finished. The "fit" option of the Lens Distortion node is an example, right? I've just tried it; it works fine.
About optimization step:
It's the number of image reads (whole operation buffer results). An example node tree:
ImageOperation -> GammaOperation -> ViewerOperation
First phase (optimization): the operations call getPixels() on their input operations to get their output buffers, like this:
ViewerOperation::getPixels() -> GammaOperation::getPixels(this) -> ImageOperation::getPixels(this)
With the "this" argument, the calling operation tells the read operation that it is the reader. The read operation then reports to the ReadOptimizer that it has received a read. The ReadOptimizer counts the read and registers the operation that did it. The ReadOptimizer action log, in order:
For the caching comparison, take for example this tree:
ImageOperation -> BrightnessOperation -> CacheOperation -> GammaOperation -> ViewerOperation
· SessionUUID: Cache retrieved.
· OpKey: Cache retrieved.
· SessionUUID: Previous cache is invalidated, new cache is saved.
· OpKey: New cache saved. Previous caches are kept.
· SessionUUID: Previous cache is invalidated, new cache is saved.
· OpKey: Cache retrieved.
· SessionUUID: Previous cache is invalidated, new cache is saved.
· OpKey: New cache saved. Previous caches are kept.
· SessionUUID: Previous cache is invalidated, new cache is saved.
· OpKey: Cache retrieved.
· SessionUUID: Previous cache is invalidated.
· OpKey: Do nothing.
· SessionUUID: Has no cache, save new cache.
· OpKey: Cache retrieved.
The OpKey approach works for modifications both after and before a cache node, as nothing is calculated twice. Just by using a cache node at the end of the tree, the whole tree benefits. As a user, I switch between the same values a lot to compare results; with the cache, once something is calculated it's fast to compare values anywhere. It would work for animations too, if they were cached. But of course with all this I'm adding a lot of code, and I understand if you consider it not worth maintaining. In that case I would do it as you said, but it worries me that if an operation uses a value that is not calculated/obtained from nodes but affects the image result, the dependency graph won't know about it and the cache won't be invalidated. With the OpKey approach you can hash anything in hashParams(), since in that context you know and can access everything the operation uses. This can happen because there are operations that use other Blender libraries.
About GPU separation layer:
If you mean separating GPU/CPU code in operations: it would be cleaner and I wouldn't mind doing it, but I would have to maintain two versions of the same image algorithm, one for the CPU in a .cpp file and another for the GPU in a .cl file using OpenCL-specific code. That leaves room for introducing differences in the algorithm between CPU and GPU, and it requires bug fixing/debugging for both versions. In my case I haven't found any OpenCL debugger that works for me; "printf" only.
As I've abstracted what is needed from OpenCL, I only have to maintain one version.
The "defmerger" tool lets me write code once, in the .cpp file only. It looks through all the operation .cpp files and merges the code into an OpenCL file at compile time. If the code works for the CPU, it automatically works for the GPU. The benefit I get from debugging CPU code carries over to the GPU, even if I'm not really debugging "GPU" code. It's rare that code won't work for both; you just need to take into account that CPU code executes a loop and GPU code doesn't. I rarely need to change code in defmerger or the GPU abstraction. Adding new operations and fixing bugs in them is much more frequent, and the operations code as a whole is much larger and more complex; I cannot afford two versions of it. If in the future we want to replace OpenCL with Vulkan, it may be enough to add new code to the GPU abstraction, without changing any operation code.
The downside is that I have to use some macros.
One thing to improve is that BufferManager and BufferRecycler have to ask a lot whether a buffer is from the CPU or the GPU; that needs more thought.
About how the operation buffers are managed for GPU:
GPU buffers are recycled the same way as CPU buffers. But it has to take into account that some operations have a compute (GPU) implementation and others don't. If an operation is computed, it creates/recycles a GPU buffer. If the next operation is computed too, it uses the GPU buffers from the previous operations, keeping them in GPU memory. If the next operation has no compute support, the input buffers have to be mapped to host memory; there is no choice. To be able to execute a tree from start to finish on the GPU only, I would have to get rid of all nodes that have no GPU support. There are nodes that are impossible to implement on the GPU, for example the Denoise node, as it uses a library. I think it's good to have the flexibility to support all kinds of nodes, but to try to implement as many with compute support as possible. From the profiling I've done, the performance impact of "queue job, execute it" several times, instead of queueing all jobs or executing a single combined job, is minimal. Here a job/task is an operation's write algorithm over a whole image, which is a lot of work. What has a big impact on performance is mapping/copying GPU buffers to host memory and vice versa. I try to avoid that as much as possible and only do it when needed.
Putting the main focus on getting the CPU to work nicely and fast, with a reasonable amount of memory, seems like a good choice to me.
Further development can then happen on top of it.
I'm not sure why any support from the operation itself is needed to provide information about reads. To me, the clear and robust approach would be to handle it from the scheduler side, based on the nodes and links in the compositor graph.
The re-usable memory buffers are another complexity whose necessity is not immediately clear to me. Doing a frame-sized allocation is fast compared to the logic of a node. So to me it is important to free buffers as soon as possible, to keep memory usage low, but trying to avoid allocation calls doesn't seem that important. One thing that can be done here is to have all buffer (de)allocation happen in a centralized place, so that if allocation ever pops up in a profiler, a fix can be implemented in one single place.
I'm curious whether you've already made benchmarks like this and seen allocation be a bottleneck.
To me, depsgraph-based invalidation seems more reliable and clear. While it might not be the fastest in all cases (as your examples showed), it deals with cases like a changed property of a generated image used in an Image input node, or when one changes the distortion model of a Movie Clip. Hashing all those settings from the compositor seems quite a fragile thing to do.
I wasn't really talking about two different kernel codepaths. Just from the code example, it seemed that host and device code are interleaved and controlled by preprocessor directives. This is something oneAPI is trying to solve: allowing easy interfacing of device and host code. But in the case of CUDA/OpenCL, the cleanest approach is still to have kernel code separate from host code. Surely there could be some ifdefs in the kernel to fine-tune specific cases, but that is more an exception than a rule.
To me it's more robust to do it from the operation, because that is where the reading is actually executed. An operation may decide to read or not read an input operation based on its own logic. It makes sense to me that an operation avoids reading input operations when it doesn't need them. I wouldn't rely on the node graph for anything related to operation execution; it doesn't represent what is actually executed in the end.
I did benchmarks. I only remember one with a lot of color operations, which are usually very simple (even one line of code in the loop), around 80 nodes: approximately 2 seconds without recycling and 1 second with recycling on CPU. For GPU it's even more important. I think it would be worth it even if all operations were complex. I'll try to make a simple test that you can easily execute and decide yourself whether it's worth it or not. I'll do it another day and post it here.
The read can also be conditional. For example, if the factor of a mix node is controlled by an image, the mix operation might decide to skip reading either of the image inputs based on the mix factor. If you want to check the number of reads reliably in this case you'd need to fully execute the operation, at which point you can as well just use the output rather than try to optimize something and run a second round of execution.
A more straightforward process usually is:
What is the configuration of your machine (CPU, OS, RAM)?
Operating system: Windows-10-10.0.18362-SP0 64 Bits
CPU: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Graphics card: GeForce GTX 1060 6GB/PCIe/SSE2 NVIDIA Corporation 4.5.0 NVIDIA 456.38
RAM: DDR4 16 GB
I did a test for buffer recycling. I can only test it on Windows, and I get similar results to what I said. On Linux I tested on a VM, but I'm not sure the results are reliable; it's just very slow with both approaches. Still, the difference on Linux seems to be much smaller than on Windows -> https://developer.blender.org/P1830
What does this part do? I don't understand what it means.
Constant folding is something from compiler optimization. The long version is on Wikipedia.
In the context of nodes you can think of it as:
If you have a mix node with factor 0 (which is not connected to anything) you can always use the first input as the output (so you never need the second input). Similarly, if the factor is 1 you never need the first input, and can use the second input as the output.
You do such optimizations on the graph itself, by removing the node which got "constant folded" and re-routing links. In this example
becomes
Similarly you can remove blur nodes with blur size 0. You can pre-calculate math nodes, things like that.
We do have constant folding in Cycles (ShaderGraph::constant_fold). Worth a look!

I looked at it. I get the idea now; I'll do it as you said. Constant folding logic may be added to each node's convertToOperations.
A few questions/suggestions:
I've made a plan taking into account everything we've talked about. I still have to register the reads before executing operations; it's not a pass, think of it like the determineResolution method. We cannot predict resolutions just by looping over the operations tree. Similarly, here I cannot predict rendered rects, and they affect the read counts we need to know when we can free output buffers.
The following starts from the current compositor implementation; let's forget about my branch. Let's see if it's ok:
Some diagrams; they just show the most important things:



Removed subscriber: @higgsas
Added subscriber: @higgsas
Added subscriber: @zazizizou
Added subscriber: @Pipeliner
Added subscriber: @TravisOBrien
Added subscriber: @Aeraglyx
Added subscriber: @AdamEarle
Taking a quick look at all of this, honestly I can't help but think the Blender compositor needs to be completely separated from Blender, and the whole thing just needs to follow some industry commonalities.
There are huge bottleneck problems in Blender when it comes to rendering and comping if you're not doing something straight out of the box from the Blender studio's workflow.
There's so much to discuss here that I'm not sure where to start. On top of that, I am pretty sure Blender devs will reject most of it and pass it off as too much.
I can't help but think that this is a waste of time, same as the dev work on the VSE.
I understand the need for the VSE and the Compositor being developed 10 or 20 years ago, though it doesn't make sense anymore unless the pipeline becomes more integrated with itself and, most importantly, with the rest of the world.
From what I know, the current code doesn't fit with the rest of the world. I'm pretty sure that if we continue on this path Blender still won't be able to use the entry level for compositors, "Open Effects". So it might be time to drop it.
Blender IMO would honestly be better off taking some of its devs and having them focus on building better integration and improving Kdenlive and Natron. This would actually be more beneficial for studios and open source than just the Blender studio's typical workflow.
I know this is pretty cut-throat; honestly, we need to be addressing serious workflow bottleneck issues that cost weeks if not months of lost time and money.
Open to chat about it.
Added subscriber: @Jeroen-Bakker
@AdamEarle please use appropriate channels for such an open discussion.
If you have comments about the design itself, please do it in the ticket.
Your comment does not help us or this design. I am open to any discussion as long as we can do it in a respectable way.
I have no idea what to say to you. What I can say is that I genuinely have good intentions.
So honestly, do you just want to tinker with Blender, or do you really want to help the industry? Which is it?
The best thing I can suggest is to organize talks with some fast-operating compositors from the industry, compers that use Nuke, Fusion, After Effects and so on. I mean really sit down, listen, and take notes on things like:
What are they doing every day?
What are the choke points of projects?
What's taking up the most time?
How do you mitigate rendering, animation, lighting, tracking, and alignment problems?
How do you have a life outside of work? (that's a joke).
I'm sure if you do that things here may just change a little.
If you can work at this level working with the compositor, then honestly you can make a huge shift in the world for the better.
Here's the thing though: it's up to you, mate, and how much you want to be a part of the compositing community.
So are you a Tinkerer, or are you a Lifer like me? The rest is up to you and your research. Good luck, mate; I honestly don't know what else to say to you.
@AdamEarle thanks for the feedback. I think it's better to continue this discussion on devtalk. As you can see in this nice post it's more helpful to have concrete complaints or suggestions.
For example, why is the current state bad? What are you trying to achieve, and what are your complaints about using the compositor: missing a specific node? Blender too slow? Wrong place for a button? etc...
Added subscriber: @Raimund58
Added subscriber: @Robonnet
Removed subscriber: @AdamEarle
Added subscriber: @ok_what