Realtime Compositor: Implement Classic Kuwahara #109292

Merged
Omar Emara merged 11 commits from OmarEmaraDev/blender:classic-kuwahara-filter-gpu into main 2023-07-19 14:04:25 +02:00
Member

This patch implements the Classic Kuwahara node for the Realtime Compositor.

A naive O(radius^2) implementation is used for radii up to 5 pixels, and a
constant O(1) implementation based on summed area tables is used for higher
radii at the cost of building and storing the tables.

The SAT implementation is based on that described in:

Nehab, Diego, et al. "GPU-efficient recursive filtering and summed-area tables."

Additionally, the Result class now allows full precision texture allocation, was
was necessary for storing the SAT tables.

This patch implements the Classic Kuwahara node for the Realtime Compositor. A naive O(radius^2) implementation is used for radii up to 5 pixels, and a constant O(1) implementation based on summed area tables is used for higher radii at the cost of building and storing the tables. The SAT implementation is based on that described in: Nehab, Diego, et al. "GPU-efficient recursive filtering and summed-area tables." Additionally, the Result class now allows full precision texture allocation, was was necessary for storing the SAT tables.
Omar Emara added the
Interest
Compositing
label 2023-06-23 15:25:37 +02:00
Omar Emara added 1 commit 2023-06-23 15:25:49 +02:00
047b1b31dc Realtime Compositor: Implement Classic Kuwahara
This patch implements the Classic Kuwahara node for the Realtime
Compositor. This is still a naive implementation and can probably be
accelerated using a SAT table in the future.
Omar Emara added 2 commits 2023-06-28 17:39:10 +02:00
Omar Emara added 5 commits 2023-07-10 17:04:21 +02:00
Omar Emara changed title from WIP: Realtime Compositor: Implement Classic Kuwahara to Realtime Compositor: Implement Classic Kuwahara 2023-07-10 17:04:44 +02:00
Omar Emara requested review from Sergey Sharybin 2023-07-10 17:21:11 +02:00
Omar Emara requested review from Clément Foucault 2023-07-10 17:21:12 +02:00
Omar Emara requested review from Habib Gahbiche 2023-07-10 17:21:12 +02:00

@blender-bot package

@blender-bot package
Member

Package build started. Download here when ready.

Package build started. [Download here](https://builder.blender.org/download/patch/PR109292) when ready.
Omar Emara added 1 commit 2023-07-10 18:42:46 +02:00
Sergey Sharybin reviewed 2023-07-12 15:03:20 +02:00
Sergey Sharybin left a comment
Owner

This feels so nice now to play with the filter :)

The result looks different between CPU and GPU, but guess this is expected because of the different algorithm used?

A naive O(radius^2) implementation is used for radii up to 5 pixels, and a constant O(1) implementation based on summed area tables is used for higher radii at the cost of building and storing the tables.

It would be great to explicitly state why is it so. Can speculate that smaller radius works faster than doing pre-computation. Is it correct?

The I did not cross-check with the paper, but locally it looks nice and readable and documented :) Only have some minor comment to make it easier to find out what the equations are referring to from the comments.

P.S. As a side note, would it make sense to switch the CPU implementation to SAT as well?

This feels so nice now to play with the filter :) The result looks different between CPU and GPU, but guess this is expected because of the different algorithm used? > A naive O(radius^2) implementation is used for radii up to 5 pixels, and a constant O(1) implementation based on summed area tables is used for higher radii at the cost of building and storing the tables. It would be great to explicitly state why is it so. Can speculate that smaller radius works faster than doing pre-computation. Is it correct? The I did not cross-check with the paper, but locally it looks nice and readable and documented :) Only have some minor comment to make it easier to find out what the equations are referring to from the comments. P.S. As a side note, would it make sense to switch the CPU implementation to SAT as well?
@ -0,0 +28,4 @@
}
}
/* Computes the horizontal and vertical incomplete prologues from the given input using equations

The paper needs to be stated here, as it is unclear where the equation numbers are referring to. Can be a comment at the top the .cc file.

The paper needs to be stated here, as it is unclear where the equation numbers are referring to. Can be a comment at the top the .cc file.
OmarEmaraDev marked this conversation as resolved
@ -0,0 +181,4 @@
/* An implementation of the summed area table algorithm from the paper:
*
* Nehab, Diego, et al. "GPU-efficient recursive filtering and summed-area tables."

Move this to the top of the file. Will solve the note from above :)

Move this to the top of the file. Will solve the note from above :)
OmarEmaraDev marked this conversation as resolved
Sergey Sharybin added this to the Compositing project 2023-07-12 15:53:28 +02:00
Author
Member

@Sergey The difference between the CPU and GPU is due to the fact that the CPU uses the variance of the luminance of pixel colors, while the GPU uses the sum of variance of pixel color channels. I completely forgot to mention that. I made that decision because it saves us from computing yet another SAT for luminance, the original algorithm for Kuwahara used that method, and the difference was not significant for low radii and didn't make sense for high radii. We can, however, weight the variance of channels using luminance coefficients to get a result closer to CPU if this is something that we want to do. Either way, we should adapt the CPU implementation to be the same.

So the difference is not due to the different algorithm.

Yes, correct, small radii are faster to compute without SAT, that's why I have a naive implementation for radii <= 5.

And yes, it would make sense to use an SAT on the CPU as well, and it's implementation should be trivial since it can be done using a parallel loop on rows and columns.

@Sergey The difference between the CPU and GPU is due to the fact that the CPU uses the variance of the luminance of pixel colors, while the GPU uses the sum of variance of pixel color channels. I completely forgot to mention that. I made that decision because it saves us from computing yet another SAT for luminance, the original algorithm for Kuwahara used that method, and the difference was not significant for low radii and didn't make sense for high radii. We can, however, weight the variance of channels using luminance coefficients to get a result closer to CPU if this is something that we want to do. Either way, we should adapt the CPU implementation to be the same. So the difference is not due to the different algorithm. Yes, correct, small radii are faster to compute without SAT, that's why I have a naive implementation for radii <= 5. And yes, it would make sense to use an SAT on the CPU as well, and it's implementation should be trivial since it can be done using a parallel loop on rows and columns.

The difference between the CPU and GPU is due to the fact that the CPU uses the variance of the luminance of pixel colors, while the GPU uses the sum of variance of pixel color channels.

This effectively un-does the #108858 ?

> The difference between the CPU and GPU is due to the fact that the CPU uses the variance of the luminance of pixel colors, while the GPU uses the sum of variance of pixel color channels. This effectively un-does the #108858 ?
Author
Member

@Sergey Not exactly, this seems to be about computing the direction from luminance in the anisotropic variant. In the classic variant, we essentially compute the average variance of channels and choose a quadrant based on that, while the CPU versipon compute the variance of luminance.

@Sergey Not exactly, this seems to be about computing the direction from luminance in the anisotropic variant. In the classic variant, we essentially compute the average variance of channels and choose a quadrant based on that, while the CPU versipon compute the variance of luminance.

Ah, indeed that PR was about Anisotropic case.

I am still confused though where the difference is coming from.

I am comparing Classic Kuwahara node with different radius on a frame from the Tears of Steel movie A014C005_12051000000.exr

With the current state of patch at the Full Frame and GPU compositors seem to give matched results (at least no difference that an eye cat catch, I did not run the idiff or anything like that on the actual output result).

However, going size 8 starts to show the difference. Changing const float lum = IMB_colormanagement_get_luminance(color); to const float lum = (color[0] + color[1] + color[2]) / 3.0f; in the COM_KuwaharaClassicOperation.cc did not seem to make results to match either.

Am I missing something, or is the SAT actually gives different result from the naive implementation?
I am looking at from the perspective of: what would it take to make the CPU and GPU implementations to give matching results. To me it is not yet clear in this patch, and we really need to make it a priority to have the results matching as close as possible (just like we do it in Cycles).

Ah, indeed that PR was about Anisotropic case. I am still confused though where the difference is coming from. I am comparing Classic Kuwahara node with different radius on a frame from the Tears of Steel movie [A014C005_12051000000.exr](https://download.blender.org/ftp/sergey/attic/A014C005_12051000000.exr) With the current state of patch at the Full Frame and GPU compositors seem to give matched results (at least no difference that an eye cat catch, I did not run the idiff or anything like that on the actual output result). However, going size 8 starts to show the difference. Changing `const float lum = IMB_colormanagement_get_luminance(color);` to `const float lum = (color[0] + color[1] + color[2]) / 3.0f;` in the `COM_KuwaharaClassicOperation.cc` did not seem to make results to match either. Am I missing something, or is the SAT actually gives different result from the naive implementation? I am looking at from the perspective of: what would it take to make the CPU and GPU implementations to give matching results. To me it is not yet clear in this patch, and we really need to make it a priority to have the results matching as close as possible (just like we do it in Cycles).
Author
Member

I will investigate that and create a patch that makes the classic implementation on CPU inline with the GPU implementation, hopefully that should clarify the difference.

And to clarify const float lum = (color[0] + color[1] + color[2]) / 3.0f; is not how the GPU implementation works. I will clarify that in my patch.

I will investigate that and create a patch that makes the classic implementation on CPU inline with the GPU implementation, hopefully that should clarify the difference. And to clarify `const float lum = (color[0] + color[1] + color[2]) / 3.0f;` is not how the GPU implementation works. I will clarify that in my patch.
Omar Emara added 2 commits 2023-07-18 13:16:06 +02:00
Author
Member

@Sergey Here is a patch to make the Tiled implementation similar to the GPU implementation. We simply compute the mean per color channel then compute the final variance as the average variance of channels. Hopefully this should clarify the difference.

Note that this is just a demonstrative patch, and is only for the tiled implementation, not the the full frame one.

I was also wrong about the second difference, boundary handing is identical in both implementations.

@Sergey Here is a patch to make the Tiled implementation similar to the GPU implementation. We simply compute the mean per color channel then compute the final variance as the average variance of channels. Hopefully this should clarify the difference. Note that this is just a demonstrative patch, and is only for the tiled implementation, not the the full frame one. I was also wrong about the second difference, boundary handing is identical in both implementations.
Clément Foucault approved these changes 2023-07-18 20:19:30 +02:00
Clément Foucault left a comment
Member

I'm pretty much following Sergey here. Patch is nicely documented. I tested it on M1 macbook pro and it works. As for compatibility with the CPU compositor, I leave that to Sergey to decide.

I'm pretty much following Sergey here. Patch is nicely documented. I tested it on M1 macbook pro and it works. As for compatibility with the CPU compositor, I leave that to Sergey to decide.
Sergey Sharybin approved these changes 2023-07-19 11:19:13 +02:00
Sergey Sharybin left a comment
Owner

The CPU story would definitely need to be prioritized.

I am still not fully sure what the different handling of luma means on terms of quality. Form testing the the behavior of GPU implementation (from the point of view of the artistic output, not technical comparison to CPU) it all seems to work well. So if that is the algorithm which gives the best performance with the least memory overhead I think we should stick to it. For the CPU it would mean aligning it to the GPU algorithm. Now with the explanation and even code from Omar it seems easy to do in a separate PR :)

The CPU story would definitely need to be prioritized. I am still not fully sure what the different handling of luma means on terms of quality. Form testing the the behavior of GPU implementation (from the point of view of the artistic output, not technical comparison to CPU) it all seems to work well. So if that is the algorithm which gives the best performance with the least memory overhead I think we should stick to it. For the CPU it would mean aligning it to the GPU algorithm. Now with the explanation and even code from Omar it seems easy to do in a separate PR :)
Omar Emara merged commit 940558f9ac into main 2023-07-19 14:04:25 +02:00
Omar Emara deleted branch classic-kuwahara-filter-gpu 2023-07-19 14:04:27 +02:00
Sergey Sharybin removed this from the Compositing project 2023-08-08 16:40:15 +02:00
Sign in to join this conversation.
No Label
Interest
Alembic
Interest
Animation & Rigging
Interest
Asset Browser
Interest
Asset Browser Project Overview
Interest
Audio
Interest
Automated Testing
Interest
Blender Asset Bundle
Interest
BlendFile
Interest
Collada
Interest
Compatibility
Interest
Compositing
Interest
Core
Interest
Cycles
Interest
Dependency Graph
Interest
Development Management
Interest
EEVEE
Interest
EEVEE & Viewport
Interest
Freestyle
Interest
Geometry Nodes
Interest
Grease Pencil
Interest
ID Management
Interest
Images & Movies
Interest
Import Export
Interest
Line Art
Interest
Masking
Interest
Metal
Interest
Modeling
Interest
Modifiers
Interest
Motion Tracking
Interest
Nodes & Physics
Interest
OpenGL
Interest
Overlay
Interest
Overrides
Interest
Performance
Interest
Physics
Interest
Pipeline, Assets & IO
Interest
Platforms, Builds & Tests
Interest
Python API
Interest
Render & Cycles
Interest
Render Pipeline
Interest
Sculpt, Paint & Texture
Interest
Text Editor
Interest
Translations
Interest
Triaging
Interest
Undo
Interest
USD
Interest
User Interface
Interest
UV Editing
Interest
VFX & Video
Interest
Video Sequencer
Interest
Virtual Reality
Interest
Vulkan
Interest
Wayland
Interest
Workbench
Interest: X11
Legacy
Blender 2.8 Project
Legacy
Milestone 1: Basic, Local Asset Browser
Legacy
OpenGL Error
Meta
Good First Issue
Meta
Papercut
Meta
Retrospective
Meta
Security
Module
Animation & Rigging
Module
Core
Module
Development Management
Module
EEVEE & Viewport
Module
Grease Pencil
Module
Modeling
Module
Nodes & Physics
Module
Pipeline, Assets & IO
Module
Platforms, Builds & Tests
Module
Python API
Module
Render & Cycles
Module
Sculpt, Paint & Texture
Module
Triaging
Module
User Interface
Module
VFX & Video
Platform
FreeBSD
Platform
Linux
Platform
macOS
Platform
Windows
Priority
High
Priority
Low
Priority
Normal
Priority
Unbreak Now!
Status
Archived
Status
Confirmed
Status
Duplicate
Status
Needs Info from Developers
Status
Needs Information from User
Status
Needs Triage
Status
Resolved
Type
Bug
Type
Design
Type
Known Issue
Type
Patch
Type
Report
Type
To Do
No Milestone
No project
No Assignees
4 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: blender/blender#109292
No description provided.