Realtime Compositor: Implement Classic Kuwahara #109292

Merged

Omar Emara merged 11 commits from OmarEmaraDev/blender:classic-kuwahara-filter-gpu into main

2023-07-19 14:04:25 +02:00

Omar Emara commented

2023-06-23 15:25:37 +02:00

Member

This patch implements the Classic Kuwahara node for the Realtime Compositor.

A naive O(radius^2) implementation is used for radii up to 5 pixels, and a
constant O(1) implementation based on summed area tables is used for higher
radii at the cost of building and storing the tables.

The SAT implementation is based on that described in:

Nehab, Diego, et al. "GPU-efficient recursive filtering and summed-area tables."

Additionally, the Result class now allows full precision texture allocation, which
was necessary for storing the SAT tables.
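
To illustrate why a summed area table makes the filter cost independent of the radius, here is a minimal serial CPU sketch. It is not the block-based GPU algorithm from the Nehab et al. paper and not code from this patch; the names `build_sat` and `box_sum` are hypothetical, and `double` accumulation is only an assumption standing in for the full precision storage mentioned above.

```cpp
#include <cassert>
#include <vector>

// Inclusive summed area table: sat[y * width + x] holds the sum of all
// values in the rectangle from (0, 0) to (x, y), both corners included.
std::vector<double> build_sat(const std::vector<float> &image, int width, int height)
{
  assert(static_cast<int>(image.size()) == width * height);
  std::vector<double> sat(image.size());
  for (int y = 0; y < height; y++) {
    double row_sum = 0.0;
    for (int x = 0; x < width; x++) {
      row_sum += image[y * width + x];
      /* Sum of the current row prefix plus everything in the rows above. */
      sat[y * width + x] = row_sum + (y > 0 ? sat[(y - 1) * width + x] : 0.0);
    }
  }
  return sat;
}

// Sum over the inclusive rectangle [x0, x1] x [y0, y1] using at most four
// lookups, regardless of how large the rectangle is.
double box_sum(const std::vector<double> &sat, int width, int x0, int y0, int x1, int y1)
{
  assert(x0 >= 0 && y0 >= 0 && x0 <= x1 && y0 <= y1);
  const double a = sat[y1 * width + x1];
  const double b = (x0 > 0) ? sat[y1 * width + (x0 - 1)] : 0.0;
  const double c = (y0 > 0) ? sat[(y0 - 1) * width + x1] : 0.0;
  const double d = (x0 > 0 && y0 > 0) ? sat[(y0 - 1) * width + (x0 - 1)] : 0.0;
  return a - b - c + d;
}
```

Building the table touches every pixel once, which is the up-front cost the description refers to; after that, each quadrant sum is a constant number of reads.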

Omar Emara added the

label 2023-06-23 15:25:37 +02:00

Omar Emara added 1 commit 2023-06-23 15:25:49 +02:00

047b1b31dc Realtime Compositor: Implement Classic Kuwahara

This patch implements the Classic Kuwahara node for the Realtime
Compositor. This is still a naive implementation and can probably be
accelerated using a SAT table in the future.

Omar Emara added 2 commits 2023-06-28 17:39:10 +02:00

2462630da7 Merge branch 'main' into classic-kuwahara-filter-gpu

1018704d9c Initial implementation of Summed Area Table algorithm

Omar Emara added 5 commits 2023-07-10 17:04:21 +02:00

259a114616 Merge branch 'main' into classic-kuwahara-filter-gpu

a0ed9f4151 Correct final phase dispatch domain

f101bc3b15 Correct X prologue sum

b9eb72f552 Finish Classic Kuwahara implementation

buildbot/vexp-code-patch-coordinator Build done. Details

827c139ba3 Merge branch 'main' into classic-kuwahara-filter-gpu

Omar Emara changed title from ~~WIP: Realtime Compositor: Implement Classic Kuwahara~~ to Realtime Compositor: Implement Classic Kuwahara

2023-07-10 17:04:44 +02:00

Omar Emara requested review from Sergey Sharybin 2023-07-10 17:21:11 +02:00

Omar Emara requested review from Clément Foucault 2023-07-10 17:21:12 +02:00

Omar Emara requested review from Habib Gahbiche 2023-07-10 17:21:12 +02:00

Sergey Sharybin commented

2023-07-10 17:35:30 +02:00

Owner

@blender-bot package

Blender Bot commented

2023-07-10 17:35:33 +02:00

Member

Package build started. [Download here](https://builder.blender.org/download/patch/PR109292) when ready.

Omar Emara added 1 commit 2023-07-10 18:42:46 +02:00

ffd462a5ba Assert after exhaustive enum switch case

Sergey Sharybin reviewed 2023-07-12 15:03:20 +02:00

Sergey Sharybin left a comment

Owner

This feels so nice now to play with the filter :)

The result looks different between CPU and GPU, but I guess this is expected because of the different algorithm used?

> A naive O(radius^2) implementation is used for radii up to 5 pixels, and a constant O(1) implementation based on summed area tables is used for higher radii at the cost of building and storing the tables.

It would be great to explicitly state why it is so. I can speculate that a smaller radius works faster than doing the pre-computation. Is that correct?

I did not cross-check with the paper, but locally it looks nice and readable and documented :) I only have some minor comments to make it easier to find out what the equations are referring to from the comments.

P.S. As a side note, would it make sense to switch the CPU implementation to SAT as well?

source/blender/compositor/realtime_compositor/algorithms/intern/summed_area_table.cc Outdated

						
    @ -0,0 +28,4 @@
      }
    }

    /* Computes the horizontal and vertical incomplete prologues from the given input using equations

Sergey Sharybin commented

2023-07-12 14:56:49 +02:00

Owner

The paper needs to be stated here, as it is unclear what the equation numbers are referring to. It can be a comment at the top of the .cc file.

OmarEmaraDev marked this conversation as resolved

source/blender/compositor/realtime_compositor/algorithms/intern/summed_area_table.cc

						
    @ -0,0 +181,4 @@
    /* An implementation of the summed area table algorithm from the paper:
     *
     *   Nehab, Diego, et al. "GPU-efficient recursive filtering and summed-area tables."

Sergey Sharybin commented

2023-07-12 14:58:24 +02:00

Owner

Move this to the top of the file. Will solve the note from above :)

OmarEmaraDev marked this conversation as resolved

Sergey Sharybin added this to the Compositing project 2023-07-12 15:53:28 +02:00

Omar Emara commented

2023-07-12 19:12:17 +02:00

Author

Member

@Sergey The difference between the CPU and GPU is due to the fact that the CPU uses the variance of the luminance of pixel colors, while the GPU uses the sum of variance of pixel color channels. I completely forgot to mention that. I made that decision because it saves us from computing yet another SAT for luminance, the original algorithm for Kuwahara used that method, and the difference was not significant for low radii and didn't make sense for high radii. We can, however, weight the variance of channels using luminance coefficients to get a result closer to CPU if this is something that we want to do. Either way, we should adapt the CPU implementation to be the same.

So the difference is not due to the different algorithm.

Yes, correct, small radii are faster to compute without SAT, that's why I have a naive implementation for radii <= 5.

And yes, it would make sense to use an SAT on the CPU as well, and its implementation should be trivial since it can be done using a parallel loop on rows and columns.
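
The two selection metrics contrasted in this comment can be sketched as follows. This is only an illustration, not code from the patch: the `Color` alias and the function names are hypothetical, and the Rec. 709 luminance weights are an assumption (the CPU code calls `IMB_colormanagement_get_luminance`, whose coefficients are not reproduced here). The `variance` helper works from a sum and a sum of squares, which is the form that sums read from a table would take.

```cpp
#include <array>
#include <cassert>
#include <vector>

using Color = std::array<float, 3>;

// Variance from running sums: Var(x) = E[x^2] - E[x]^2.
double variance(double sum, double sum_of_squares, int count)
{
  assert(count > 0);
  const double mean = sum / count;
  return sum_of_squares / count - mean * mean;
}

// GPU-style metric as described in the thread: sum of the per-channel variances.
double channel_variance_sum(const std::vector<Color> &pixels)
{
  double total = 0.0;
  for (int c = 0; c < 3; c++) {
    double sum = 0.0;
    double sum_of_squares = 0.0;
    for (const Color &pixel : pixels) {
      sum += pixel[c];
      sum_of_squares += double(pixel[c]) * pixel[c];
    }
    total += variance(sum, sum_of_squares, int(pixels.size()));
  }
  return total;
}

// CPU-style metric as described in the thread: variance of one luminance
// value per pixel. The Rec. 709 weights below are an assumption.
double luminance_variance(const std::vector<Color> &pixels)
{
  double sum = 0.0;
  double sum_of_squares = 0.0;
  for (const Color &pixel : pixels) {
    const double luminance = 0.2126 * pixel[0] + 0.7152 * pixel[1] + 0.0722 * pixel[2];
    sum += luminance;
    sum_of_squares += luminance * luminance;
  }
  return variance(sum, sum_of_squares, int(pixels.size()));
}
```

The luminance metric needs sums of luminance and squared luminance in addition to the color sums used for the quadrant mean, which is consistent with the remark about avoiding yet another SAT. The two metrics are not proportional in general, so they can pick different quadrants.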

Sergey Sharybin commented

2023-07-13 14:14:29 +02:00

Owner

> The difference between the CPU and GPU is due to the fact that the CPU uses the variance of the luminance of pixel colors, while the GPU uses the sum of variance of pixel color channels.

Does this effectively undo #108858?

Omar Emara commented

2023-07-18 10:35:09 +02:00

Author

Member

@Sergey Not exactly, this seems to be about computing the direction from luminance in the anisotropic variant. In the classic variant, we essentially compute the average variance of channels and choose a quadrant based on that, while the CPU version computes the variance of luminance.
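
For readers unfamiliar with the classic variant, the quadrant selection described here can be sketched for a single pixel with the naive O(radius^2) approach. This is a hedged illustration rather than the shader from the patch: `kuwahara_pixel` and `Color` are hypothetical names, and skipping out-of-bounds pixels is an assumption about boundary handling.

```cpp
#include <array>
#include <cassert>
#include <limits>
#include <vector>

using Color = std::array<float, 3>;

// Naive classic Kuwahara for one pixel: examine the four (radius + 1)^2
// quadrants that share the center pixel and return the mean color of the
// quadrant with the lowest summed channel variance. Pixels outside the
// image are skipped, which is an assumption of this sketch.
Color kuwahara_pixel(
    const std::vector<Color> &image, int width, int height, int x, int y, int radius)
{
  assert(radius >= 0 && x >= 0 && x < width && y >= 0 && y < height);
  const int signs[4][2] = {{-1, -1}, {1, -1}, {-1, 1}, {1, 1}};
  double best_variance = std::numeric_limits<double>::max();
  Color best_mean = image[y * width + x];
  for (const auto &sign : signs) {
    double sum[3] = {0.0, 0.0, 0.0};
    double sum_of_squares[3] = {0.0, 0.0, 0.0};
    int count = 0;
    for (int j = 0; j <= radius; j++) {
      for (int i = 0; i <= radius; i++) {
        const int px = x + i * sign[0];
        const int py = y + j * sign[1];
        if (px < 0 || px >= width || py < 0 || py >= height) {
          continue;
        }
        const Color &pixel = image[py * width + px];
        for (int c = 0; c < 3; c++) {
          sum[c] += pixel[c];
          sum_of_squares[c] += double(pixel[c]) * pixel[c];
        }
        count++;
      }
    }
    /* The center pixel is always in bounds, so count is at least one. */
    double quadrant_variance = 0.0;
    Color mean;
    for (int c = 0; c < 3; c++) {
      const double channel_mean = sum[c] / count;
      mean[c] = float(channel_mean);
      quadrant_variance += sum_of_squares[c] / count - channel_mean * channel_mean;
    }
    if (quadrant_variance < best_variance) {
      best_variance = quadrant_variance;
      best_mean = mean;
    }
  }
  return best_mean;
}
```

Summing or averaging the three channel variances ranks the quadrants identically, since the two differ only by a constant factor. The per-quadrant sums in the inner loops are what a summed area table would replace with a constant number of lookups.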

Sergey Sharybin commented

2023-07-18 11:47:09 +02:00

Owner

Ah, indeed that PR was about the Anisotropic case.

I am still confused though about where the difference is coming from.

I am comparing the Classic Kuwahara node with different radii on a frame from the Tears of Steel movie: [A014C005_12051000000.exr](https://download.blender.org/ftp/sergey/attic/A014C005_12051000000.exr)

With the current state of the patch, the Full Frame and GPU compositors seem to give matching results (at least no difference that an eye can catch; I did not run idiff or anything like that on the actual output).

However, going to size 8 starts to show a difference. Changing `const float lum = IMB_colormanagement_get_luminance(color);` to `const float lum = (color[0] + color[1] + color[2]) / 3.0f;` in `COM_KuwaharaClassicOperation.cc` did not seem to make the results match either.

Am I missing something, or does the SAT actually give a different result from the naive implementation?
I am looking at it from the perspective of: what would it take to make the CPU and GPU implementations give matching results? To me this is not yet clear in this patch, and we really need to make it a priority to have the results match as closely as possible (just like we do in Cycles).

Omar Emara commented

2023-07-18 11:59:21 +02:00

Author

Member

I will investigate that and create a patch that brings the classic implementation on the CPU in line with the GPU implementation; hopefully that should clarify the difference.

And to clarify, `const float lum = (color[0] + color[1] + color[2]) / 3.0f;` is not how the GPU implementation works. I will clarify that in my patch.

Omar Emara added 2 commits 2023-07-18 13:16:06 +02:00

5cbb70eab3 Merge branch 'main' into classic-kuwahara-filter-gpu

ba97bcb8e8 Move general comment at the top of CC file

Omar Emara commented

2023-07-18 14:22:06 +02:00

Author

Member

@Sergey Here is a patch to make the Tiled implementation similar to the GPU implementation. We simply compute the mean per color channel, then compute the final variance as the average variance of channels. Hopefully this should clarify the difference.

Note that this is just a demonstrative patch, and is only for the tiled implementation, not the full frame one.

I was also wrong about the second difference; boundary handling is identical in both implementations.

classic-kuwahara-tiled-gpu-compatible.diff

4.4 KiB

Clément Foucault approved these changes 2023-07-18 20:19:30 +02:00

Clément Foucault left a comment

Member

I'm pretty much following Sergey here. The patch is nicely documented. I tested it on an M1 MacBook Pro and it works. As for compatibility with the CPU compositor, I leave that to Sergey to decide.

Sergey Sharybin approved these changes 2023-07-19 11:19:13 +02:00

Sergey Sharybin left a comment

Owner

The CPU story would definitely need to be prioritized.

I am still not fully sure what the different handling of luma means in terms of quality. From testing the behavior of the GPU implementation (from the point of view of the artistic output, not a technical comparison to the CPU), it all seems to work well. So if that is the algorithm which gives the best performance with the least memory overhead, I think we should stick to it. For the CPU it would mean aligning it to the GPU algorithm. Now with the explanation and even code from Omar it seems easy to do in a separate PR :)

Omar Emara merged commit 940558f9ac into main

2023-07-19 14:04:25 +02:00

Omar Emara referenced this issue from a commit

2023-07-19 14:04:26 +02:00

Realtime Compositor: Implement Classic Kuwahara

Omar Emara deleted branch classic-kuwahara-filter-gpu

2023-07-19 14:04:27 +02:00

William Leeson referenced this issue from a commit

2023-07-21 10:01:54 +02:00

Realtime Compositor: Implement Classic Kuwahara

Sergey Sharybin removed this from the Compositing project 2023-08-08 16:40:15 +02:00

Reference: blender/blender#109292
