Speedup classic Kuwahara filter by summed area table #111150
Reference: blender/blender#111150
Implemented SAT for CPU.
Pros:
Cons:
The following image shows the different methods divided by the naive implementation (black is perfect; color/white is deviation from perfect):
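For context, a summed area table lets any axis-aligned rectangle be summed with four lookups after a single prefix pass over the image, which is what makes the filter's cost independent of kernel size. A minimal single-channel sketch (illustrative only, not the Blender implementation; names are made up):

```cpp
#include <vector>

/* Build an inclusive summed area table: sat[y][x] = sum of image[0..y][0..x]. */
std::vector<std::vector<float>> build_sat(const std::vector<std::vector<float>> &image)
{
  const int height = image.size(), width = image[0].size();
  std::vector<std::vector<float>> sat(height, std::vector<float>(width, 0.0f));
  for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
      sat[y][x] = image[y][x];
      if (x > 0) sat[y][x] += sat[y][x - 1];
      if (y > 0) sat[y][x] += sat[y - 1][x];
      if (x > 0 && y > 0) sat[y][x] -= sat[y - 1][x - 1];
    }
  }
  return sat;
}

/* Sum over the inclusive rectangle [x0, x1] x [y0, y1] with four lookups. */
float sat_sum(const std::vector<std::vector<float>> &sat, int x0, int y0, int x1, int y1)
{
  float sum = sat[y1][x1];
  if (x0 > 0) sum -= sat[y1][x0 - 1];
  if (y0 > 0) sum -= sat[y0 - 1][x1];
  if (x0 > 0 && y0 > 0) sum += sat[y0 - 1][x0 - 1];
  return sum;
}
```

The trade-off discussed throughout this thread is that the subtraction of large running sums in `sat_sum` is where floating point error accumulates, which the naive per-pixel sum does not suffer from.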
@ -0,0 +207,4 @@
ASSERT_EQ(sum[0], 4);
}
TEST_F(SummedTableAreaSumTest, RightLine)
@OmarEmaraDev such cases (area of 1 line or column at the image border) yield wrong results using the SAT implementation ported from GPU. I don't think it makes sense to support this for the Kuwahara filter, because the filter size must be at least 4. Do you agree? If so, I will add a BLI_assert to only allow areas larger than 2x2.

I feel like this should work. Is it wrong in the GPU implementation as well?
I looked into the GPU implementation. The problem was the definition of "area" for SAT. In my implementation I assumed the last element of the area is not inclusive (the same way image[height] is out of bounds). This is just a matter of definition though, so I will keep the implementation and update the tests.
@blender-bot build
Changed title from "WIP: Speedup classic Kuwahara filter by summed area table" to "Speedup classic Kuwahara filter by summed area table".

@blender-bot build
The High Precision option should be used in the early return of the ConvertKuwaharaOperation::execute_classic method.

@ -32,0 +58,4 @@
* Note: best results are achieved using this optimization as well as the running error
* compensation in SummedAreaTableOperation.
*/
CalculateMeanOperation *mean = new CalculateMeanOperation();
I feel like we can just subtract 0.5 here and not the actual mean, since it will be faster, and I suspect it will achieve the desired functionality if not do it better. If we have an image with mostly values in the [0, 1] range and a region of very large values as highlights, the mean will be skewed to the higher end, and subtracting the mean from this higher-value region will not be beneficial anyway.

As discussed, I tried this and it didn't work. The reason is that small differences of 0.2 - 0.3 (2x - 3x the actual mean) cause very large differences in the squared SAT, making it asymmetric and therefore cancelling all benefits.
@ -172,0 +223,4 @@
int xx = x + dx;
int yy = y + dy;
if (xx >= 0 && yy >= 0 && xx < image->get_width() && yy < image->get_height()) {
Reverse condition and continue to reduce indentation.
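The suggested refactor is the usual early-continue pattern: invert the bounds check and skip the iteration, so the loop body stays flat. A self-contained sketch of the shape, with the `xx`/`yy` names mirroring the quoted hunk and a simple neighbor count standing in for the real body:

```cpp
/* Early-continue form of the quoted bounds check: skip out-of-bounds
 * offsets instead of nesting the loop body inside the condition. */
int count_in_bounds(int x, int y, int width, int height, int radius)
{
  int count = 0;
  for (int dy = -radius; dy <= radius; dy++) {
    for (int dx = -radius; dx <= radius; dx++) {
      const int xx = x + dx;
      const int yy = y + dy;
      if (xx < 0 || yy < 0 || xx >= width || yy >= height) {
        continue; /* Reversed condition: bail out early. */
      }
      count++; /* Real loop body goes here, one indentation level shallower. */
    }
  }
  return count;
}
```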
@ -0,0 +123,4 @@
/* Track floating point error. See below. */
float4 running_compensation = {0.0f, 0.0f, 0.0f, 0.0f};
for (BuffersIterator<float> it = result->iterate_with({}, *rect); !it.is_end(); ++it) {
I think we should attempt to multithread the SAT computation. Not sure if there is anything stopping us from doing that, but a two pass prefix sum should be easy to implement and efficient to parallelize on the CPU.
Each of the prefix sums can simply be a parallel loop over rows/columns.
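The two-pass scheme being proposed can be sketched as follows: horizontal prefix sums over rows, then vertical prefix sums over columns, where every row (resp. column) is independent and can run on its own thread. This uses plain `std::thread` purely for illustration; the actual patch discussion settled on Blender's own threading utilities:

```cpp
#include <thread>
#include <vector>

/* Two-pass SAT build, in place. Each pass parallelizes trivially
 * because rows (resp. columns) do not depend on each other. */
void build_sat_two_pass(std::vector<std::vector<float>> &image)
{
  const int height = image.size();
  const int width = image[0].size();

  /* Pass 1: horizontal prefix sum, one task per row. */
  {
    std::vector<std::thread> workers;
    for (int y = 0; y < height; y++) {
      workers.emplace_back([&image, y, width]() {
        for (int x = 1; x < width; x++) {
          image[y][x] += image[y][x - 1];
        }
      });
    }
    for (std::thread &t : workers) t.join();
  }

  /* Pass 2: vertical prefix sum, one task per column. */
  {
    std::vector<std::thread> workers;
    for (int x = 0; x < width; x++) {
      workers.emplace_back([&image, x, height]() {
        for (int y = 1; y < height; y++) {
          image[y][x] += image[y - 1][x];
        }
      });
    }
    for (std::thread &t : workers) t.join();
  }
}
```

In production code you would chunk rows into a handful of ranges rather than spawn one thread per row; the point here is only that the two passes have no cross-row or cross-column dependencies.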
As discussed in the meeting, my concern was using SingleThreadedOperation for a multi-threaded execution. I will upload a patch using TBB.

@ -8631,2 +8631,4 @@
"Controls how directional the filter is. 0 means the filter is completely omnidirectional "
"while 2 means it is maximally directed along the edges of the image");
prop = RNA_def_property(srna, "fast", PROP_BOOLEAN, PROP_NONE);
I think this should be called High Precision, as it conveys the meaning better to the user.

Option will be removed.
@ -8633,0 +8634,4 @@
prop = RNA_def_property(srna, "fast", PROP_BOOLEAN, PROP_NONE);
RNA_def_property_boolean_sdna(prop, nullptr, "fast", 1);
RNA_def_property_ui_text(
prop, "Fast", "Use faster computation. Might produce artefacts for large images.");
Extra period at the end of description.
I am not entirely sold on the idea of having this as an option, let alone one disabled by default.
From my understanding, the bigger the resolution, the less accurate the result is. Testing with 4K images here, there is surely a difference compared with the ground truth, but I do not think those are deal breakers. Now, if this does become a deal breaker for bigger images, I don't think a 30x slowdown would be an acceptable trade-off for artists. So from a user perspective it does not seem to be a practical option.
For development purposes it could be interesting to have a ground-truth implementation, but there are better ways of doing so:
Thoughts?
P.S. On a functional level the fast method is so much more fun now :)
There are a number of points I would like to note:
While I initially proposed the option to resolve the SAT artifacts, it seems the offset SAT implementation is now satisfactory enough, so the option might not be strictly necessary. Either way, I think the SAT should definitely be the default if we decide to have the option.
@blender-bot build
Using TBB was not much faster than OpenMP, but multithreading in general helped in reducing the error (see updated images in the description).

As agreed, I removed the fast option. The filter still uses the naive implementation for kernel sizes < 4, which is around where SAT becomes accurate and fast enough.

@ -32,0 +35,4 @@
kuwahara_classic->set_use_sat(false);
}
else {
SummedAreaTableOperation *sat = new SummedAreaTableOperation();
Add a kuwahara_classic->set_use_sat(true); just for clarity.

@ -0,0 +64,4 @@
MemoryBuffer *image = inputs[0];
/* First pass: copy values from input to output and square values if necessary. */
threading::parallel_for(IndexRange(area.ymin, area.ymax), 1, [&](const IndexRange range_y) {
It is sufficient to have a single parallel loop over rows and a serial loop over columns; too much parallelism will hurt performance.
This copy loop can be fused with the horizontal pass.
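The fusion being suggested can be sketched like this: the copy/square step and the horizontal prefix sum happen in one sweep per row, so the input buffer is read only once. This is an illustrative stand-in, not the actual operation; the outer loop is written serially here but is the one that would be parallelized over rows:

```cpp
#include <vector>

/* Fused first pass: in a single sweep per row, optionally square the
 * input value (for the squared SAT) and accumulate the horizontal
 * prefix sum, instead of a copy pass followed by a horizontal pass. */
void fused_copy_and_horizontal_pass(const std::vector<std::vector<float>> &input,
                                    std::vector<std::vector<float>> &output,
                                    bool square)
{
  const int height = input.size(), width = input[0].size();
  /* In the real operation this outer loop is the parallel one;
   * each row is independent, so a serial inner loop suffices. */
  for (int y = 0; y < height; y++) {
    float running_sum = 0.0f;
    for (int x = 0; x < width; x++) {
      const float value = input[y][x];
      running_sum += square ? value * value : value;
      output[y][x] = running_sum;
    }
  }
}
```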
@ -0,0 +102,4 @@
threading::parallel_for(IndexRange(area.ymin, area.ymax), 1, [&](const IndexRange range_y) {
for (int64_t y = *range_y.begin(); y < *range_y.end(); y++) {
/* Track floating point error. See below. */
float4 running_compensation = {0.0f, 0.0f, 0.0f, 0.0f};
This can be more compact:

for (const int y : sub_y_range) {

and use the get_elem function.

Will do, thanks for the tip :)
Some findings from profiling:

Are there locks down the road in get_elem, perhaps, which get in the way of better parallelism?

If the patch is algorithmically correct, then I would suggest moving forward with it. That makes it easier to look deeper into performance, while bringing artists benefits early on.
@ -0,0 +130,4 @@
area.ymin = 1;
area.ymax = 3;
float4 sum = summed_area_table_sum(sat_.get(), area);
ASSERT_EQ(sum[0], 9);
Any specific reason to use ASSERT_EQ instead of EXPECT_EQ? The ASSERT will stop the test. It is typically used for cases where the rest of the test would be impossible; for example, when you expect a function to give you a pointer to an object and you check for it to be non-nullptr before looking into its properties.

No specific reason, but it doesn't make a difference here because there is a single assert per test. I can update it in a later patch for clarity.
@blender-bot build
It looks like memory access is the bottleneck. read_elem_checked is the function causing the most waits, especially in the function summed_area_table_sum().
@ -7,11 +7,13 @@
#include "COM_KuwaharaNode.h"
#include "COM_CalculateMeanOperation.h"
Unnecessary include.
@ -0,0 +13,4 @@
SummedAreaTableOperation::SummedAreaTableOperation()
{
this->add_input_socket(DataType::Color);
this->add_input_socket(DataType::Value);
What is this Value input?

This was needed to subtract the mean from the image. Not needed anymore, will remove.
@ -0,0 +79,4 @@
threading::parallel_for(IndexRange(area.xmin, area.xmax), 1, [&](const IndexRange range_x) {
for (const int x : range_x) {
for (const int y : IndexRange(area.ymin, area.ymax)) {
float4 color;
Use a temporary accumulated_color variable and avoid reading the buffer again, just like the above loop. Then, use get_elem instead of read_elem_checked.

@ -0,0 +101,4 @@
for (const int x : IndexRange(area->xmin, area->xmax)) {
float4 color;
image_reader_->read_sampled(&color.x, x, y, sampler);
Use read instead of read_sampled. The same applies to all read_sampled calls below.

@ -0,0 +113,4 @@
for (const int x : range_x) {
for (const int y : IndexRange(area->ymin, area->ymax)) {
float4 color;
output->read_elem_checked(x, y - 1, &color.x);
Same as above.
@ -0,0 +163,4 @@
float4 a, b, c, d, addend, substrahend;
buffer->read_sampled(
&a.x, corrected_upper_bound[0], corrected_upper_bound[1], PixelSampler::Nearest);
Use UNPACK2.

@ -0,0 +9,4 @@
namespace blender::compositor {
/**
* \brief base class of CalculateMean, implementing the simple CalculateMean
Update comment.
The latest update

Improve performance by 5-10% by using fewer buffer reads

seems to have broken the filter. The result is all empty, and the following tests are failing:

Nailed it down a bit more. The issue is in the change:

Note how the corrected_lower_bound[0], corrected_upper_bound[1] is replaced (after macro expansion) with corrected_lower_bound[1]. There seems to be another chunk in the patch which leads to the same issue.

On a related topic, we should not be using UNPACK2 in the new code. Instead, overload read_sampled so that it accepts a const int2 &pos and does the expansion to individual x, y there (explicitly, without the use of a macro).

Thanks @Sergey, really need more discipline to run tests for every single commit...
@blender-bot build
@blender-bot build
@OmarEmaraDev @Sergey thank you for the quick and helpful review!