Speedup classic Kuwahara filter by summed area table #111150
Reference: blender/blender#111150
Implemented SAT for CPU.
Pros:
Cons:
The following image shows the different methods divided by the naive implementation (black is perfect; color/white is deviation from perfect):
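For context, a summed area table lets any axis-aligned rectangle be summed with four lookups after a single prefix pass over the image, which is what makes the filter's cost independent of kernel size. A minimal single-channel sketch (illustrative only, not the Blender implementation; names are made up):

```cpp
#include <vector>

/* Build an inclusive summed area table: sat[y][x] = sum of image[0..y][0..x]. */
std::vector<std::vector<float>> build_sat(const std::vector<std::vector<float>> &image)
{
  const int height = image.size(), width = image[0].size();
  std::vector<std::vector<float>> sat(height, std::vector<float>(width, 0.0f));
  for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
      sat[y][x] = image[y][x];
      if (x > 0) sat[y][x] += sat[y][x - 1];
      if (y > 0) sat[y][x] += sat[y - 1][x];
      if (x > 0 && y > 0) sat[y][x] -= sat[y - 1][x - 1];
    }
  }
  return sat;
}

/* Sum over the inclusive rectangle [x0, x1] x [y0, y1] with four lookups. */
float sat_sum(const std::vector<std::vector<float>> &sat, int x0, int y0, int x1, int y1)
{
  float sum = sat[y1][x1];
  if (x0 > 0) sum -= sat[y1][x0 - 1];
  if (y0 > 0) sum -= sat[y0 - 1][x1];
  if (x0 > 0 && y0 > 0) sum += sat[y0 - 1][x0 - 1];
  return sum;
}
```

The trade-off discussed throughout this thread is that the subtraction of large running sums in `sat_sum` is where floating point error accumulates, which the naive per-pixel sum does not suffer from.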
@ -0,0 +207,4 @@
ASSERT_EQ(sum[0], 4);
}
TEST_F(SummedTableAreaSumTest, RightLine)
@OmarEmaraDev such cases (area of 1 line or column at the image border) yield wrong results using the SAT implementation ported from GPU. I don't think it makes sense to support this for the Kuwahara filter, because the filter size must be at least 4. Do you agree? If so, I will add a BLI_assert to only allow areas larger than 2x2.

I feel like this should work. Is it wrong in the GPU implementation as well?
I looked into the GPU implementation. The problem was the definition of "area" for SAT. In my implementation I assumed the last element of the area is not inclusive (the same way image[height] is out of bounds). This is just a matter of definition though, so I will keep the implementation and update the tests.
@blender-bot build
Changed title from "WIP: Speedup classic Kuwahara filter by summed area table" to "Speedup classic Kuwahara filter by summed area table".

@blender-bot build
The High Precision option should be used in the early return of the ConvertKuwaharaOperation::execute_classic method.

@ -32,0 +58,4 @@
* Note: best results are achieved using this optimization as well as the running error
* compensation in SummedAreaTableOperation.
*/
CalculateMeanOperation *mean = new CalculateMeanOperation();
I feel like we can just subtract 0.5 here and not the actual mean, since it will be faster, and I suspect it will achieve the desired functionality if not do it better. If we have an image with mostly values in the [0, 1] range and a region of very large values as highlights, the mean will be skewed to the higher end, and subtracting the mean from this higher-value region will not be beneficial anyway.

As discussed, I tried this and it didn't work. The reason is that small differences of 0.2 - 0.3 (2x - 3x the actual mean) cause very large differences in the squared SAT, making it asymmetric and therefore cancelling all benefits.
@ -172,0 +223,4 @@
int xx = x + dx;
int yy = y + dy;
if (xx >= 0 && yy >= 0 && xx < image->get_width() && yy < image->get_height()) {
Reverse condition and continue to reduce indentation.
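The suggested refactor is the usual early-continue pattern: invert the bounds check and skip the iteration, so the loop body stays flat. A self-contained sketch of the shape, with the `xx`/`yy` names mirroring the quoted hunk and a simple neighbor count standing in for the real body:

```cpp
/* Early-continue form of the quoted bounds check: skip out-of-bounds
 * offsets instead of nesting the loop body inside the condition. */
int count_in_bounds(int x, int y, int width, int height, int radius)
{
  int count = 0;
  for (int dy = -radius; dy <= radius; dy++) {
    for (int dx = -radius; dx <= radius; dx++) {
      const int xx = x + dx;
      const int yy = y + dy;
      if (xx < 0 || yy < 0 || xx >= width || yy >= height) {
        continue; /* Reversed condition: bail out early. */
      }
      count++; /* Real loop body goes here, one indentation level shallower. */
    }
  }
  return count;
}
```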
@ -0,0 +123,4 @@
/* Track floating point error. See below. */
float4 running_compensation = {0.0f, 0.0f, 0.0f, 0.0f};
for (BuffersIterator<float> it = result->iterate_with({}, *rect); !it.is_end(); ++it) {
I think we should attempt to multithread the SAT computation. Not sure if there is anything stopping us from doing that, but a two pass prefix sum should be easy to implement and efficient to parallelize on the CPU.
Each of the prefix sums can simply be a parallel loop over rows/columns.
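The two-pass scheme being proposed can be sketched as follows: horizontal prefix sums over rows, then vertical prefix sums over columns, where every row (resp. column) is independent and can run on its own thread. This uses plain `std::thread` purely for illustration; the actual patch discussion settled on Blender's own threading utilities:

```cpp
#include <thread>
#include <vector>

/* Two-pass SAT build, in place. Each pass parallelizes trivially
 * because rows (resp. columns) do not depend on each other. */
void build_sat_two_pass(std::vector<std::vector<float>> &image)
{
  const int height = image.size();
  const int width = image[0].size();

  /* Pass 1: horizontal prefix sum, one task per row. */
  {
    std::vector<std::thread> workers;
    for (int y = 0; y < height; y++) {
      workers.emplace_back([&image, y, width]() {
        for (int x = 1; x < width; x++) {
          image[y][x] += image[y][x - 1];
        }
      });
    }
    for (std::thread &t : workers) t.join();
  }

  /* Pass 2: vertical prefix sum, one task per column. */
  {
    std::vector<std::thread> workers;
    for (int x = 0; x < width; x++) {
      workers.emplace_back([&image, x, height]() {
        for (int y = 1; y < height; y++) {
          image[y][x] += image[y - 1][x];
        }
      });
    }
    for (std::thread &t : workers) t.join();
  }
}
```

In production code you would chunk rows into a handful of ranges rather than spawn one thread per row; the point here is only that the two passes have no cross-row or cross-column dependencies.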
As discussed in the meeting, my concern was using SingleThreadedOperation for a multi-threaded execution. I will upload a patch using TBB.

@ -8631,2 +8631,4 @@
"Controls how directional the filter is. 0 means the filter is completely omnidirectional "
"while 2 means it is maximally directed along the edges of the image");
prop = RNA_def_property(srna, "fast", PROP_BOOLEAN, PROP_NONE);
I think this should be called High Precision, as it conveys the meaning better to the user.

Option will be removed.
@ -8633,0 +8634,4 @@
prop = RNA_def_property(srna, "fast", PROP_BOOLEAN, PROP_NONE);
RNA_def_property_boolean_sdna(prop, nullptr, "fast", 1);
RNA_def_property_ui_text(
prop, "Fast", "Use faster computation. Might produce artefacts for large images.");
Extra period at the end of description.
I am not entirely sold on the idea of having this as an option, let alone one disabled by default.
From my understanding, the bigger the resolution, the less accurate the result is. Testing with 4K images here, there is surely a difference compared with the ground truth, but I do not think those are deal breakers. Now, if this does become a deal breaker for bigger images, I don't think a 30x slowdown would be an acceptable trade-off for artists. So from a user perspective it does not seem to be a practical option.
For development purposes it could be interesting to have a ground-truth implementation, but there are better ways of doing so:
Thoughts?
P.S. On a functional level the fast method is so much more fun now :)
There are a number of points I would like to note:
While I initially proposed the option to resolve the SAT artifacts, it seems the offset SAT implementation is now satisfactory enough, so the option might not be strictly necessary. Either way, I think the SAT should definitely be the default if we decide to have the option.
@blender-bot build
Using TBB was not much faster than OpenMP, but multithreading in general helped in reducing the error (see updated images in the description).

As agreed, I removed the fast option. The filter still uses the naive implementation for kernel sizes < 4, which is around where SAT becomes accurate and fast enough.

@ -32,0 +35,4 @@
kuwahara_classic->set_use_sat(false);
}
else {
SummedAreaTableOperation *sat = new SummedAreaTableOperation();
Add a kuwahara_classic->set_use_sat(true); just for clarity.

@ -0,0 +64,4 @@
MemoryBuffer *image = inputs[0];
/* First pass: copy values from input to output and square values if necessary. */
threading::parallel_for(IndexRange(area.ymin, area.ymax), 1, [&](const IndexRange range_y) {
It is sufficient to have a single parallel loop over rows and a serial loop over columns; too much parallelism will hurt performance.
This copy loop can be fused with the horizontal pass.
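The fusion being suggested can be sketched like this: the copy/square step and the horizontal prefix sum happen in one sweep per row, so the input buffer is read only once. This is an illustrative stand-in, not the actual operation; the outer loop is written serially here but is the one that would be parallelized over rows:

```cpp
#include <vector>

/* Fused first pass: in a single sweep per row, optionally square the
 * input value (for the squared SAT) and accumulate the horizontal
 * prefix sum, instead of a copy pass followed by a horizontal pass. */
void fused_copy_and_horizontal_pass(const std::vector<std::vector<float>> &input,
                                    std::vector<std::vector<float>> &output,
                                    bool square)
{
  const int height = input.size(), width = input[0].size();
  /* In the real operation this outer loop is the parallel one;
   * each row is independent, so a serial inner loop suffices. */
  for (int y = 0; y < height; y++) {
    float running_sum = 0.0f;
    for (int x = 0; x < width; x++) {
      const float value = input[y][x];
      running_sum += square ? value * value : value;
      output[y][x] = running_sum;
    }
  }
}
```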
@ -0,0 +102,4 @@
threading::parallel_for(IndexRange(area.ymin, area.ymax), 1, [&](const IndexRange range_y) {
for (int64_t y = *range_y.begin(); y < *range_y.end(); y++) {
/* Track floating point error. See below. */
float4 running_compensation = {0.0f, 0.0f, 0.0f, 0.0f};
This can be more compact:

for (const int y : sub_y_range) {

and use the get_elem function.

Will do, thanks for the tip :)
Some findings from profiling:

Are there locks down the road in get_elem, perhaps, which get in the way of better parallelism?

If the patch is algorithmically correct, then I would suggest moving forward with it. That makes it easier to look deeper into performance, while bringing artists benefits early on.
@ -0,0 +130,4 @@
area.ymin = 1;
area.ymax = 3;
float4 sum = summed_area_table_sum(sat_.get(), area);
ASSERT_EQ(sum[0], 9);
Any specific reason to use ASSERT_EQ instead of EXPECT_EQ? The ASSERT will stop the test. It is typically used for cases where the rest of the test would be impossible; for example, when you expect a function to give you a pointer to an object and you check for it to be non-nullptr before looking into its properties.

No specific reason, but it doesn't make a difference here because there is a single assert per test. I can update it in a later patch for clarity.
@blender-bot build
It looks like memory access is the bottleneck. read_elem_checked is the function causing the most waits, especially in the function summed_area_table_sum().
@ -7,11 +7,13 @@
#include "COM_KuwaharaNode.h"
#include "COM_CalculateMeanOperation.h"
Unnecessary include.
@ -0,0 +13,4 @@
SummedAreaTableOperation::SummedAreaTableOperation()
{
this->add_input_socket(DataType::Color);
this->add_input_socket(DataType::Value);
What is this Value input?

This was needed to subtract the mean from the image. Not needed anymore, will remove.
@ -0,0 +79,4 @@
threading::parallel_for(IndexRange(area.xmin, area.xmax), 1, [&](const IndexRange range_x) {
for (const int x : range_x) {
for (const int y : IndexRange(area.ymin, area.ymax)) {
float4 color;
Use a temporary accumulated_color variable and avoid reading the buffer again, just like the above loop. Then, use get_elem instead of read_elem_checked.

@ -0,0 +101,4 @@
for (const int x : IndexRange(area->xmin, area->xmax)) {
float4 color;
image_reader_->read_sampled(&color.x, x, y, sampler);
Use read instead of read_sampled. The same applies to all read_sampled calls below.

@ -0,0 +113,4 @@
for (const int x : range_x) {
for (const int y : IndexRange(area->ymin, area->ymax)) {
float4 color;
output->read_elem_checked(x, y - 1, &color.x);
Same as above.
@ -0,0 +163,4 @@
float4 a, b, c, d, addend, substrahend;
buffer->read_sampled(
&a.x, corrected_upper_bound[0], corrected_upper_bound[1], PixelSampler::Nearest);
Use UNPACK2.

@ -0,0 +9,4 @@
namespace blender::compositor {
/**
* \brief base class of CalculateMean, implementing the simple CalculateMean
Update comment.
The latest update

Improve performance by 5-10% by using fewer buffer reads

seems to have broken the filter. The result is all empty, and the following tests are failing:

Nailed it down a bit more. The issue is in the change:

Note how the corrected_lower_bound[0], corrected_upper_bound[1] is replaced (after macro expansion) with corrected_lower_bound[1]. There seems to be another chunk in the patch which leads to the same issue.

On a related topic, we should not be using UNPACK2 in the new code. Instead, overload read_sampled so that it accepts a const int2 &pos and does the expansion to individual x, y there (explicitly, without the use of a macro).

Thanks @Sergey, really need more discipline to run tests for every single commit...
@blender-bot build
@blender-bot build
@OmarEmaraDev @Sergey thank you for the quick and helpful review!