VSE: reduce effects code duplication, making gaussian blur faster in the process #116089

Aras Pranckevicius · 2023-12-12T10:49:20+01:00


Now that the code is in C++, a lot of the duplication between the "byte" and "float" effect code paths can be removed (more easily than in the C days). So I did that, removing about 400 lines of code.

In the process I accidentally made Gaussian Blur faster: while reducing the amount of code I noticed it was doing some things sub-optimally (recalculating the kernel tables for each job, etc.). Applying a 100x100 Gaussian blur to a 4K UHD image strip on a Ryzen 5950X went from 630ms to 450ms.


Aras Pranckevicius added 2 commits 2023-12-12 10:49:29 +01:00

82182bce3d VSE: reduce code duplication in some effects

Logic between byte vs float effects was the same in many places, so now
that the source is in C++, we can reduce that duplication. Effects:
alpha over, alpha under, gamma cross, wipe, apply blend function util.
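To illustrate the kind of deduplication this commit message describes, here is a minimal sketch (not the actual Blender code) in which one function template serves both byte and float pixels. The format difference is hidden behind small overloaded helpers. `Color4`, `load_pixel`, `store_pixel` and `cross_fade` are names made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 4-channel pixel, used only in this sketch.
struct Color4 {
  float r, g, b, a;
};

// Load helpers: one overload per storage type, both produce floats in 0..1.
inline Color4 load_pixel(const std::uint8_t *p)
{
  return Color4{p[0] / 255.0f, p[1] / 255.0f, p[2] / 255.0f, p[3] / 255.0f};
}
inline Color4 load_pixel(const float *p)
{
  return Color4{p[0], p[1], p[2], p[3]};
}

// Store helpers: convert back to the storage type (bytes are clamped and rounded).
inline void store_pixel(const Color4 &c, std::uint8_t *p)
{
  const float v[4] = {c.r, c.g, c.b, c.a};
  for (int i = 0; i < 4; i++) {
    p[i] = std::uint8_t(std::clamp(v[i], 0.0f, 1.0f) * 255.0f + 0.5f);
  }
}
inline void store_pixel(const Color4 &c, float *p)
{
  p[0] = c.r;
  p[1] = c.g;
  p[2] = c.b;
  p[3] = c.a;
}

// A single implementation of a linear cross-fade, shared by both pixel formats.
template<typename T>
void cross_fade(float fac, std::size_t pixel_count, const T *src1, const T *src2, T *dst)
{
  assert(fac >= 0.0f && fac <= 1.0f);
  const float inv = 1.0f - fac;
  for (std::size_t i = 0; i < pixel_count; i++) {
    const Color4 a = load_pixel(src1 + i * 4);
    const Color4 b = load_pixel(src2 + i * 4);
    const Color4 out{a.r * inv + b.r * fac,
                     a.g * inv + b.g * fac,
                     a.b * inv + b.b * fac,
                     a.a * inv + b.a * fac};
    store_pixel(out, dst + i * 4);
  }
}
```

Calling `cross_fade` with `float` buffers or with `std::uint8_t` buffers instantiates the same loop for each type, so the effect logic is written once.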


bac0481697 VSE: make Gaussian Blur effect both faster and with less code

Similar to previous commit, there was a lot of code duplication
between "byte" and "float" gaussian blur code variants. Also:

- Use direct parallel_for threading instead of very roundabout
  way that was coming from C times. Way less code, and allows loading
  CPU cores better (parallel_for seems better than task pool, also
  gaussian blur is relatively expensive so use smaller grain sizes).
- Calculate gaussian kernel tables once per effect, instead of once
  per CPU job.
- Do "sample from this to that pixel" including boundary conditions
  calculation explicitly, instead of checking boundary condition
  at each iteration inside the inner loop.

Applying gaussian blur effect of 100x100 size on 4K UHD sequencer
strip input, on Windows / Ryzen 5950X: 630ms -> 450ms

And with 220 fewer lines of code :)
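Two of the points in this commit message can be sketched in a few lines: the kernel table is built once and shared, and the valid sample range is worked out before the inner loop instead of being tested on every iteration. This is a simplified single-channel, horizontal-only illustration, not the code from the commit. The function names, the choice of sigma, and the renormalization at the image edges are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Build a normalized 1D Gaussian kernel with 2 * radius + 1 taps.
// The caller builds it once and shares it across all rows (or jobs).
std::vector<float> make_gauss_kernel(int radius)
{
  assert(radius >= 0);
  // Assumed falloff for this sketch; not necessarily what Blender uses.
  const float sigma = std::max(radius / 3.0f, 0.001f);
  std::vector<float> kernel(2 * radius + 1);
  float sum = 0.0f;
  for (int i = -radius; i <= radius; i++) {
    const float w = std::exp(-(i * i) / (2.0f * sigma * sigma));
    kernel[i + radius] = w;
    sum += w;
  }
  for (float &w : kernel) {
    w /= sum;
  }
  return kernel;
}

// Blur one row of single-channel values horizontally.
void blur_row(const std::vector<float> &kernel, const float *src, float *dst, int width)
{
  const int radius = int(kernel.size() / 2);
  for (int x = 0; x < width; x++) {
    // Compute the "from this to that pixel" range once per output pixel,
    // so the inner loop needs no boundary checks.
    const int x_min = std::max(x - radius, 0);
    const int x_max = std::min(x + radius, width - 1);
    float accum = 0.0f;
    float weight = 0.0f;
    for (int sx = x_min; sx <= x_max; sx++) {
      const float w = kernel[sx - x + radius];
      accum += src[sx] * w;
      weight += w;
    }
    // Renormalize by the weight actually used near the edges (a choice of this sketch).
    dst[x] = accum / weight;
  }
}
```

Each row only reads the shared kernel and its own input, so ranges of rows can be handed to a parallel loop independently, which is the role `parallel_for` plays in the commit.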

Aras Pranckevicius commented

2023-12-12 10:51:50 +01:00

@blender-bot build

Aras Pranckevicius changed title from ~~WIP: VSE: reduce effects code duplication, making gaussian blur faster in the process~~ to VSE: reduce effects code duplication, making gaussian blur faster in the process

2023-12-12 11:13:01 +01:00

Aras Pranckevicius added this to the Video Sequencer project 2023-12-12 11:13:07 +01:00

Aras Pranckevicius requested review from Richard Antalik 2023-12-12 11:13:21 +01:00

Richard Antalik reviewed 2023-12-12 21:55:53 +01:00

Richard Antalik left a comment

This is a really nice change, I really appreciate it! I would like to discuss the inline stuff; perhaps it would be a good idea to ask on chat to better understand the intentions behind it.

source/blender/sequencer/intern/effects.cc

						
@ -946,4 +863,0 @@
using IMB_blend_func_byte = void (*)(uchar *dst, const uchar *src1, const uchar *src2);
using IMB_blend_func_float = void (*)(float *dst, const float *src1, const float *src2);
BLI_INLINE void apply_blend_function_byte(float fac,

Richard Antalik commented

2023-12-12 21:28:35 +01:00

From what I read, BLI_INLINE mostly affects how the code compiles, not necessarily how it performs. I have seen a benchmark linked from a Stack Overflow answer, but I am not sure whether it measures code performance or compiler performance. https://indico.cern.ch/event/386232/sessions/159923/attachments/771039/1057534/always_inline_performance.pdf

The benchmark also suggests an effect on binary size, but I am guessing it won't be significant here.

So the question is, why did this code use this attribute/qualifier and whether it really needs to be there. From what I have read, I would guess it does not really need to be there, but would like to hear your opinion on this.


Aras Pranckevicius commented

2023-12-13 05:58:58 +01:00

I think in this particular case it does not matter at all.

"inline" is a hint to the compiler, saying something like "whenever anything calls this function, please try to literally paste the whole function into the call site, instead of actually calling the function". It's only a hint, but generally it makes sense for functions that are extremely short: think one or two lines of code with a handful of operations. This one is not; it's two nested loops, a bunch of operations inside the loop, and a call into yet another blend function. The overhead of "call the function" compared to the amount of work done inside the function (working on thousands or tens of thousands of pixels!) is absolutely tiny.

So I have no idea why anyone would mark it as inline. My only guess is that at some point the function did not work on many pixels but on one pixel at a time, and it was marked inline; then it got changed to have these two loops, and the inline stayed by accident. But this theory is hard to prove, since the code seems to have been (re)added by Ton 15 years ago with "blender 2.5 is back with sequencer code!", so presumably the code existed even before that, just without a trace in git.
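To make the distinction concrete, here is a small sketch with made-up names (`mix_value`, `fade_buffer`); it is not the function under review. The first function is the kind of tiny helper where an inline hint is plausible, the second is the "loop over a whole buffer" kind where the call overhead is paid once and is negligible.

```cpp
#include <cassert>
#include <cstddef>

// Tiny helper: a couple of arithmetic operations, so the cost of an
// actual call would be a noticeable share of the total. A reasonable
// candidate for the `inline` hint.
inline float mix_value(float a, float b, float t)
{
  return a + (b - a) * t;
}

// Whole-buffer function: called once, then does work for every element.
// The single call is negligible next to the loop, so marking this
// function `inline` would not be expected to change performance.
void fade_buffer(const float *src1, const float *src2, float *dst, std::size_t count, float t)
{
  assert(src1 != nullptr && src2 != nullptr && dst != nullptr);
  for (std::size_t i = 0; i < count; i++) {
    dst[i] = mix_value(src1[i], src2[i], t);
  }
}
```

Whether the compiler actually inlines either function is still its own decision; the keyword only expresses a preference.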


source/blender/sequencer/intern/effects.cc

						
@ -373,0 +321,4 @@
  for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
      float4 col2 = load_premul_pixel(src2);
      if (col2.w <= 0.0f) {

Richard Antalik commented

2023-12-12 21:38:31 +01:00

Later in the code you access float4 values by array index, here by member. Personally I like the index more, since pixel[3] is more familiar than pixel.w

It's a matter of getting used to it I guess, so it's a nitpick really.


Aras Pranckevicius commented

2023-12-13 06:00:39 +01:00

I don't have an opinion either way. .w for me personally is a bit easier to read than [3], but that's a taste/familiarity thing. And yeah, I could have used .x ... .w below too. Do you want me to change it one way or the other?
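For readers unfamiliar with the two spellings, a minimal stand-in type shows that they refer to the same component, with alpha at index 3 since indexing starts at 0. `Pixel4` is a hypothetical struct written for this example, not Blender's `float4`.

```cpp
#include <cassert>

// Hypothetical 4-component pixel that supports both access styles.
struct Pixel4 {
  float x, y, z, w;

  // Index access maps 0..3 onto x, y, z, w.
  float &operator[](int i)
  {
    assert(i >= 0 && i < 4);
    switch (i) {
      case 0:
        return x;
      case 1:
        return y;
      case 2:
        return z;
      default:
        return w;
    }
  }
};

// Both functions test the same component (alpha).
inline bool is_transparent_by_member(Pixel4 p)
{
  return p.w <= 0.0f;
}
inline bool is_transparent_by_index(Pixel4 p)
{
  return p[3] <= 0.0f;
}
```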


iss marked this conversation as resolved

Richard Antalik approved these changes 2023-12-14 15:25:32 +01:00

Richard Antalik left a comment

Pressed the wrong button on the x/y/z/w vs. index discussion; I think it should stay as it is. It would perhaps be nicer to have rgba members for pixels, but that's just cosmetics really.

Aras Pranckevicius merged commit 5cac8e2bb4 into main

2023-12-14 17:31:16 +01:00

Aras Pranckevicius referenced this issue from a commit

2023-12-14 17:31:17 +01:00

VSE: reduce effects code duplication, making gaussian blur faster in the process

Aras Pranckevicius deleted branch vse-fx-cleanup

2023-12-14 17:31:19 +01:00

Brian Savery (AMD) referenced this issue from a commit

2023-12-28 20:07:45 +01:00

VSE: reduce effects code duplication, making gaussian blur faster in the process
