Similar to the previous commit, reduce the large amount of code
duplication between the "byte" and "float" gaussian blur variants. Also:
- Use direct parallel_for threading instead of the very roundabout
way inherited from the C days. Way less code, and it loads CPU cores
better (parallel_for seems to work better than the task pool; also,
gaussian blur is relatively expensive, so use smaller grain sizes).
- Calculate gaussian kernel tables once per effect, instead of once
per CPU job.
- Calculate the "sample from this pixel to that pixel" range,
including boundary conditions, explicitly up front, instead of
checking the boundary condition at each iteration of the inner loop.
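
The last two points can be illustrated with a minimal sketch. This is
not the actual sequencer code: make_gaussian_table and blur_row are
hypothetical names, the sigma choice is an assumption, and only a
single-channel horizontal pass is shown. The weight table is built once
and shared by all rows, and the sample range is clamped to the image
before the inner loop runs, so the loop body has no bounds checks.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical helper: normalized gaussian weights for offsets
// -radius..+radius. Meant to be built once per effect and then shared
// by all threads, instead of being rebuilt by every job.
std::vector<float> make_gaussian_table(int radius)
{
  std::vector<float> table(2 * radius + 1);
  // Assumption: sigma derived from the radius; real code may differ.
  const float sigma = std::max(radius / 3.0f, 0.5f);
  float sum = 0.0f;
  for (int i = -radius; i <= radius; i++) {
    const float w = std::exp(-(i * i) / (2.0f * sigma * sigma));
    table[i + radius] = w;
    sum += w;
  }
  for (float &w : table) {
    w /= sum;
  }
  return table;
}

// Hypothetical single-channel horizontal blur of one row.
void blur_row(const float *src, float *dst, int width, int radius,
              const std::vector<float> &table)
{
  for (int x = 0; x < width; x++) {
    // Clamp the sample range to the image once, up front.
    const int x_start = std::max(x - radius, 0);
    const int x_end = std::min(x + radius, width - 1);
    // First weight that corresponds to x_start.
    const float *weight = table.data() + (x_start - x + radius);
    float accum = 0.0f;
    float accum_weight = 0.0f;
    // Inner loop without any boundary checks.
    for (int sx = x_start; sx <= x_end; sx++, weight++) {
      accum += src[sx] * *weight;
      accum_weight += *weight;
    }
    // Renormalize, since weights are cut off near the image edges.
    dst[x] = accum / accum_weight;
  }
}
```

In a threaded version, each job would call blur_row for its own range
of rows while reading the same table.
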
Applying a gaussian blur effect of 100x100 size to a 4K UHD sequencer
strip input, on Windows / Ryzen 5950X: 630ms -> 450ms
And with 220 fewer lines of code :)
The logic of the byte and float effect variants was the same in many
places, so now that the source is in C++, we can reduce that
duplication. Affected effects: alpha over, alpha under, gamma cross,
wipe, and the apply blend function utility.
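
As a rough illustration of this kind of deduplication (a sketch, not
the actual effect code): one template can serve both pixel types when
per-type load/store helpers convert to and from float. The names
load_channel, store_channel and alpha_over are hypothetical, and the
blend is simplified by treating both byte and float input as
premultiplied RGBA. Requires C++17 for std::clamp.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical helpers: read a channel as float in the 0..1 range.
inline float load_channel(uint8_t v) { return v / 255.0f; }
inline float load_channel(float v) { return v; }

// Hypothetical helpers: write a float result back to the pixel type.
inline void store_channel(float v, uint8_t &dst)
{
  dst = static_cast<uint8_t>(std::lround(std::clamp(v, 0.0f, 1.0f) * 255.0f));
}
inline void store_channel(float v, float &dst) { dst = v; }

// Simplified "over" blend of RGBA pixels, written once for both
// uint8_t and float buffers. Assumes premultiplied alpha for both.
template<typename T>
void alpha_over(const T *over, const T *under, T *dst, int pixel_count, float fac)
{
  for (int i = 0; i < pixel_count * 4; i += 4) {
    const float over_alpha = load_channel(over[i + 3]);
    // How much of the underlying pixel remains visible.
    const float mfac = 1.0f - fac * over_alpha;
    for (int c = 0; c < 4; c++) {
      const float value = fac * load_channel(over[i + c]) +
                          mfac * load_channel(under[i + c]);
      store_channel(value, dst[i + c]);
    }
  }
}
```

The compiler instantiates alpha_over<uint8_t> and alpha_over<float>
from the same source, so the blend math exists in one place only.
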