Cycles optimization: move srgb/alpha conversion out of cycles kernel #38034

Closed
opened 2014-01-02 21:29:33 +01:00 by Sv. Lockal · 27 comments
Member

Looking at the top 5 functions in the profiler for [pavillon_barcelone_v1.2](http://blenderartists.org/forum/showthread.php?288611) (Ubuntu 14.04, CPU Intel Core i7 4771, compiled with `gcc --march=native`):

| | Function Name | CPU Time by Utilization | Instructions Retired | CPI Rate | CPU Frequency Ratio |
| ---- | ---- | ---- | ---- | ---- | ---- |
| | `ccl::bvh_intersect_instancing` | 9987.16s | 26759572000000 | **1.38332** | 1.05899 |
| > | `__ieee754_powf` | 1933.63s | 6645649500000 | **1.0751** | 1.05571 |
| > | `ccl::svm_image_texture` | 1078.16s | 1737641500000 | **2.29578** | 1.05716 |
| | `ccl::kernel_path_integrate` | 824.122s | 2536936500000 | 1.20303 | 1.0581 |
| | `ccl::shader_setup_from_ray` | 809.146s | 1627696000000 | 1.84461 | 1.06019 |

As the table shows, `powf` calls are too expensive even on Haswell.

Each time the Cycles kernel fetches an interpolated color for pixel (x, y), it applies alpha (if the `use_alpha` flag from the SVM stack is set) and converts the result from sRGB to linear (if the `srgb` flag from the SVM stack is set); see `svm_image_texture`. The kernel therefore makes billions of `color_srgb_to_scene_linear` calls, each of which uses `powf`. As far as I can see, both the `use_alpha` and `srgb` flags appear to be constants: only `EnvironmentTextureNode::compile` and `ImageTextureNode::compile` set them, and only `svm_node_tex_environment`, `svm_node_tex_image_box` and `svm_node_tex_image` decode them from the SVM stack.
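For reference, the per-channel conversion under discussion is the standard piecewise sRGB decoding curve. The sketch below is a minimal scalar version, not the actual Cycles code; the name `srgb_to_linear` and the clamping of negative inputs to zero are assumptions made for illustration.

```cpp
#include <cmath>

// Standard sRGB decoding for one channel (input nominally in [0, 1]).
// The pow call in the second branch is the expensive part that shows up
// in the profile as powf.
inline float srgb_to_linear(float c)
{
    if (c < 0.04045f) {
        // Linear segment near black; negative inputs are clamped (assumption).
        return (c < 0.0f) ? 0.0f : c / 12.92f;
    }
    return std::pow((c + 0.055f) / 1.055f, 2.4f);
}
```

With bilinear filtering this function runs for every colour channel of every texture sample, which is why its cost adds up so quickly.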

If Cycles internals work only in linear space, can we convert images to linear space before starting the raytracer? This could give a noticeable boost for textured objects.

A few notes:

  1. In theory, interpolation between pixels gives different results in linear space (right now Cycles interpolates in sRGB space). The difference is tiny and only noticeable for extremely low-resolution textures.
  2. `interpolate(premultiply(image), x, y) ≡ premultiply(interpolate(image, x, y))`, AFAIK.
  3. If the user places the same image in the node tree more than once with different settings (e.g. Color and Non-Color data), then a copy of the image should be created.
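Note 2 is easy to check numerically. The sketch below uses a hypothetical one-channel `Texel` type and made-up helper names, and only covers linear interpolation between two texels. It suggests that the two orders agree when neighbouring texels share the same alpha but can differ when alpha varies, so the identity may need that caveat.

```cpp
// One colour channel plus alpha (hypothetical type for illustration).
struct Texel { float c; float a; };

inline Texel premultiply(Texel t) { return {t.c * t.a, t.a}; }

// Component-wise linear interpolation between two texels.
inline Texel lerp(Texel p, Texel q, float t)
{
    return {p.c + (q.c - p.c) * t, p.a + (q.a - p.a) * t};
}

// Colour obtained by premultiplying first, then interpolating.
inline float premul_then_lerp(Texel p, Texel q, float t)
{
    return lerp(premultiply(p), premultiply(q), t).c;
}

// Colour obtained by interpolating first, then premultiplying.
inline float lerp_then_premul(Texel p, Texel q, float t)
{
    return premultiply(lerp(p, q, t)).c;
}
```

For an opaque texel with colour 1 next to a fully transparent texel with colour 0, the halfway sample is 0.5 with the first order and 0.25 with the second. With equal alpha on both texels the two results coincide.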

No patch yet, waiting for Brecht's comment.

Author
Member

Changed status to: 'Open'

Sv. Lockal self-assigned this 2014-01-02 21:29:33 +01:00
Author
Member

Added subscriber: @Lockal

Author
Member

Added subscribers: @ThomasDinges, @brecht, @MartijnBerger


It's impossible to store linear colors in 8 bits without artifacts. Storing it in floats or half-floats would be possible but takes more memory and image textures are already the biggest memory user in many scenes. Interpolation in linear space would in fact be more accurate so that's no problem.

It would be possible to use a lookup table for the values you read from the texture; that's 12 table lookups. That may be faster, though I guess it depends a bit on the scene: such a table might easily stay in the cache on simple scenes, but not always on more complex ones.
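A minimal sketch of the lookup-table idea for 8-bit textures, assuming the standard sRGB curve. The names `SrgbLut` and `lookup` are hypothetical and this is not code from Cycles. The 12 lookups mentioned above presumably correspond to three colour channels for each of the four texels used by bilinear interpolation. A 256-entry float table occupies 1 KB.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

// 256-entry table mapping an 8-bit sRGB value to a linear float.
struct SrgbLut {
    std::array<float, 256> table;

    SrgbLut()
    {
        for (int i = 0; i < 256; ++i) {
            const float c = static_cast<float>(i) / 255.0f;
            table[i] = (c < 0.04045f) ? c / 12.92f
                                      : std::pow((c + 0.055f) / 1.055f, 2.4f);
        }
    }

    // One table read per channel instead of one powf call.
    float lookup(std::uint8_t v) const { return table[v]; }
};
```

Because the table is applied per texel before filtering, interpolation would then happen in linear space, which the comment above describes as the more accurate option.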

Member

@Lockal, how good or bad is `powf`, and how much error could we accept in the desired range? How does this translate to possible faster variants of `powf`?

I know that some implementations of functions like this tend to be slower than you would want or expect, for accuracy and/or legacy reasons.

But is there any speed to be gained from using a `powf` that is just good enough but faster?

Author
Member

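Test patch for a `powf` replacement. It uses a speculative initial guess based on the [float representation](https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Approximations_that_depend_on_IEEE_representation) and improves the result with three iterations of the Newton-Raphson method. The patch is uncommented and can be improved with blendv (SSE4) and FMA intrinsics. It gives a 7% speedup on an i7-4771 (Haswell) for pavillon_barcelone_v1.2.blend, and 4% on a simple plane with a texture.

[srgb2linear.patch](https://archive.blender.org/developer/F62265/srgb2linear.patch)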


Nice patch.
On Ivy Bridge I get 30% speedup in images.blend (from test suite) with 100 Samples. (20.54s >> 15.52s).

Author
Member

[pow_precision_test.cpp](https://archive.blender.org/developer/F62310/pow_precision_test.cpp) was used to test the precision and robustness of the optimized pow. The optimized function gives better precision than the original `powf` from glibc/eglibc by approximately one decimal digit. However, the optimized pow is less robust: it works only for positive numbers in the range 1e-10 to 1e+10 (which should be enough for the sRGB-to-linear conversion). The original `powf` from glibc works for numbers up to 10e16.
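The kind of measurement described above can be sketched as below. This is not pow_precision_test.cpp; it is a hypothetical minimal harness that sweeps a range geometrically and reports the maximum and mean relative error of a candidate against double-precision `std::pow`. That the attached test measures relative error is an assumption.

```cpp
#include <cmath>

struct ErrorStats { double max_rel; double avg_rel; };

// Compares candidate(x) with x^2.4 for `steps` samples spread
// geometrically over [lo, hi]; lo and hi must be positive, steps >= 2.
template<typename F>
ErrorStats measure_pow24_error(F candidate, double lo, double hi, int steps)
{
    ErrorStats s = {0.0, 0.0};
    const double ratio = std::pow(hi / lo, 1.0 / (steps - 1));
    double x = lo;
    for (int i = 0; i < steps; ++i, x *= ratio) {
        const float xf = static_cast<float>(x);
        const double ref = std::pow(static_cast<double>(xf), 2.4);
        const double rel = (candidate(xf) - ref) / ref;  // signed relative error
        s.max_rel = std::fmax(s.max_rel, std::fabs(rel));
        s.avg_rel += rel / steps;
    }
    return s;
}
```

A call such as `measure_pow24_error([](float x) { return std::pow(x, 2.4f); }, 1e-3, 1.0, 1000)` gives a baseline for the library function, against which any faster replacement can be compared over the same range.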


Tested with pavillon_barcelone_v1.2, scene "CPU Benchmark"

Ivy Bridge quad core (3.4 GHz)
Ubuntu Linux 12.10, x64
gcc 4.7.2

Vanilla master: 8:40 min
With patch: 8:00 min

So I can confirm your 7% here @Lockal, nice work!


Nice work!

* Could you make a `color_srgb_to_scene_linear` that takes a `float4`, so that `svm_image.h` just calls this function and the rest is hidden in `util_color.h`?
* This code assumes that `pow` with constant arguments will be constant folded. Can we trust Visual Studio 2008 to do this? You could make that value a template parameter to be sure.
* We don't currently have unit tests. If you want, you could create a test directory with a C++ file that includes `util_color.h`, but it's up to you whether to do this.
Author
Member
  1. We already have `color_srgb_to_scene_linear(float3 c)`, but why does this function sit inside `#ifndef __KERNEL_OPENCL__`? Is something wrong with float vectors in OpenCL? Also note that `svm_image_texture` is third in the profiler list: there are obvious vectorization candidates there (the alpha multiplication and `_mm_min_ps`). For now I just want to see the result of the `pow` changes in this patch.
  2. Good idea. C++ templates do not support float template parameters, so I'll fold the float into a hex constant and add a comment.
  3. It's OK as long as we can attach files here. It would be better to make the code less cryptic by moving the common SSE blocks into `util_simd.h` (I'll move `blend(mask, a, b)` for now).
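Point 2 can be illustrated with a small sketch. Before C++20 a `float` cannot be a template argument, so the value is passed as the 32-bit integer holding its IEEE 754 bit pattern and converted back inside the function. The helper name `float_from_bits` is hypothetical, and 32-bit IEEE 754 floats are assumed.

```cpp
#include <cstdint>
#include <cstring>

// Returns the float whose IEEE 754 bit pattern is Bits.
// Example: 0x3F4CCCCD is the pattern of 0.8f (4/5 rounded to float).
template<std::uint32_t Bits>
inline float float_from_bits()
{
    const std::uint32_t bits = Bits;
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

Since the constant is part of the template instantiation, the value no longer depends on the compiler folding a `pow` call at compile time; a comment next to each hex literal keeps the code readable.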
Member

Sandy Bridge hardware gives me about 7% on Barcelona and some other archviz scenes; higher resolutions seem to give more speedup, and more texture-heavy scenes also gain more.

Barcelona gives a 7.01% improvement, comparing 5 runs with the patch against 5 runs without.

Author
Member

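[srgb2linear_v2.patch](https://archive.blender.org/developer/F62390/srgb2linear_v2.patch)

New version of this patch: added comments, moved `blend()` to `util_simd.h` (with SSE4.1 it needs 2 fewer instructions), and replaced the `exp2(... * pow(...))` expressions with precalculated constants.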

Member

[blender trunk 3 versions](http://martijnberger.nl/file/win64-vc12_Lockal.7z)

I compiled trunk, trunk + your patch, and trunk + patch + SSE4.1 kernel into one 7z archive.

I'll test tomorrow.

Member

@Lockal

win64 release mode:
Optimized pow:
Domain from 1.38863e-014
error max = 0.885559 avg = -0.453968 |avg| = 0.464655 to 9.10054e-010
error max = 6.11255e-007 avg = 5.2206e-008 |avg| = 1.0857e-007 to 5.96413e-005
error max = 6.11255e-007 avg = 5.22211e-008 |avg| = 1.08562e-007 to 3.90865
error max = 6.11255e-007 avg = 5.22491e-008 |avg| = 1.08765e-007 to 256157
error max = 6.11255e-007 avg = 5.20589e-008 |avg| = 1.08351e-007 to 4.29497e+009
Classic powf:
Domain from 2.0467e-019
error max = 0.333336 avg = -0.00114124 |avg| = 0.0083194 to 1.34133e-014
error max = 3.09128e-006 avg = -2.51558e-006 |avg| = 2.51558e-006 to 8.79053e-010
error max = 2.04248e-006 avg = -1.45791e-006 |avg| = 1.45791e-006 to 5.76096e-005
error max = 9.82767e-007 avg = -4.00258e-007 |avg| = 4.16499e-007 to 3.7755
error max = 1.21638e-006 avg = 6.57398e-007 |avg| = 6.57398e-007 to 247431
error max = 2.29059e-006 avg = 1.71506e-006 |avg| = 1.71506e-006 to 1.62157e+010
error max = 3.33717e-006 avg = 2.77272e-006 |avg| = 2.77272e-006 to 1.06271e+015
error max = 3.55752e-006 avg = 3.41284e-006 |avg| = 3.41284e-006 to 1.13483e+016

win32 release mode
Optimized pow:
Domain from 1.38863e-014
error max = 0.885559 avg = -0.453968 |avg| = 0.464655 to 9.10054e-010
error max = 6.11255e-007 avg = 5.2206e-008 |avg| = 1.0857e-007 to 5.96413e-005
error max = 6.11255e-007 avg = 5.22211e-008 |avg| = 1.08562e-007 to 3.90865
error max = 6.11255e-007 avg = 5.22491e-008 |avg| = 1.08765e-007 to 256157
error max = 6.11255e-007 avg = 5.20589e-008 |avg| = 1.08351e-007 to 4.29497e+009
Classic powf:
Domain from 1.5333e-019
error max = 0.999992 avg = 0.0112573 |avg| = 0.0207178 to 1.00486e-014
error max = 3.11894e-006 avg = -2.54687e-006 |avg| = 2.54687e-006 to 6.58546e-010
error max = 2.07021e-006 avg = -1.48921e-006 |avg| = 1.48921e-006 to 4.31584e-005
error max = 1.01037e-006 avg = -4.31553e-007 |avg| = 4.41073e-007 to 2.82843
error max = 1.20056e-006 avg = 6.26103e-007 |avg| = 6.26103e-007 to 185364
error max = 2.26305e-006 avg = 1.68376e-006 |avg| = 1.68376e-006 to 1.2148e+010
error max = 3.30964e-006 avg = 2.74142e-006 |avg| = 2.74142e-006 to 7.96133e+014
error max = 3.55752e-006 avg = 3.3973e-006 |avg| = 3.3973e-006 to 1.13483e+016

flags used:
cl /arch:SSE /arch:SSE2 -D_CRT_SECURE_NO_WARNINGS /fp:fast /Ox /Gs- pow_precision_test.cpp


Added subscriber: @mib2berlin


Hi, I tested the patched build against the trunk build from juicyfruit (VS 2013) with my benchmark file, using 32x32 tiles.

http://www.blenderartists.org/forum/showthread.php?303832-New-Cycles-Benchmark

http://martijnberger.nl/file/win64-vc12_Lockal.7z

Trunk 07:44.73
With patch 07:36.33

Intel i5 3770K
8GB
Windows 8 Ultimate

BTW. Trunk on Linux 06:12.52
Cheers, mib.


Regarding `#ifndef __KERNEL_OPENCL__` for `color_srgb_to_scene_linear`: that's because OpenCL doesn't support function overloading. If you give the function a different name it should be OK.


Otherwise this looks good to me; if you want to commit the patch, go ahead.

Author
Member

Pow patch committed in 96903508bc.

I'll commit one other simple patch for ccl::svm_image_texture and then will close this task.

Author
Member

Changed status from 'Open' to: 'Resolved'

Author
Member

Committed the `ccl::svm_image_texture` code as acc90b40bf. There is no big reason to optimize the interpolation itself: 90% of its time is the actual texture read (`lea`, `movq`).

jrp commented 2014-03-30 00:57:37 +01:00 (Migrated from localhost:3001)

Added subscriber: @jrp

jrp commented 2014-03-30 00:57:37 +01:00 (Migrated from localhost:3001)

Here's a slightly enhanced converter. The ^2.4 function is still rather extravagant, as it only needs to work in the range 0-1.

[pow_precision_test.cpp](https://archive.blender.org/developer/F83420/pow_precision_test.cpp)

And here's [pow_precision_test.zip](https://archive.blender.org/developer/F83421/pow_precision_test.zip), the complete VS2013 project/solution, for those who want to optimize further.

```
fastpow24 pow:
Domain from 1.39e-014 in 7.15sec
error max = 0.89      avg = -0.454     |avg| = 0.465      to 9.1e-010   in 6.92 sec
error max = 6.1e-007  avg = 5.22e-008  |avg| = 1.09e-007  to 5.96e-005  in 1.97 sec
error max = 6.1e-007  avg = 5.22e-008  |avg| = 1.09e-007  to 3.91       in 1.98 sec
error max = 6.1e-007  avg = 5.22e-008  |avg| = 1.09e-007  to 2.56e+005  in 1.97 sec
error max = 6.1e-007  avg = 5.21e-008  |avg| = 1.08e-007  to 4.29e+009  in 1.73 sec

fasterpower24 pow:
Domain from 5.14e-012 in 7.77sec
error max = 0.7       avg = 0.0015     |avg| = 0.00641    to 3.37e-007  in 6.21 sec
error max = 9.5e-007  avg = 3.39e-010  |avg| = 1.52e-007  to 0.0221     in 1.73 sec
error max = 1e-006    avg = 5.76e-010  |avg| = 1.52e-007  to 1.45e+003  in 1.72 sec
error max = 1e-006    avg = 7.89e-010  |avg| = 1.52e-007  to 9.49e+007  in 1.73 sec
error max = 1e-006    avg = 6.33e-010  |avg| = 1.52e-007  to 2.87e+009  in 0.53 sec

Classic powf:
Domain from 1.53e-019 in 2.09sec
error max = 1         avg = 0.0113     |avg| = 0.0207     to 1e-014     in 2.96 sec
error max = 3.1e-006  avg = -2.55e-006 |avg| = 2.55e-006  to 6.59e-010  in 2.03 sec
error max = 2.1e-006  avg = -1.49e-006 |avg| = 1.49e-006  to 4.32e-005  in 2.04 sec
error max = 1e-006    avg = -4.32e-007 |avg| = 4.41e-007  to 2.83       in 2.03 sec
error max = 1.2e-006  avg = 6.26e-007  |avg| = 6.26e-007  to 1.85e+005  in 2.04 sec
error max = 2.3e-006  avg = 1.68e-006  |avg| = 1.68e-006  to 1.21e+010  in 2.04 sec
error max = 3.3e-006  avg = 2.74e-006  |avg| = 2.74e-006  to 7.96e+014  in 2.03 sec
error max = 3.6e-006  avg = 3.4e-006   |avg| = 3.4e-006   to 1.13e+016  in 0.491 sec
```
jrp commented 2014-03-30 17:43:48 +02:00 (Migrated from localhost:3001)

And here's a patch that includes the slightly faster robust power function (`fasterpower24`) but also has another (`approxpow24`) that just uses a polynomial approximation, which is adequate for the limited range needed. [srgblin.txt](https://archive.blender.org/developer/F83487/srgblin.txt)

Author
Member

jrp, are you sure that 2 iterations of Halley's method are faster than 3 iterations of the Newton-Raphson method? A single iteration of Halley's method has 6 multiplications, 4 additions and 1 division, while Newton-Raphson has only 4 multiplications, 1 addition and 1 division. Halley's method is also less robust because it calculates approx^5, so the working domain will be smaller.

The input of `color_srgb_to_scene_linear` is not limited to 1 in the case of EXR images. However, it is possible to call a specialized function in `svm_image_texture` for byte images and a generic function for float images.

The max error of the polynomial is too big (`approxpow24(1.0) = 0.994522324`). One may use a minimax approximant to achieve better results: `0.951542769e-3+(-0.3117281851e-1+(.5386576039+(.6188134751-.1274088440*x)*x)*x)*x` has a max error of 0.0001590453, but I think that is still too big.

jrp commented 2014-03-31 00:47:17 +02:00 (Migrated from localhost:3001)

Halley does seem to be a fraction faster, as the previous post illustrates. I've included the complete project file so that you can check that I am timing the right thing. In the great scheme of things, the classic `powf` doesn't do too badly.

EXR images should be linear already, but life being what it is, I can see that you may want to correct them anyway.

Here are a couple of other approximations:

[Optimizing conversion between sRGB and linear](http://excamera.com/sphinx/article-srgb.html)

[sRGB Approximations for HLSL](http://chilliant.blogspot.co.uk/2012/08/srgb-approximations-for-hlsl.html)

and I expect that you will have seen

[Optimizations for pow() with const non-integer exponent?](http://stackoverflow.com/questions/6475373/optimizations-for-pow-with-const-non-integer-exponent)

A further poke through the Blender code reveals that it seems to have at least one other sRGB-to-linear conversion, never mind the one in OpenImageIO, etc.

Reference: blender/blender#38034