Cycles optimization: move srgb/alpha conversion out of cycles kernel #38034

Sv. Lockal · 2014-01-02T21:29:33+01:00

Sv. Lockal commented

2014-01-02 21:29:33 +01:00

Looking at the top 5 functions in the profiler for [pavillon_barcelone_v1.2](http://blenderartists.org/forum/showthread.php?288611) (Ubuntu 14.04, CPU Intel Core i7 4771, compiled with `gcc --march=native`):

| | Function Name | CPU Time by Utilization | Instructions Retired | CPI Rate | CPU Frequency Ratio |
| ---- | ---- | ---- | ---- | ---- | ---- |
| | `ccl::bvh_intersect_instancing` | 9987.16s | 26759572000000 | **1.38332** | 1.05899 |
| > | `__ieee754_powf` | 1933.63s | 6645649500000 | **1.0751** | 1.05571 |
| > | `ccl::svm_image_texture` | 1078.16s | 1737641500000 | **2.29578** | 1.05716 |
| | `ccl::kernel_path_integrate` | 824.122s | 2536936500000 | 1.20303 | 1.0581 |
| | `ccl::shader_setup_from_ray` | 809.146s | 1627696000000 | 1.84461 | 1.06019 |

As you can see from the table, `powf` calls are too expensive even for Haswell.

Each time the Cycles kernel fetches an interpolated color for pixel (x, y), it applies alpha (if the `use_alpha` flag from the SVM stack is set) and converts the result from sRGB to linear (if the `srgb` flag from the SVM stack is set); see `svm_image_texture`. Therefore the Cycles kernel makes billions of `color_srgb_to_scene_linear` calls, which use `powf`. As far as I can see, both the `use_alpha` and `srgb` flags seem to be constants: only `EnvironmentTextureNode::compile`/`ImageTextureNode::compile` set them, and only `svm_node_tex_environment`, `svm_node_tex_image_box` and `svm_node_tex_image` decode them from the SVM stack.

If Cycles internals work only in linear space, can we convert images to linear space before starting the raytracer? This could give a noticeable boost for textured objects.

A few notes:

1) In theory, interpolation between pixels gives different results in linear space (right now Cycles interpolates in sRGB space). This difference is tiny and only noticeable for extremely low-res textures.
2) `interpolate(premultiply(image), x, y) ≡ premultiply(interpolate(image, x, y))`, AFAIK
3) If the user places the same image in the node tree, but with different settings (e.g. Color and Non-color data), then a copy of the image should be created.

No patch yet, waiting for Brecht's comment.
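To make note 1 concrete, here is a minimal sketch of the two orderings for a single channel and a 1D lerp between two texels. It uses the standard sRGB decoding curve rather than the Cycles function, and the names `srgb_to_linear`, `mix`, `sample_convert_after` and `sample_convert_before` are hypothetical helpers invented for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Standard sRGB decoding curve for one channel (not the Cycles implementation).
float srgb_to_linear(float c)
{
    assert(c >= 0.0f);
    if (c < 0.04045f)
        return c / 12.92f;
    return std::pow((c + 0.055f) / 1.055f, 2.4f);
}

// Plain linear interpolation between a and b.
float mix(float a, float b, float t)
{
    return a + (b - a) * t;
}

// Order described in the post: interpolate stored sRGB texels, then convert.
float sample_convert_after(float texel0, float texel1, float t)
{
    return srgb_to_linear(mix(texel0, texel1, t));
}

// Proposed order: texels are converted up front, interpolation is linear.
float sample_convert_before(float texel0, float texel1, float t)
{
    return mix(srgb_to_linear(texel0), srgb_to_linear(texel1), t);
}
```

Halfway between a black and a white texel, the first ordering gives roughly 0.214 and the second gives 0.5. The gap shrinks as neighbouring texels get closer in value, which is why it mostly shows on very low-resolution or high-contrast textures.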

Sv. Lockal commented

2014-01-02 21:29:33 +01:00

Changed status to: 'Open'

Sv. Lockal self-assigned this 2014-01-02 21:29:33 +01:00

Sv. Lockal commented

2014-01-02 21:29:33 +01:00

Added subscriber: @Lockal

Sv. Lockal commented

2014-01-02 21:30:13 +01:00

Added subscribers: @ThomasDinges, @brecht, @MartijnBerger

Brecht Van Lommel commented

2014-01-03 17:44:05 +01:00

It's impossible to store linear colors in 8 bits without artifacts. Storing it in floats or half-floats would be possible but takes more memory and image textures are already the biggest memory user in many scenes. Interpolation in linear space would in fact be more accurate so that's no problem.

It would be possible to use a lookup table for the values you read from the texture, that's 12 table lookups. That may be faster, I guess it depends a bit on the scene because such a table might easily stay in the cache on simple scenes but not always for more complex scenes.
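A minimal sketch of such a lookup table for byte textures, assuming the standard sRGB decoding curve. The names `build_srgb_lut`, `Texel`, `LinearRGB` and `decode_texel` are hypothetical, and reading the count of 12 as 4 texels times 3 color channels for one bilinear fetch is an interpretation, not something stated above.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

// Standard sRGB decoding curve, used only to fill the table.
static float srgb_to_linear_ref(float c)
{
    assert(c >= 0.0f);
    if (c < 0.04045f)
        return c / 12.92f;
    return std::pow((c + 0.055f) / 1.055f, 2.4f);
}

// Hypothetical helper: 256 floats (1 KiB), one entry per possible byte value.
static std::array<float, 256> build_srgb_lut()
{
    std::array<float, 256> lut{};
    for (int i = 0; i < 256; i++)
        lut[i] = srgb_to_linear_ref(i / 255.0f);
    return lut;
}

struct Texel { uint8_t r, g, b, a; };
struct LinearRGB { float r, g, b; };

// Three lookups per texel; alpha is not gamma-encoded and is left alone.
static LinearRGB decode_texel(const std::array<float, 256> &lut, Texel t)
{
    return {lut[t.r], lut[t.g], lut[t.b]};
}
```

The image stays in 8 bits in memory and only the decoded values are floats, so this avoids the memory cost of storing linear colors. Whether the 1 KiB table stays in cache on complex scenes is the open question raised above.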

Martijn Berger commented

2014-01-03 22:28:01 +01:00

@Lockal how good or bad is powf, and how much error could we tolerate in the desired range? And how does this translate to possible faster variations of powf?

I know that some implementations of functions like this tend to be slower than you would want or expect, due to accuracy and/or legacy reasons.

But is there any speed to be gained from using a powf that is just good enough but faster?

Sv. Lockal commented

2014-01-05 14:23:25 +01:00

Test patch for powf replacement. It uses a speculative initial guess based on the [float representation](https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Approximations_that_depend_on_IEEE_representation) and improves the result with three iterations of the Newton-Raphson method. It is uncommented and can be improved with blendv (SSE4) and FMA intrinsics. Gives a 7% speedup on i7-4771 (Haswell) for pavillon_barcelone_v1.2.blend, and 4% on a simple plane with a texture.

[srgb2linear.patch](https://archive.blender.org/developer/F62265/srgb2linear.patch)
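The general technique can be sketched in scalar code as follows. This is not the code from the attached patch: the decomposition x^2.4 = (x^(4/5))^3, the scalar (non-SSE) form, and the names `guess_pow_4_5`, `newton_fifth_root_step` and `approx_pow24` are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Rough guess for x^(4/5). The bit pattern of a positive float is close to
// 2^23 * (log2(x) + 127), so scaling the bits scales the logarithm.
static float guess_pow_4_5(float x)
{
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    const float one_bits = 1065353216.0f;  // bit pattern of 1.0f (127 << 23)
    const float scaled = 0.8f * (static_cast<float>(bits) - one_bits) + one_bits;
    const uint32_t out = static_cast<uint32_t>(scaled);
    float y;
    std::memcpy(&y, &out, sizeof y);
    return y;
}

// One Newton-Raphson step for the equation y^5 = a.
static float newton_fifth_root_step(float y, float a)
{
    const float y2 = y * y;
    return (4.0f * y + a / (y2 * y2)) * 0.2f;
}

// Hypothetical name; intended for positive, normal x only.
static float approx_pow24(float x)
{
    assert(x > 0.0f);
    const float x2 = x * x;
    const float x4 = x2 * x2;
    float y = guess_pow_4_5(x);  // within a few percent of x^(4/5)
    for (int i = 0; i < 3; i++)
        y = newton_fifth_root_step(y, x4);  // refine y toward (x^4)^(1/5)
    return y * y * y;  // (x^(4/5))^3 = x^2.4
}
```

The initial guess is off by a few percent at most, and each Newton-Raphson step roughly squares the relative error, so three steps are enough to reach single-precision accuracy on inputs such as the 0 to 1 range of byte textures.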

Thomas Dinges commented

2014-01-05 14:57:15 +01:00

Nice patch.
On Ivy Bridge I get 30% speedup in images.blend (from test suite) with 100 Samples. (20.54s >> 15.52s).

Sv. Lockal commented

2014-01-05 16:24:08 +01:00

[pow_precision_test.cpp](https://archive.blender.org/developer/F62310/pow_precision_test.cpp) was used for testing the precision and robustness of the optimized pow. The optimized function gives better precision than the original powf from glibc/eglibc by approximately one decimal digit. However, the optimized pow is less robust: it works only for positive numbers in the range 1e-10 to 1e+10 (which should be enough for srgb->linear conversion). The original powf from glibc works for numbers up to 10e16.
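A measurement of this kind can be sketched as below. This is not the attached test file; the function name `max_relative_error_pow24`, the geometric sampling, and the use of double-precision `std::pow` as the reference are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cmath>

// Largest relative error of `approx` against double-precision pow(x, 2.4),
// sampled at `steps` geometrically spaced points in [lo, hi].
template<typename F>
double max_relative_error_pow24(F approx, double lo, double hi, int steps)
{
    assert(lo > 0.0 && hi > lo && steps > 1);
    const double ratio = std::pow(hi / lo, 1.0 / (steps - 1));
    double worst = 0.0;
    double x = lo;
    for (int i = 0; i < steps; i++, x *= ratio) {
        // Round the sample to float first so both sides see the same input.
        const float xf = static_cast<float>(x);
        const double exact = std::pow(static_cast<double>(xf), 2.4);
        const double got = approx(xf);
        worst = std::fmax(worst, std::fabs(got - exact) / exact);
    }
    return worst;
}
```

Any candidate can be passed in as a callable, for example `max_relative_error_pow24([](float v) { return std::pow(v, 2.4f); }, 1e-3, 1.0, 10000)` for the library function. Splitting the domain into several ranges, as the logs in this thread do, shows where an approximation stops being usable.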

Thomas Dinges commented

2014-01-05 16:39:09 +01:00

Tested with pavillon_barcelone_v1.2, scene "CPU Benchmark"

Ivy Bridge Quad Core (3.4 GHZ)
Ubuntu Linux 12.10, x64
gcc 4.7.2

Vanilla master: 08:40min
With patch: 8:00 min

So I can confirm your 7% here @Lockal, nice work!

Brecht Van Lommel commented

2014-01-05 16:50:42 +01:00

Nice work!

* Could you make a `color_srgb_to_scene_linear` that takes a float4, so svm_image.h just calls this function and the rest is hidden in util_color.h?
* This code assumes that pow with constant arguments will be constant folded. Can we trust Visual Studio 2008 to do this? You could make that value a template parameter to be sure.
* We don't currently have unit tests. If you want, you could create a test directory with a C++ file that includes util_color.h, but it's up to you if you want to do this.

Sv. Lockal commented

2014-01-05 18:21:05 +01:00

1) We already have `color_srgb_to_scene_linear(float3 c)`, but why does this function sit inside `#ifndef __KERNEL_OPENCL__`? Is something wrong with float vectors in OpenCL? Also note that `svm_image_texture` is third in the profiler list: there are obvious optimizations there, such as vector alpha multiplication and `_mm_min_ps`. I just want to see the result of the `pow` changes in this patch.
2) Good idea. C++ templates do not support floats as template parameters, so I'll fold the float into a hex constant and add a comment.
3) It's OK as long as we can attach files here. It would be better to make the code less cryptic by moving the common SSE block into util_simd.h (I'll move `blend(mask, a, b)` for now).
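Point 2 can be sketched like this: the float constant travels through the template as its 32-bit IEEE-754 bit pattern, so no runtime `pow` call is needed to produce it. The names `float_from_bits`, `scale_by_constant` and `check_documented_constants` are hypothetical and do not come from the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern as an IEEE-754 single-precision float.
inline float float_from_bits(uint32_t bits)
{
    static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Hypothetical example: multiply by a constant passed as a bit pattern,
// since a float itself could not be a template parameter.
template<uint32_t FactorBits>
inline float scale_by_constant(float x)
{
    return x * float_from_bits(FactorBits);
}

// The comment documenting a hex constant can be backed by a check.
inline void check_documented_constants()
{
    assert(float_from_bits(0x3F4CCCCDu) == 0.8f);  // 0x3F4CCCCD = 4/5
    assert(float_from_bits(0x3F800000u) == 1.0f);  // 0x3F800000 = 1
}
```

A call then looks like `scale_by_constant<0x3F4CCCCD>(x)`, with the decimal value kept in a comment next to the hex literal.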

Martijn Berger commented

2014-01-05 18:23:03 +01:00

Sandy Bridge hardware gives me about 7% on Barcelona and some other archviz scenes, where higher resolution seems to give more speedup and more texture-heavy scenes also gain more.

Barcelona gives a 7.01% improvement, comparing 5 runs with the patch against 5 runs without.

Sv. Lockal commented

2014-01-05 19:31:21 +01:00

[srgb2linear_v2.patch](https://archive.blender.org/developer/F62390/srgb2linear_v2.patch)

New version of this patch: added comments, moved blend() to util_simd.h (SSE4.1 saves 2 instructions), and replaced exp2(... * pow(...)) with precalculated constants.
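The instruction saving can be seen in a lane-select helper like the one below. This is a sketch for x86 only and not the code moved to util_simd.h; the name `select_lanes` is hypothetical, and the mask is assumed to be all-ones or all-zeros per lane, as produced by SSE comparisons.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2
#ifdef __SSE4_1__
#include <smmintrin.h>  // SSE4.1, for _mm_blendv_ps
#endif

// Per lane: returns a where the mask lane is set, b otherwise.
inline __m128 select_lanes(__m128 mask, __m128 a, __m128 b)
{
#ifdef __SSE4_1__
    // One instruction: takes the second operand where the mask sign bit is set.
    return _mm_blendv_ps(b, a, mask);
#else
    // Three instructions: (mask & a) | (~mask & b).
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
#endif
}
```

Both paths give the same result for comparison masks, so callers do not need to know which instruction set the kernel was built for.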

Martijn Berger commented

2014-01-05 21:39:19 +01:00

[blender trunk 3 versions](http://martijnberger.nl/file/win64-vc12_Lockal.7z)

I compiled trunk, trunk + your patch, and trunk + patch + SSE4.1 kernel into one 7z.

I'll test tomorrow.

Martijn Berger commented

2014-01-05 22:19:27 +01:00

@Lockal

win64 release mode:
Optimized pow:
Domain from 1.38863e-014
error max = 0.885559 avg = -0.453968 |avg| = 0.464655 to 9.10054e-010
error max = 6.11255e-007 avg = 5.2206e-008 |avg| = 1.0857e-007 to 5.96413e-005
error max = 6.11255e-007 avg = 5.22211e-008 |avg| = 1.08562e-007 to 3.90865
error max = 6.11255e-007 avg = 5.22491e-008 |avg| = 1.08765e-007 to 256157
error max = 6.11255e-007 avg = 5.20589e-008 |avg| = 1.08351e-007 to 4.29497e+009
Classic powf:
Domain from 2.0467e-019
error max = 0.333336 avg = -0.00114124 |avg| = 0.0083194 to 1.34133e-014
error max = 3.09128e-006 avg = -2.51558e-006 |avg| = 2.51558e-006 to 8.79053e-010
error max = 2.04248e-006 avg = -1.45791e-006 |avg| = 1.45791e-006 to 5.76096e-005
error max = 9.82767e-007 avg = -4.00258e-007 |avg| = 4.16499e-007 to 3.7755
error max = 1.21638e-006 avg = 6.57398e-007 |avg| = 6.57398e-007 to 247431
error max = 2.29059e-006 avg = 1.71506e-006 |avg| = 1.71506e-006 to 1.62157e+010
error max = 3.33717e-006 avg = 2.77272e-006 |avg| = 2.77272e-006 to 1.06271e+015
error max = 3.55752e-006 avg = 3.41284e-006 |avg| = 3.41284e-006 to 1.13483e+016

win32 release mode
Optimized pow:
Domain from 1.38863e-014
error max = 0.885559 avg = -0.453968 |avg| = 0.464655 to 9.10054e-010
error max = 6.11255e-007 avg = 5.2206e-008 |avg| = 1.0857e-007 to 5.96413e-005
error max = 6.11255e-007 avg = 5.22211e-008 |avg| = 1.08562e-007 to 3.90865
error max = 6.11255e-007 avg = 5.22491e-008 |avg| = 1.08765e-007 to 256157
error max = 6.11255e-007 avg = 5.20589e-008 |avg| = 1.08351e-007 to 4.29497e+009
Classic powf:
Domain from 1.5333e-019
error max = 0.999992 avg = 0.0112573 |avg| = 0.0207178 to 1.00486e-014
error max = 3.11894e-006 avg = -2.54687e-006 |avg| = 2.54687e-006 to 6.58546e-010
error max = 2.07021e-006 avg = -1.48921e-006 |avg| = 1.48921e-006 to 4.31584e-005
error max = 1.01037e-006 avg = -4.31553e-007 |avg| = 4.41073e-007 to 2.82843
error max = 1.20056e-006 avg = 6.26103e-007 |avg| = 6.26103e-007 to 185364
error max = 2.26305e-006 avg = 1.68376e-006 |avg| = 1.68376e-006 to 1.2148e+010
error max = 3.30964e-006 avg = 2.74142e-006 |avg| = 2.74142e-006 to 7.96133e+014
error max = 3.55752e-006 avg = 3.3973e-006 |avg| = 3.3973e-006 to 1.13483e+016

flags used:
cl /arch:SSE /arch:SSE2 -D_CRT_SECURE_NO_WARNINGS /fp:fast /Ox /Gs- pow_precision_test.cpp

Wolfgang Faehnle commented

2014-01-05 23:09:53 +01:00

Added subscriber: @mib2berlin

Wolfgang Faehnle commented

2014-01-05 23:09:53 +01:00

Hi, tested the patch compared to the trunk build from juicyfruit (VS 2013) with my benchmark file, 32x32 tiles.

http://www.blenderartists.org/forum/showthread.php?303832-New-Cycles-Benchmark

http://martijnberger.nl/file/win64-vc12_Lockal.7z

Trunk 07:44.73
With patch 07:36.33

Intel i5 3770K
8GB
Windows 8 Ultimate

BTW. Trunk on Linux 06:12.52
Cheers, mib.

Brecht Van Lommel commented

2014-01-06 16:39:59 +01:00

Regarding `#ifndef __KERNEL_OPENCL__` for `color_srgb_to_scene_linear`: that's because OpenCL doesn't support function overloading. If you give the function a different name it should be OK.

Brecht Van Lommel commented

2014-01-06 16:40:24 +01:00

Other than that, this looks good to me. If you want to commit the patch, go ahead.

Sv. Lockal commented

2014-01-06 17:07:02 +01:00

Pow patch committed in 96903508bc.

I'll commit one other simple patch for ccl::svm_image_texture and then will close this task.

Sv. Lockal commented

2014-01-06 19:34:52 +01:00

Changed status from 'Open' to: 'Resolved'

Sv. Lockal closed this issue

2014-01-06 19:34:52 +01:00

Sv. Lockal commented

2014-01-06 19:34:52 +01:00

Committed the ccl::svm_image_texture code as acc90b40bf. There is no big reason to optimize the interpolation itself: 90% of its time is the actual texture read (lea, movq).

jrp commented

2014-03-30 00:57:37 +01:00

Added subscriber: @jrp

jrp commented

2014-03-30 00:57:37 +01:00

Here's a slightly enhanced converter. The ^2.4 function is still rather extravagant as it only needs to work in the range 0-1.

[pow_precision_test.cpp](https://archive.blender.org/developer/F83420/pow_precision_test.cpp)

and here's [pow_precision_test.zip](https://archive.blender.org/developer/F83421/pow_precision_test.zip), the complete VS2013 project / solution for those that want to optimize further.

```
fastpow24 pow:
Domain from 1.39e-014 in 7.15sec
error max = 0.89 avg = -0.454 |avg| = 0.465 to 9.1e-010 in 6.92 sec
error max = 6.1e-007 avg = 5.22e-008 |avg| = 1.09e-007 to 5.96e-005 in 1.97 sec
error max = 6.1e-007 avg = 5.22e-008 |avg| = 1.09e-007 to 3.91 in 1.98 sec
error max = 6.1e-007 avg = 5.22e-008 |avg| = 1.09e-007 to 2.56e+005 in 1.97 sec
error max = 6.1e-007 avg = 5.21e-008 |avg| = 1.08e-007 to 4.29e+009 in 1.73 sec

fasterpower24 pow:
Domain from 5.14e-012 in 7.77sec
error max = 0.7 avg = 0.0015 |avg| = 0.00641 to 3.37e-007 in 6.21 sec
error max = 9.5e-007 avg = 3.39e-010 |avg| = 1.52e-007 to 0.0221 in 1.73 sec
error max = 1e-006 avg = 5.76e-010 |avg| = 1.52e-007 to 1.45e+003 in 1.72 sec
error max = 1e-006 avg = 7.89e-010 |avg| = 1.52e-007 to 9.49e+007 in 1.73 sec
error max = 1e-006 avg = 6.33e-010 |avg| = 1.52e-007 to 2.87e+009 in 0.53 sec

Classic powf:
Domain from 1.53e-019 in 2.09sec
error max = 1 avg = 0.0113 |avg| = 0.0207 to 1e-014 in 2.96 sec
error max = 3.1e-006 avg = -2.55e-006 |avg| = 2.55e-006 to 6.59e-010 in 2.03 sec
error max = 2.1e-006 avg = -1.49e-006 |avg| = 1.49e-006 to 4.32e-005 in 2.04 sec
error max = 1e-006 avg = -4.32e-007 |avg| = 4.41e-007 to 2.83 in 2.03 sec
error max = 1.2e-006 avg = 6.26e-007 |avg| = 6.26e-007 to 1.85e+005 in 2.04 sec
error max = 2.3e-006 avg = 1.68e-006 |avg| = 1.68e-006 to 1.21e+010 in 2.04 sec
error max = 3.3e-006 avg = 2.74e-006 |avg| = 2.74e-006 to 7.96e+014 in 2.03 sec
error max = 3.6e-006 avg = 3.4e-006 |avg| = 3.4e-006 to 1.13e+016 in 0.491 sec
```

jrp commented

2014-03-30 17:43:48 +02:00

And here's a patch that includes the slightly faster robust power function (fasterpower24) but also has another (approxpow24) that just uses a polynomial approximation, which is adequate for the limited range needed: [srgblin.txt](https://archive.blender.org/developer/F83487/srgblin.txt)

Sv. Lockal commented

2014-03-30 19:29:53 +02:00

jrp, are you sure that 2 iterations of Halley's method is faster than 3 iterations of Newton-Raphson method? A single iteration of Halley's method has 6*, 4+ and 1/, while Newton-Raphson has only 4*, 1+ and 1/. Halley's method is less robust because it calculates approx^5, so the working domain will be smaller.
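For reference, the two updates can be written out in their textbook forms, assuming the iteration solves $y^5 = a$ (which matches the approx^5 mentioned above). The exact arrangement of operations in either patch, and therefore the counts quoted above, may differ from these forms.

```latex
% Newton-Raphson for f(y) = y^5 - a (quadratic convergence)
y_{k+1} = y_k - \frac{y_k^5 - a}{5 y_k^4}
        = \frac{1}{5}\left(4 y_k + \frac{a}{y_k^4}\right)

% Halley's method for the same f (cubic convergence)
y_{k+1} = y_k - \frac{2 f f'}{2 f'^2 - f f''}
        = y_k \, \frac{2 y_k^5 + 3 a}{3 y_k^5 + 2 a}
```

Halley's step needs $y_k^5$ where Newton-Raphson needs only $y_k^4$, which is the source of the narrower working domain: the higher power overflows or underflows sooner in single precision.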

The input of color_srgb_to_scene_linear is not limited to 1 in case of EXR images. However it is possible to call a specialized function in svm_image_texture for byte images and a generic function for float images.

The max error of polynomial is too big (approxpow24(1.0) = 0.994522324). One may use minimax approximant to achieve better results. 0.951542769e-3+(-0.3117281851e-1+(.5386576039+(.6188134751-.1274088440*x)*x)*x)*x has max error of 0.0001590453, but I think it is still too big.
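The quoted polynomial can be evaluated as written, in Horner form. The function name `pow24_minimax_poly` and the restriction of the input to [0, 1] are assumptions for this sketch; the coefficients are copied from the comment above.

```cpp
#include <cassert>
#include <cmath>

// Degree-4 polynomial quoted above as an approximation of x^2.4,
// evaluated in Horner form (4 multiplications, 4 additions).
inline float pow24_minimax_poly(float x)
{
    assert(x >= 0.0f && x <= 1.0f);
    return 0.951542769e-3f +
           (-0.3117281851e-1f +
            (0.5386576039f + (0.6188134751f - 0.1274088440f * x) * x) * x) * x;
}
```

Evaluating at x = 1 gives about 0.99984, which is consistent with the quoted maximum error of roughly 0.00016 being reached at the upper end of the range.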

jrp commented

2014-03-31 00:47:17 +02:00

Halley does seem to be a fraction faster, as the previous post illustrates. I've included the complete project file so that you can check that I am timing the right thing. In the great scheme of things the classic powf doesn't do too badly.

EXR images should be linear already, but life being what it is, I can see that you may want to correct them anyway.

Here are a couple of other approximations:

[Optimizing conversion between sRGB and linear](http://excamera.com/sphinx/article-srgb.html)

[sRGB Approximations for HLSL](http://chilliant.blogspot.co.uk/2012/08/srgb-approximations-for-hlsl.html)

and I expect that you will have seen

[Optimizations for pow() with const non-integer exponent?](http://stackoverflow.com/questions/6475373/optimizations-for-pow-with-const-non-integer-exponent)

A further poke through the Blender code reveals that it seems to have at least one other sRGB to linear conversion, never mind those in OpenImageIO, etc.
