Cycles optimization: move srgb/alpha conversion out of cycles kernel #38034
Reference: blender/blender#38034
Looking at the top 5 functions in the profiler for pavillon_barcelone_v1.2 (Ubuntu 14.04, CPU Intel Core i7 4771, compiled with gcc --march=native):
ccl::bvh_intersect_instancing
__ieee754_powf
ccl::svm_image_texture
ccl::kernel_path_integrate
ccl::shader_setup_from_ray
As you can see from the table, powf calls are too expensive even for Haswell.
Each time the Cycles kernel fetches an interpolated color for pixel (x, y), it applies alpha (if the use_alpha flag from the SVM stack is set) and converts the result from sRGB to linear (if the srgb flag from the SVM stack is set); see svm_image_texture. Therefore the Cycles kernel produces billions of color_srgb_to_scene_linear calls, which use powf. As far as I can see, both the use_alpha and srgb flags seem to be constants: only EnvironmentTextureNode::compile/ImageTextureNode::compile set them, and only svm_node_tex_environment, svm_node_tex_image_box and svm_node_tex_image decode them from the SVM stack.
If Cycles internals work only in linear space, can we convert images to linear space before starting the raytracer? This could give a noticeable boost for textured objects.
A few notes:
interpolate(premultiply(image), x, y) ≡ premultiply(interpolate(image, x, y)), AFAIK
No patch yet, waiting for Brecht's comment.
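For readers unfamiliar with where the powf call comes from, here is a minimal sketch of the standard piecewise sRGB-to-linear transfer function for a single channel. The name srgb_to_linear_ref is hypothetical and this is not the Cycles source; it only illustrates that every sample above the small linear toe pays for one power evaluation.

```cpp
#include <cmath>

// Standard piecewise sRGB-to-linear transfer function for one channel.
// srgb_to_linear_ref is a hypothetical name used only for this sketch.
inline float srgb_to_linear_ref(float c)
{
    if (c < 0.04045f) {
        // Linear toe of the curve; negative inputs are clamped to zero.
        return (c < 0.0f) ? 0.0f : c / 12.92f;
    }
    // Power segment: this call is the expensive part discussed in the report.
    return std::pow((c + 0.055f) / 1.055f, 2.4f);
}
```

With three color channels per texture sample, a textured scene evaluates the power segment a very large number of times, which is consistent with powf showing up near the top of the profile.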
Changed status to: 'Open'
Added subscriber: @Lockal
Added subscribers: @ThomasDinges, @brecht, @MartijnBerger
It's impossible to store linear colors in 8 bits without artifacts. Storing it in floats or half-floats would be possible but takes more memory and image textures are already the biggest memory user in many scenes. Interpolation in linear space would in fact be more accurate so that's no problem.
It would be possible to use a lookup table for the values you read from the texture, that's 12 table lookups. That may be faster, I guess it depends a bit on the scene because such a table might easily stay in the cache on simple scenes but not always for more complex scenes.
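The lookup-table idea can be sketched as follows for 8-bit textures. This is an illustration under assumptions, not code from Cycles: the function names are hypothetical, and the table is built once on the host with the standard sRGB formula. The "12 table lookups" presumably corresponds to 4 texels of a bilinear fetch times 3 color channels, with interpolation done afterwards in linear space.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

// Hypothetical helper: builds a 256-entry table that maps an 8-bit sRGB
// channel value to a linear float, using the standard piecewise formula.
inline std::array<float, 256> build_srgb_to_linear_table()
{
    std::array<float, 256> table{};
    for (int i = 0; i < 256; ++i) {
        const float c = static_cast<float>(i) / 255.0f;
        table[i] = (c < 0.04045f) ? c / 12.92f
                                  : std::pow((c + 0.055f) / 1.055f, 2.4f);
    }
    return table;
}

// One table read replaces one power evaluation per channel.
inline float srgb_byte_to_linear(const std::array<float, 256> &table, std::uint8_t v)
{
    return table[v];
}
```

The table is 1 KB of floats, so whether it stays in cache depends on how much other data the scene touches, as the comment above points out.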
@Lockal how good or bad is powf, and how much error could we tolerate in the desired range? How does this translate to possible faster variations of powf?
I know that some implementations of functions like this tend to be slower than you would want or expect, for accuracy and/or legacy reasons.
But is there any speed to be gained from using a powf that is just good enough but faster?
Test patch for a powf replacement. It uses a speculative initial guess based on the float representation and improves the result with three iterations of the Newton-Raphson method. The patch is uncommented and can be improved with blendv (SSE4) and FMA intrinsics. It gives a 7% speedup on an i7-4771 (Haswell) for pavillon_barcelone_v1.2.blend, and 4% on a simple plane with a texture.
srgb2linear.patch
Nice patch.
On Ivy Bridge I get 30% speedup in images.blend (from test suite) with 100 Samples. (20.54s >> 15.52s).
pow_precision_test.cpp was used for testing the precision and robustness of the optimized pow. The optimized function gives better precision than the original powf from glibc/eglibc by approximately one decimal digit. However, the optimized pow is less robust: it works only for positive numbers in the range 1e-10 to 1e+10 (which should be enough for the sRGB to linear conversion). The original powf from glibc works for numbers up to 10e16.
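A precision check of this kind can be sketched as follows. This is not pow_precision_test.cpp; the function name, the geometric sampling and the use of relative error are assumptions chosen for illustration. The reference is a double-precision pow evaluated on the same float input that the approximation receives.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical harness: samples x geometrically over [lo, hi] and returns the
// largest relative error of approx(x) against double-precision pow(x, 2.4).
// Requires samples >= 2 and 0 < lo < hi.
template<typename F>
double max_relative_error_pow24(F approx, double lo, double hi, int samples)
{
    const double ratio = std::pow(hi / lo, 1.0 / (samples - 1));
    double worst = 0.0;
    double x = lo;
    for (int i = 0; i < samples; ++i, x *= ratio) {
        const float xf = static_cast<float>(x);
        const double reference = std::pow(static_cast<double>(xf), 2.4);
        const double got = static_cast<double>(approx(xf));
        worst = std::max(worst, std::fabs(got - reference) / reference);
    }
    return worst;
}
```

Passing the library powf and a candidate replacement through the same harness gives directly comparable numbers for any sub-range of the domain.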
Tested with pavillon_barcelone_v1.2, scene "CPU Benchmark"
Ivy Bridge Quad Core (3.4 GHZ)
Ubuntu Linux 12.10, x64
gcc 4.7.2
Vanilla master: 08:40min
With patch: 8:00 min
So I can confirm your 7% here @Lockal, nice work!
Nice work!
color_srgb_to_scene_linear(float3 c), but why does this function lie inside #ifndef __KERNEL_OPENCL__? Is something wrong with float vectors in OpenCL? Also note that svm_image_texture is third in the profiler list: there are obvious candidates there, namely the vector alpha multiplication and _mm_min_ps. I just want to see the result of the pow changes in this patch.
Sandy Bridge hardware gives me about 7% on Barcelona and some other archviz scenes, where higher resolution seems to give more speedup, and more texture-heavy scenes also gain more.
Barcelona gives a 7.01% improvement on 5 runs with the patch vs 5 runs without.
srgb2linear_v2.patch
New version of this patch: added comments, moved blend() to util_simd.h (SSE4.1 gives 2 instructions less), and replaced exp2(... * pow(...)) with precalculated constants.
blender trunk 3 versions
I compiled trunk, trunk + your patch and trunk + patch + sse41 kernel in one 7z
I'll test tomorrow
@Lockal
win64 release mode:
Optimized pow:
Domain from 1.38863e-014
error max = 0.885559 avg = -0.453968 |avg| = 0.464655 to 9.10054e-010
error max = 6.11255e-007 avg = 5.2206e-008 |avg| = 1.0857e-007 to 5.96413e-005
error max = 6.11255e-007 avg = 5.22211e-008 |avg| = 1.08562e-007 to 3.90865
error max = 6.11255e-007 avg = 5.22491e-008 |avg| = 1.08765e-007 to 256157
error max = 6.11255e-007 avg = 5.20589e-008 |avg| = 1.08351e-007 to 4.29497e+009
Classic powf:
Domain from 2.0467e-019
error max = 0.333336 avg = -0.00114124 |avg| = 0.0083194 to 1.34133e-014
error max = 3.09128e-006 avg = -2.51558e-006 |avg| = 2.51558e-006 to 8.79053e-010
error max = 2.04248e-006 avg = -1.45791e-006 |avg| = 1.45791e-006 to 5.76096e-005
error max = 9.82767e-007 avg = -4.00258e-007 |avg| = 4.16499e-007 to 3.7755
error max = 1.21638e-006 avg = 6.57398e-007 |avg| = 6.57398e-007 to 247431
error max = 2.29059e-006 avg = 1.71506e-006 |avg| = 1.71506e-006 to 1.62157e+010
error max = 3.33717e-006 avg = 2.77272e-006 |avg| = 2.77272e-006 to 1.06271e+015
error max = 3.55752e-006 avg = 3.41284e-006 |avg| = 3.41284e-006 to 1.13483e+016
win32 release mode
Optimized pow:
Domain from 1.38863e-014
error max = 0.885559 avg = -0.453968 |avg| = 0.464655 to 9.10054e-010
error max = 6.11255e-007 avg = 5.2206e-008 |avg| = 1.0857e-007 to 5.96413e-005
error max = 6.11255e-007 avg = 5.22211e-008 |avg| = 1.08562e-007 to 3.90865
error max = 6.11255e-007 avg = 5.22491e-008 |avg| = 1.08765e-007 to 256157
error max = 6.11255e-007 avg = 5.20589e-008 |avg| = 1.08351e-007 to 4.29497e+009
Classic powf:
Domain from 1.5333e-019
error max = 0.999992 avg = 0.0112573 |avg| = 0.0207178 to 1.00486e-014
error max = 3.11894e-006 avg = -2.54687e-006 |avg| = 2.54687e-006 to 6.58546e-010
error max = 2.07021e-006 avg = -1.48921e-006 |avg| = 1.48921e-006 to 4.31584e-005
error max = 1.01037e-006 avg = -4.31553e-007 |avg| = 4.41073e-007 to 2.82843
error max = 1.20056e-006 avg = 6.26103e-007 |avg| = 6.26103e-007 to 185364
error max = 2.26305e-006 avg = 1.68376e-006 |avg| = 1.68376e-006 to 1.2148e+010
error max = 3.30964e-006 avg = 2.74142e-006 |avg| = 2.74142e-006 to 7.96133e+014
error max = 3.55752e-006 avg = 3.3973e-006 |avg| = 3.3973e-006 to 1.13483e+016
flags used:
cl /arch:SSE /arch:SSE2 -D_CRT_SECURE_NO_WARNINGS /fp:fast /Ox /Gs- pow_precision_test.cpp
Added subscriber: @mib2berlin
Hi, tested the patch compared to a trunk build from juicyfruit (VS 2013) with my benchmark file, 32x32 tiles.
http://www.blenderartists.org/forum/showthread.php?303832-New-Cycles-Benchmark
http://martijnberger.nl/file/win64-vc12_Lockal.7z
Trunk 07:44.73
With patch 07:36.33
Intel i5 3770K
8GB
Windows 8 Ultimate
BTW. Trunk on Linux 06:12.52
Cheers, mib.
Regarding #ifndef __KERNEL_OPENCL__ for color_srgb_to_scene_linear: that's because OpenCL doesn't support function overloading. If you give the function a different name it should be ok.
Further, this looks good to me; if you want to commit the patch, go ahead.
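The renaming suggestion can be sketched like this. The float3 struct is a stand-in for the kernel's vector type, and the _v3 suffix is a hypothetical naming choice for this illustration. Because the scalar and vector versions have distinct names, the same source does not rely on overload resolution, which OpenCL C lacks.

```cpp
#include <cmath>

struct float3 {  // stand-in for the kernel's vector type
    float x, y, z;
};

// Scalar conversion, standard piecewise sRGB formula.
inline float color_srgb_to_scene_linear(float c)
{
    if (c < 0.04045f) {
        return (c < 0.0f) ? 0.0f : c / 12.92f;
    }
    return std::pow((c + 0.055f) / 1.055f, 2.4f);
}

// Vector conversion under a distinct (hypothetical) name instead of an
// overload, so no overload resolution is needed.
inline float3 color_srgb_to_scene_linear_v3(float3 c)
{
    return float3{color_srgb_to_scene_linear(c.x),
                  color_srgb_to_scene_linear(c.y),
                  color_srgb_to_scene_linear(c.z)};
}
```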
Pow patch committed in 96903508bc.
I'll commit one other simple patch for ccl::svm_image_texture and then will close this task.
Changed status from 'Open' to: 'Resolved'
Committed ccl::svm_image_texture code as acc90b40bf. No big reason to optimize the interpolation itself: 90% of its time is the actual texture read (lea, movq).
Added subscriber: @jrp
Here's a slightly enhanced converter. The ^2.4 function is still rather extravagant as it only needs to work in the range 0-1.
pow_precision_test.cpp
and here's pow_precision_test.zip the complete VS2013 project / solution for those that want to optimize further.
3 in 2.03 sec
And here's a patch that includes the slightly faster robust power function (fasterpower24) but also has another (approxpow24) that just uses a polynomial approximation, which is adequate for the limited range needed.
srgblin.txt
jrp, are you sure that 2 iterations of Halley's method are faster than 3 iterations of the Newton-Raphson method? A single iteration of Halley's method has 6*, 4+ and 1/, while Newton-Raphson has only 4*, 1+ and 1/. Halley's method is less robust because it calculates approx^5, so the working domain will be smaller.
The input of color_srgb_to_scene_linear is not limited to 1 in the case of EXR images. However, it is possible to call a specialized function in svm_image_texture for byte images and a generic function for float images.
The max error of the polynomial is too big (approxpow24(1.0) = 0.994522324). One may use a minimax approximant to achieve better results.
0.951542769e-3+(-0.3117281851e-1+(.5386576039+(.6188134751-.1274088440*x)*x)*x)*x
has a max error of 0.0001590453, but I think it is still too big.
Halley does seem to be a fraction faster, as the previous post illustrates. I've included the complete project file so that you can check that I am timing the right thing. In the great scheme of things the classic powf doesn't do too badly.
EXR images should be linear already, but life being what it is, I can see that you may want to correct them anyway.
Here are a couple of other approximations:
Optimizing conversion between sRGB and linear
sRGB Approximations for HLSL
and I expect that you will have seen
Optimizations for pow() with const non-integer exponent?
A further poke through the Blender code reveals that it seems to have at least one other sRGB-to-linear conversion, never mind the one in OpenImageIO, etc.