* On NVIDIA Kepler GPUs (sm_30 and above), 145 byte images are now available instead of 95.
We could extend this to about 200 if needed.
I could not test this, as I don't have a Kepler GPU, so feedback would be appreciated.
Thanks to Brecht for review and some fixes. :)
* Add CUDA compiler version detection to cmake/scons/runtime
* Remove noinline in kernel_shader.h and re-enable --use_fast_math if CUDA 5.x
is used; these were workarounds for CUDA 4.2 bugs.
* Change max number of registers to 32 for sm 2.x (based on performance tests
from Martijn Berger and confirmed here), and also for NVIDIA OpenCL.
Overall it seems that with these changes and the latest CUDA 5.0 download,
performance is as good as or better than the 2.67b release with the scenes and
graphics cards I tested.
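The runtime side of the compiler version detection above could look roughly like the following. This is a minimal sketch, not the code from the commit: `parse_cuda_release` and `needs_cuda42_workarounds` are hypothetical helpers, and the sketch assumes the caller has already captured the text printed by `nvcc --version`, which contains a `release X.Y` token.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: extract the toolkit version from `nvcc --version`
// output, encoded as major * 10 + minor (e.g. 50 for release 5.0).
// Returns 0 if no "release X.Y" token is found.
int parse_cuda_release(const std::string &nvcc_output)
{
    const std::string marker = "release ";
    const std::size_t pos = nvcc_output.find(marker);
    if (pos == std::string::npos)
        return 0;

    int major = 0, minor = 0;
    const char *version_text = nvcc_output.c_str() + pos + marker.size();
    if (std::sscanf(version_text, "%d.%d", &major, &minor) != 2)
        return 0;

    return major * 10 + minor;
}

// Hypothetical policy: keep the CUDA 4.2 workarounds (noinline, no fast math)
// for anything older than 5.0. An unknown version (0) is treated as old.
bool needs_cuda42_workarounds(int cuda_version)
{
    return cuda_version < 50;
}
```

Treating an unparsable version as "old" is a deliberately conservative choice in this sketch; the workarounds cost some speed but avoid miscompiled kernels.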
On the BMW scene, this gives roughly a 10% speedup overall with clang/gcc, and a 30%
speedup with Visual Studio (2008). It turns out Visual Studio was optimizing the
existing code quite poorly compared to the pretty good autovectorization by clang/gcc,
but hand-written SSE code also gives a smaller speed boost there.
This code isn't enabled yet when using the hair minimum width feature; that still
needs to be made to work with the SSE code.
* Enable the Non-Progressive integrator on GPU (CUDA) for testing.
In order to compile the CUDA kernel with it, you need at least 6GB of system memory and CUDA Toolkit 5.0 or 5.5.
It should also work with CUDA Toolkit 4.2, but in this case you should have 12GB of RAM.
In case any problems arise, just change line 65 of kernel_types.h to disable Non-Progressive again.
-- #define __NON_PROGRESSIVE__
++ //#define __NON_PROGRESSIVE__
* Replaced the Brute Force version with a nice lookup table; this speeds it up a lot.
Patch by Philipp Oeser (lichtwerk) with some cleanup and changes by myself. Thanks!
ToDo:
* Temperature values between 800 and 804 Kelvin are wrong in SVM, check on this.
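The general idea of swapping a brute force evaluation for a lookup table can be sketched as follows. This is an illustration under assumptions, not the code from the patch: `planck_radiance` is only a stand-in for the expensive per-temperature computation, and `BlackbodyTable`, `build_table`, `lookup`, the temperature range and the step size are all hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Stand-in for the expensive computation: Planck spectral radiance at a
// single wavelength, as a function of temperature in Kelvin.
double planck_radiance(double wavelength_m, double temperature_k)
{
    const double h = 6.62607015e-34; // Planck constant (J s)
    const double c = 2.99792458e8;   // speed of light (m/s)
    const double k = 1.380649e-23;   // Boltzmann constant (J/K)
    const double x = (h * c) / (wavelength_m * k * temperature_k);
    return (2.0 * h * c * c) / (std::pow(wavelength_m, 5.0) * std::expm1(x));
}

// Hypothetical table: values sampled at a fixed temperature step.
struct BlackbodyTable {
    double t_min, t_step;
    std::vector<double> values;
};

// Precompute the table once. The range must span at least one full step,
// so that the table holds two or more entries.
BlackbodyTable build_table(double wavelength_m, double t_min, double t_max, double t_step)
{
    BlackbodyTable table = {t_min, t_step, std::vector<double>()};
    const std::size_t count = std::size_t(std::floor((t_max - t_min) / t_step)) + 1;
    for (std::size_t i = 0; i < count; i++)
        table.values.push_back(planck_radiance(wavelength_m, t_min + double(i) * t_step));
    return table;
}

// Per-sample evaluation: clamp to the table range, then interpolate
// linearly between the two nearest entries.
double lookup(const BlackbodyTable &table, double temperature_k)
{
    const double last = double(table.values.size() - 1);
    double f = (temperature_k - table.t_min) / table.t_step;
    f = std::min(std::max(f, 0.0), last);
    const std::size_t i = std::size_t(std::min(std::floor(f), last - 1.0));
    const double w = f - double(i);
    return (1.0 - w) * table.values[i] + w * table.values[i + 1];
}
```

Linear interpolation is least accurate where the tabulated function is steep, so a table like this may need a finer step (or non-uniform spacing) at the low end of the temperature range.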
* First (brute force) implementation for SVM. This works and delivers the same result as OSL, but it's slow.
* Code inside svm_blackbody.h inspired by a patch by Philipp Oeser (#35698), thanks.
Ideas:
* Use a lookup table to perform the calculations on render/ level.
* Implement it as a RNA property only, and do the calculation like Sun/Sky precompute.
and sm_30 cards, so hopefully it should all work now.
Also includes some warnings fixes related to nvcc compiler arguments, should make
no difference otherwise.
* Added a node to convert wavelength (in nanometers, from 380nm to 780nm) to RGB values. This can make it easier to match real-world colors.
* Code cleanup:
** Moved color functions (xyz and hsv) into dedicated utility files.
** Remove svm_lerp(), use interp() instead.
Documentation:
http://wiki.blender.org/index.php/Doc:2.6/Manual/Render/Cycles/Nodes/More#Wavelength
Example render:
http://www.pasteall.org/pic/show.php?id=53202
This is part of my GSoC 2013. (revisions 57322, 57326, 57335 and 57367 from soc-2013-dingto).
to be done in cycles itself to keep compatibility for bytecode too.
Also fix the broken button to compile OSL from the text editors; this broke after
the recent change to disable editing of library linked nodes.
* Added a node to convert wavelength (in nanometers, from 380nm to 780nm) to RGB values. This can make it easier to match real-world colors.
Example render:
http://www.pasteall.org/pic/show.php?id=53202
ToDo:
* Move some functions into a util file, maybe a common util_color.h or so.
* Test GPU, unfortunately sm_21 doesn't work for me yet.
multiple importance sampling, so you can disable them for diffuse/glossy/transmission.
The Light Path node here is still weak and does not give this info. To make that
work we'd need to evaluate the shader multiple times which is slow and we can't
detect well enough when it is actually needed.
instead of sobol. So far one doesn't seem to be consistently better or worse than
the other for the same number of samples but more testing is needed.
The random number generator itself is slower than sobol for most sample counts,
except 16, 64, 256, .. because those can be computed faster. This can probably be
optimized, but we can do that when/if this actually turns out to be useful.
Paper this implementation is based on:
http://graphics.pixar.com/library/MultiJitteredSampling/
Also includes some refactoring of RNG code, fixing a Sobol correlation issue with
the first BSDF and < 16 samples, skipping some unneeded RNG calls and using a
simpler unit square to unit disk function.
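For reference, one simple unit square to unit disk mapping is the polar one sketched below. The commit message does not say which mapping was adopted, so this is only an illustration; `Float2` and `square_to_disk` are hypothetical names.

```cpp
#include <cmath>

// Hypothetical 2D point type for the sketch.
struct Float2 {
    float x, y;
};

// Polar mapping from the unit square to the unit disk. Taking the square
// root of u keeps the point density uniform over the disk area, since the
// area inside radius r grows with r^2.
Float2 square_to_disk(float u, float v)
{
    const float two_pi = 6.28318530717958647692f;
    const float r = std::sqrt(u);
    const float phi = two_pi * v;

    Float2 p;
    p.x = r * std::cos(phi);
    p.y = r * std::sin(phi);
    return p;
}
```

This form needs no branches, at the cost of distorting the sample pattern more than a concentric mapping would.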
* Revert r57203 (len() renaming)
There seems to be a problem with NVIDIA OpenCL after this and I haven't figured out the real cause yet.
Better to selectively enable native length() later, after figuring out what's wrong.
This fixes [#35612].
* Rename some math functions:
len -> length
len_squared -> length_squared
normalize_len -> normalize_length
* This way OpenCL uses its built-in length() function, rather than our own. The other two functions have been renamed for consistency.
* Tested CPU, CUDA and OpenCL compile, should be no functional changes.
* Cycles Mix closure could render strange effects when the user entered a value outside the 0...1 range. This was already clamped for OSL; clamp it for SVM as well.
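The effect of the clamp can be shown with a small sketch. It is not the SVM code itself: `MixWeights` and `mix_closure_weights` are hypothetical, and the sketch assumes the two closures are weighted by `1 - fac` and `fac`.

```cpp
// Hypothetical pair of weights applied to the two mixed closures.
struct MixWeights {
    float weight1, weight2;
};

MixWeights mix_closure_weights(float fac)
{
    // Clamp the user-supplied factor to [0, 1] so neither weight can
    // become negative or exceed 1.
    fac = (fac < 0.0f) ? 0.0f : (fac > 1.0f) ? 1.0f : fac;

    MixWeights w;
    w.weight1 = 1.0f - fac;
    w.weight2 = fac;
    return w;
}
```

Without the clamp, a factor of 2 would give weights of -1 and 2 under this weighting, which is the kind of input that can produce strange results.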
* Support using devices from all OpenCL platforms, so that you can use e.g. both
Intel and NVIDIA OpenCL implementations if you have them installed.
* Fix compile error due to missing fmodf after recent math node change.
* Enable advanced shading for Intel OpenCL.
* CYCLES_OPENCL_DEBUG environment variable for generating debug symbols so you
can debug with gdb. This crashes the compiler with Intel OpenCL on Linux though.
To make this work the preprocessed kernel source code is written out, as gdb
needs this.
* Show OpenCL compiler warnings even if the build succeeded.
* Some small fixes to initialize cdDevice to NULL, add missing NULL check when
creating buffer and add missing space at end of build options for Apple OpenCL.
* Fix crash with multi device + opencl, now e.g. CPU + GPU render should work.
I did a few tweaks to the code and also:
* Fix viewport render failing sometimes with Apple CPU OpenCL; it was not taking
workgroup size limits into account properly.
* Add compile error when advanced shading in the Blender binary and OpenCL kernel
are not in sync.
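Respecting workgroup size limits, as in the viewport fix above, comes down to two small computations, sketched here. The helper names are hypothetical; in a real host program the limit would come from `clGetKernelWorkGroupInfo` with `CL_KERNEL_WORK_GROUP_SIZE`, which this sketch takes as a plain parameter so that it stays self-contained.

```cpp
#include <cstddef>

// Hypothetical helper: never request a larger local size than the device
// allows for this kernel, and never return 0.
std::size_t clamp_local_size(std::size_t wanted, std::size_t device_max)
{
    const std::size_t local = (wanted < device_max) ? wanted : device_max;
    return (local > 0) ? local : 1;
}

// Hypothetical helper: round the global size up to a multiple of the local
// size. The kernel then has to skip the padding work items itself.
std::size_t round_up_global_size(std::size_t work_items, std::size_t local_size)
{
    return ((work_items + local_size - 1) / local_size) * local_size;
}
```

For example, with a requested local size of 64 and a device limit of 1, the local size becomes 1 and the global size needs no padding.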
* Some closures (Toon, Diffuse Ramp) were not assigned to a CLOSURE_IS_* define, which made them invisible on render passes.
* Westin closures had the wrong type: Sheen is Diffuse, Backscatter is Glossy.
* Rename fresnel_dielectric() to fresnel_dielectric_cos() to match SVM, easier when searching code.
* Also remove an old code comment in bsdf_reflection.h from Cycles branch days.
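Why a closure without a CLOSURE_IS_* assignment disappears from render passes can be illustrated with a range-based classification like the one below. This is a sketch under assumptions: the enum, its ordering and the `SKETCH_` macros are hypothetical and much smaller than the real closure list.

```cpp
// Hypothetical closure IDs, ordered so that each category is one
// contiguous range of enum values.
enum SketchClosureId {
    SKETCH_DIFFUSE_ID,
    SKETCH_SHEEN_ID,        // Westin Sheen, counted as diffuse
    SKETCH_DIFFUSE_TOON_ID,
    SKETCH_GLOSSY_ID,
    SKETCH_BACKSCATTER_ID,  // Westin Backscatter, counted as glossy
    SKETCH_TRANSMISSION_ID
};

// Each pass only collects closures that fall inside its range. An ID placed
// outside every range would contribute to no pass at all.
#define SKETCH_CLOSURE_IS_DIFFUSE(id) \
    ((id) >= SKETCH_DIFFUSE_ID && (id) <= SKETCH_DIFFUSE_TOON_ID)
#define SKETCH_CLOSURE_IS_GLOSSY(id) \
    ((id) >= SKETCH_GLOSSY_ID && (id) <= SKETCH_BACKSCATTER_ID)
#define SKETCH_CLOSURE_IS_TRANSMISSION(id) \
    ((id) == SKETCH_TRANSMISSION_ID)
```

With this layout, fixing a wrongly typed closure is a matter of moving its enum entry into the right range.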