Create a unique flag for output shaders with displacement data and use it
to calculate the transformed normal. Implementation suggested by Brecht Van
Lommel.
Reviewers: brecht
Differential Revision: https://developer.blender.org/D890
The idea is to avoid memory allocation when only one segment step is to be allocated.
This gives some speedup which is difficult to measure on this trashcan from hell, but
it's about 7% to 10% in the extreme case of a single volume filling the whole
viewport. This seems to depend on the phase of the bug-o-meter in the studio.
On the Linux boxes the speedup is not as spectacular: it's about 2% on my laptop and
about 3% on the studio desktop. This is likely because of the awesomeness of jemalloc.
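A minimal sketch of the allocation idea, with illustrative structure names (VolumeStep and VolumeSegment here are not the actual Cycles types): keep one step embedded in the segment and only touch the heap when more than one step is required.

#include <cstdlib>

/* Illustrative stand-ins: the real Cycles structures differ. */
struct VolumeStep {
    float t;        /* distance along the ray */
    float sigma[3]; /* extinction for this step */
};

struct VolumeSegment {
    VolumeStep stack_step;  /* embedded slot for the single-step case */
    VolumeStep *steps;      /* points at stack_step or at heap memory */
    int max_steps;
};

/* Only heap-allocate when more than one step is actually needed. */
void volume_segment_init(VolumeSegment *segment, int max_steps)
{
    if (max_steps == 1) {
        segment->steps = &segment->stack_step;  /* no malloc() at all */
    }
    else {
        segment->steps = (VolumeStep *)std::malloc(sizeof(VolumeStep) * max_steps);
    }
    segment->max_steps = max_steps;
}

void volume_segment_free(VolumeSegment *segment)
{
    /* Free only what was heap-allocated. */
    if (segment->steps != &segment->stack_step) {
        std::free(segment->steps);
    }
}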
Add a compile-time check for the particular glibc version which fixed the issue.
This makes self-compiled Blender the fastest in the world, and the
only remaining question is what we should do for release builds.
After some discussion with Campbell we decided to keep it as is for now
because the slowdown is not that noticeable. We'll disable this workaround
for release builds once the majority of the distros have switched to the
new version of glibc.
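A minimal sketch of what such a compile-time check could look like; the version number 2.19 is purely a placeholder for illustration, since the commit doesn't name the fixed release.

#include <cstdlib>  /* any standard header pulls in the glibc version macros */

/* Hypothetical version: only enable the workaround on glibc releases
 * older than the one carrying the fix. */
#if defined(__GLIBC__) && (__GLIBC__ < 2 || (__GLIBC__ == 2 && __GLIBC_MINOR__ < 19))
#  define WITH_LIBC_WORKAROUND 1
#else
#  define WITH_LIBC_WORKAROUND 0
#endif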
That code was mainly needed for the transition period; now we've
got all platforms updated to the new OSL.
Plus there are some crucial fixes baking in the current upstream
sources which we'll need to have for the next Blender release.
Even though it's not 100% clear when we'll switch to OSL-1.6, we'd better
start preparing for this earlier, so we don't spend time on it later.
Plus this code helps with troubleshooting some OSL issues, which requires
testing with the latest versions of OSL.
This mainly happens when over-saturating an already saturated color.
After some discussion with Campbell and loads of tests we decided
to clamp the resulting RGB color. As an alternative we might want to
clamp the corrected HSV values instead, but that would lead to somewhat
larger changes in the render results.
TODO: The same is to be done for compositor nodes.
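A minimal sketch of the clamping, with a simplified float3 stand-in for the kernel type; the actual node code differs, but the idea is that a saturation factor above 1.0 can push the HSV->RGB conversion below zero.

#include <algorithm>

struct float3 { float x, y, z; };  /* simplified stand-in for the kernel type */

/* Clamp the RGB result of the hue/saturation adjustment: with a
 * saturation factor above 1.0 the HSV->RGB conversion can produce
 * negative channel values, so clamp them at zero. */
float3 clamp_hsv_result(float3 color)
{
    color.x = std::max(color.x, 0.0f);
    color.y = std::max(color.y, 0.0f);
    color.z = std::max(color.z, 0.0f);
    return color;
}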
This is basically just a wrapper class, which maps the generic call from the OSL spec to our closures.
Example usage:
shader microfacet_osl(
    color Color = color(0.8),
    int Distribution = 0,
    normal Normal = N,
    vector Tangent = normalize(dPdu),
    float RoughnessU = 0.0,
    float RoughnessV = 0.0,
    float IOR = 1.4,
    int Refract = 0,
    output closure color BSDF = 0)
{
    if (Distribution == 0)
        BSDF = Color * microfacet("ggx", Normal, Tangent, RoughnessU, RoughnessV, IOR, Refract);
    else
        BSDF = Color * microfacet("beckmann", Normal, Tangent, RoughnessU, RoughnessV, IOR, Refract);
}
This way it's easier to extend the bitfields and to see when we start running
out of free bits.
Plus added a brief description of what the SD_VOLUME_CUBIC flag means.
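For illustration only (the enum name and bit positions are made up, SD_VOLUME_CUBIC aside), the explicit-shift style this refers to looks like:

enum ExampleShaderFlag {
    SD_EXAMPLE_A    = (1 << 0),   /* hypothetical flag */
    SD_EXAMPLE_B    = (1 << 1),   /* hypothetical flag */
    /* ... */
    SD_VOLUME_CUBIC = (1 << 10),  /* use cubic interpolation for this volume */
};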
It is a per-material setting which can be found under the Volume settings
in the material and world context buttons.
There could still be some code-wise improvements, like using a variadic
macro for interp3d instead of having interp3d_ex to which you can pass the
interpolation method.
This is the first step towards supporting cubic interpolation for voxel
data (such as smoke and fire). It is not exposed in the interface at all
yet; this is to be done soon after this change.
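A minimal sketch of the interp3d_ex-style dispatch being described; the samplers are trivial stand-ins, not the actual voxel filtering code.

enum InterpolationType {
    INTERPOLATION_LINEAR = 0,
    INTERPOLATION_CUBIC = 1,
};

/* Stand-in samplers: the real kernels do proper trilinear/tricubic
 * filtering, here they just fetch the nearest voxel so the sketch is
 * self-contained (coordinates assumed to be in [0, 1)). */
static float sample_linear(const float *voxels, const int res[3], float x, float y, float z)
{
    int ix = (int)(x * res[0]), iy = (int)(y * res[1]), iz = (int)(z * res[2]);
    return voxels[(iz * res[1] + iy) * res[0] + ix];
}

static float sample_cubic(const float *voxels, const int res[3], float x, float y, float z)
{
    return sample_linear(voxels, res, x, y, z);  /* placeholder */
}

/* Extended entry point: the interpolation method is passed explicitly. */
float interp3d_ex(const float *voxels, const int res[3],
                  float x, float y, float z, InterpolationType interp)
{
    return (interp == INTERPOLATION_CUBIC) ? sample_cubic(voxels, res, x, y, z) :
                                             sample_linear(voxels, res, x, y, z);
}

/* The plain call keeps its previous linear behaviour. */
float interp3d(const float *voxels, const int res[3], float x, float y, float z)
{
    return interp3d_ex(voxels, res, x, y, z, INTERPOLATION_LINEAR);
}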
This is the so-called GPU limitation boundary being hit: the compiler is told
NOT to inline the volume bound function, otherwise some really weird things used to happen.
We actually might want to do the same for the CPU, since inlining everything is not
the way to get the fastest code.
Also reduce the number of branches and multiplications a bit by inlining the branches.
This gives a barely measurable speedup, which in the case of the BMW scene is about 2% here.
Not that it gives a noticeable change in render time, but it's just weird to
convert float4 to float3 just to access individual x/y/z components.
Plus some compilers might be more stupid than GCC and not optimize this
out well.
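A small before/after illustration of the pattern, using simplified stand-ins for the kernel's float3/float4 types and the conversion helper.

struct float3 { float x, y, z; };
struct float4 { float x, y, z, w; };

static float3 float4_to_float3(const float4 a)
{
    return {a.x, a.y, a.z};
}

float average_before(const float4 value)
{
    /* Old pattern: convert to float3 only to read the components. */
    float3 f = float4_to_float3(value);
    return (f.x + f.y + f.z) * (1.0f / 3.0f);
}

float average_after(const float4 value)
{
    /* New pattern: read x/y/z straight from the float4. */
    return (value.x + value.y + value.z) * (1.0f / 3.0f);
}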
After discussion with Campbell here we decided it's better to choose an arbitrary side of the box
(in this case the X axis) and use the image from it. That's better than rendering blackness.
P.S. This is literally a corner case anyway.
The ray actually should have infinite length, so we can detect the camera being
inside a volume which is bigger than the far clipping distance of the camera.
This might also give some speedup (wouldn't expect much though) because we don't
need to re-calculate the ray direction and length after every bounce now.
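A minimal sketch of the idea, with an illustrative Ray struct rather than the actual kernel one.

#include <cfloat>

struct Ray {
    float origin[3];
    float direction[3];
    float t;  /* maximum distance the ray is traced */
};

/* Hypothetical setup for the camera-in-volume check: the ray used to be
 * limited to the camera clip range, which missed volumes whose far side
 * lies beyond the far clipping distance. */
void setup_volume_stack_ray(Ray *ray, const float origin[3], const float direction[3])
{
    for (int i = 0; i < 3; i++) {
        ray->origin[i] = origin[i];
        ray->direction[i] = direction[i];
    }
    ray->t = FLT_MAX;  /* effectively infinite, instead of the camera far clip */
}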
Basically we skip all non-volume objects now in the volume stack function.
Depending on the show it might give some percent of speedup.
Most of the speedup is gained in scenes with an SSS object
intersecting the volume and taking up a reasonable amount of frame space.
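A minimal sketch of the kind of check being described; the flag, array layout, and function names are illustrative, not the actual kernel code.

enum { OBJECT_HAS_VOLUME = (1 << 0) };  /* illustrative per-object flag */

/* Walk the intersections found along the stack-initialization ray and
 * only consider objects that actually have a volume shader attached. */
int count_volume_hits(const int *hit_objects, int num_hits, const int *object_flags)
{
    int num_volume_hits = 0;
    for (int i = 0; i < num_hits; i++) {
        if (!(object_flags[hit_objects[i]] & OBJECT_HAS_VOLUME)) {
            continue;  /* non-volume object (e.g. the SSS mesh): skip it */
        }
        num_volume_hits++;
    }
    return num_volume_hits;
}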
The single precision exponent on 64-bit Linux tends to be an order of magnitude slower
than the double precision version, even with the single<->double precision conversion.
Some feedback on the mailing lists also suggests that logf() is slow, but
this I didn't confirm here in the studio yet.
Depending on the shader setup it gives ~3% with the secret agent shot and up to
around 15% with the BMW scene here.
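A minimal sketch of the kind of workaround this implies (fast_expf is a hypothetical name): route the single-precision call through the double-precision function and convert back.

#include <cmath>

/* Hypothetical wrapper: on affected systems exp() on doubles is much
 * faster than expf(), so pay for the float<->double conversion. */
static inline float fast_expf(float x)
{
    return (float)std::exp((double)x);
}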
Quite a straightforward change; the only annoying thing is that we can't use
indentation for the include directives, just because of the way header inlining
works for OpenCL.
We might do a smarter job in path_source_replace_includes() but I don't want to
spend time on this yet.
* On sm_30 and above there is no change (it was not inlined before either), this just fixes a speed regression from yesterday's commit 6359c36ba4.
* On sm_2x (tested with sm_21), I get a nice 8% speedup in the BMW scene with this. As a bonus, cubin compilation time and memory usage are significantly reduced: the regular cubin size went from 2.5MB to 2.0MB, the Experimental one from 3.8MB to 2.5MB.