SVM nodes need to read all of their data to compute the right offset for the following node.
This is quite weak; a more generic solution would be good in the future.
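As a hypothetical sketch of the constraint (the real Cycles node encoding differs), the shader program is a flat instruction array, so even a node whose result is unused must consume all of its data words for the offset to land on the next node:

    /* Illustrative only: 'opcode' and the inline data count are assumed,
     * not the actual SVM layout. */
    void svm_eval_nodes(const int *program, int size)
    {
        int offset = 0;
        while (offset < size) {
            int opcode = program[offset];
            int num_data = program[offset + 1];
            /* ... interpret num_data words for this opcode ... */
            (void)opcode;
            offset += 2 + num_data;  /* only way to find the next node */
        }
    }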
Our own implementation behaved differently from OSL and the GPU: on border
pixels OSL and CUDA interpolate with black, while we were clamping the
coordinate.
This partially fixes the issue reported in T53452.
A similar change should perhaps also be done for 3D interpolation, but this
is to be investigated separately.
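For reference, a minimal 1D sketch of the intended border behavior, assuming hypothetical read_texel()/sample_linear() helpers (not the actual kernel code): out-of-range texels contribute black instead of the coordinate being clamped.

    #include <math.h>

    /* Texels outside the image contribute black, matching OSL and CUDA. */
    float read_texel(const float *image, int width, int x)
    {
        if (x < 0 || x >= width)
            return 0.0f;
        return image[x];
    }

    float sample_linear(const float *image, int width, float u)
    {
        float x = u * (float)width - 0.5f;
        int x0 = (int)floorf(x);
        float t = x - (float)x0;
        /* Border pixels blend with black rather than a clamped neighbor. */
        return (1.0f - t) * read_texel(image, width, x0) +
               t * read_texel(image, width, x0 + 1);
    }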
Previously, the NLM kernels would be launched once per offset with one thread per pixel.
However, with the smaller tile sizes that are now feasible, there wasn't enough work to fully occupy GPUs, which resulted in a significant slowdown.
Therefore, the kernels are now launched in a single call that handles all offsets at once.
This has two downsides: memory accesses to accumulation buffers now have to be atomic, and more importantly, the temporary memory now has to be allocated for every shift at once, increasing the required memory.
On the other hand, of course, the smaller tiles significantly reduce the amount of memory required.
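A minimal CUDA sketch of the new launch scheme (names and buffer layout are illustrative, not the actual denoiser kernels): one grid covers all (shift, pixel) pairs, and since different shifts accumulate into the same pixel, the writes have to be atomic.

    __global__ void nlm_accumulate_all_shifts(const float *weight,
                                              const float *color,
                                              float *accum,
                                              float *total_weight,
                                              int num_shifts, int w, int h)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int pixels = w * h;
        if (idx >= num_shifts * pixels)
            return;
        int shift = idx / pixels;  /* which offset this thread handles */
        int pixel = idx % pixels;  /* which pixel within the tile */
        float wgt = weight[shift * pixels + pixel];
        /* Multiple shifts write the same pixel, hence the atomics. */
        atomicAdd(&accum[pixel], wgt * color[shift * pixels + pixel]);
        atomicAdd(&total_weight[pixel], wgt);
    }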
The main bottleneck right now is the construction of the transformation: there is nothing left to parallelize there; one thread per pixel is the maximum.
I tried to parallelize the SVD implementation by storing the matrix in shared memory and launching one block per pixel, but that wasn't really going anywhere.
To make the new code somewhat readable, the handling of rectangular regions was cleaned up a bit and commented; it should be easier to understand what's going on now.
Also, some variables have been renamed to make the difference between buffer width and stride more apparent, in addition to some general style cleanup.
CUDA 9.0.176 apparently caused some slowdown on high-end Pascal cards, which can be mitigated by increasing the number of registers. See https://developer.blender.org/F1142667 for a detailed comparison.
Was giving different results for sharpness values of 1.0 and 0.999, even
though they were expected to be really close to each other.
This SSS profile will probably be removed in the future in favor of the more
physically based Burley one, but for the time being there is nothing wrong
with fixing the existing code.
No color pass because it's hard to define what to use as color in a volume.
Reviewers: sergey, brecht
Differential Revision: https://developer.blender.org/D2903
Previously we picked one of the RGB channels with equal probability, but this
works poorly in a dense volume after many bounces. Now we take into account
the throughput and single scattering albedo.
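A sketch of the weighting, assuming hypothetical names (CUDA's float3, a random number rand in [0, 1)): each channel is picked with probability proportional to throughput times single scattering albedo.

    #include <math.h>

    /* Illustrative only, not the actual kernel code. */
    int volume_sample_channel(float3 albedo, float3 throughput, float rand)
    {
        /* Weight each channel by how much it can still contribute. */
        float wr = fabsf(throughput.x * albedo.x);
        float wg = fabsf(throughput.y * albedo.y);
        float wb = fabsf(throughput.z * albedo.z);
        float sum = wr + wg + wb;
        if (sum == 0.0f)
            return 0;
        /* Pick a channel proportional to its weight. */
        if (rand * sum < wr)
            return 0;
        else if (rand * sum < wr + wg)
            return 1;
        return 2;
    }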
This change makes it a little more practical to do brute force SSS with volumes, but
is still very inefficient because we do direct light sampling at every volume
bounce even when inside an opaque mesh. In theory there could be a light inside
the mesh so we can't automatically disable direct lighting.
In fact this was an existing issue when exceeding the number of available
closures, but it's more common now that we set the number to 0 for shadows
and emission.
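A hypothetical sketch of the guard (the names resemble but are not the actual Cycles code): once the per-path closure limit is reached, allocation must fail cleanly instead of writing past the array, and with the limit set to 0 for shadows and emission this path is now taken routinely.

    #include <stddef.h>

    #define MAX_CLOSURE 64

    typedef struct ShaderClosure { float weight[3]; int type; } ShaderClosure;
    typedef struct ShaderData {
        ShaderClosure closure[MAX_CLOSURE];
        int num_closure;
    } ShaderData;

    ShaderClosure *closure_alloc(ShaderData *sd, int max_closures)
    {
        if (sd->num_closure >= max_closures)
            return NULL;  /* caller must handle allocation failure */
        return &sd->closure[sd->num_closure++];
    }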
Goal is to reduce OpenCL kernel recompilations.
Currently viewport renders are still set to use 64 closures as this seems to
be faster and we don't want to cause a performance regression there. Needs
to be investigated.
Reviewed By: brecht
Differential Revision: https://developer.blender.org/D2775
The algorithm averages normals from nearby surfaces. It uses the same
sampling strategy as BSSRDFs, casting rays along the normal and two
orthogonal axes, and combining the samples with MIS.
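The MIS combination here amounts to the one-sample balance heuristic: a sample drawn from one of the three axis distributions is divided by the mixture pdf over all of them. A minimal sketch with hypothetical names (prob_* are the probabilities of picking each axis, pdf_* the pdfs of the sampled point under each axis):

    float mis_mixture_pdf(float prob_n, float pdf_n,
                          float prob_u, float pdf_u,
                          float prob_v, float pdf_v)
    {
        /* Dividing a sample's contribution by this mixture pdf gives an
         * unbiased one-sample MIS estimator. */
        return prob_n * pdf_n + prob_u * pdf_u + prob_v * pdf_v;
    }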
The main concern here is that we are introducing raytracing inside
shader evaluation, which could be quite bad for GPU performance and
stack memory usage. In practice it doesn't seem so bad though.
Note that using this feature can easily slow down renders by 20%, and
that if you care about performance then it's better to use a bevel
modifier. Mainly this is useful for baking, and for cases where the
mesh topology makes it difficult for the bevel modifier to work well.
Differential Revision: https://developer.blender.org/D2803
This causes some difference in the classroom scene, where ray visibility
tricks are used and break the MIS balance. Otherwise there doesn't seem
to be much effect, but better to use the right formulas. Problem originally
identified by Lukas.
With a Titan Xp, reduces path trace local memory from 1092MB to 840MB.
Benchmark performance was within 1% with both RX 480 and Titan Xp.
Original patch was implemented by Sergey.
Differential Revision: https://developer.blender.org/D2249
We are already using the AO distance, so might as well offer this extra
control over the intensity. Useful when an interior scene is supposed to
be significantly darker than the background shader.
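As an illustration of the two controls together, a hypothetical sketch (not the actual kernel code): visibility is evaluated against the AO distance, and the result is scaled by the new intensity factor.

    /* occluder_dist[i] < 0.0f means AO ray i hit nothing. */
    float world_ao(const float *occluder_dist, int num_rays,
                   float ao_distance, float ao_factor)
    {
        int visible = 0;
        for (int i = 0; i < num_rays; i++) {
            if (occluder_dist[i] < 0.0f || occluder_dist[i] > ao_distance)
                visible++;
        }
        /* ao_factor = 1.0 keeps the previous behavior. */
        return ao_factor * (float)visible / (float)num_rays;
    }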