It is basically brute force volume scattering within the mesh, but
implemented as part of the SSS code for faster performance. The main
difference from actual volume scattering is that we assume the
boundaries are diffuse and that all lighting comes through this
boundary from outside the volume. This gives much more accurate
results for thin features and low densities.
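For illustration, here is a minimal, self-contained sketch of the
random-walk idea (not the actual kernel code: the unit sphere standing
in for the mesh, the exit test and the sigma_t/albedo values are all
made up):

    /* Scatter inside a unit-sphere "mesh" until the walk exits,
     * treating the boundary as where light enters and leaves. */
    #include <cmath>
    #include <cstdio>
    #include <random>

    struct Vec { float x, y, z; };

    int main()
    {
      std::mt19937 rng(42);
      std::uniform_real_distribution<float> u(0.0f, 1.0f);
      const float sigma_t = 8.0f;  /* Extinction coefficient. */
      const float albedo = 0.9f;   /* Single-scattering albedo. */
      Vec P = {0.0f, 0.0f, 0.0f};  /* Start inside the volume. */
      Vec D = {0.0f, 0.0f, 1.0f};
      float throughput = 1.0f;
      for (int bounce = 0; bounce < 1024; bounce++) {
        /* Free-flight distance from the exponential distribution. */
        const float t = -std::log(1.0f - u(rng)) / sigma_t;
        P = {P.x + t * D.x, P.y + t * D.y, P.z + t * D.z};
        if (P.x * P.x + P.y * P.y + P.z * P.z >= 1.0f) {
          /* Walk left the mesh: evaluate lighting at the boundary. */
          std::printf("exited, throughput %f, bounces %d\n",
                      throughput, bounce);
          return 0;
        }
        throughput *= albedo; /* Absorption at each scatter event. */
        /* Isotropic scattering: uniform direction on the sphere. */
        const float z = 1.0f - 2.0f * u(rng);
        const float phi = 6.2831853f * u(rng);
        const float r = std::sqrt(std::fmax(0.0f, 1.0f - z * z));
        D = {r * std::cos(phi), r * std::sin(phi), z};
      }
      std::printf("terminated inside the volume\n");
      return 0;
    }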
Some challenges remain, however:
* Significantly noisier than the BSSRDF. Adding Dwivedi sampling may
  help here, but it is still unclear how much it helps in real-world
  cases.
* Because this is a volumetric method, geometry like the eyes or the
  mouth can darken the skin on the outside. We may be able to reduce
  this effect, or users can compensate for it by reducing the
  scattering radius in such areas.
* Sharp corners are quite bright. This matches actual volume rendering
  and the results of some other renderers, but maybe not so much
  real-world objects.
Differential Revision: https://developer.blender.org/D3054
The intention of this commit is to address the issues mentioned in
reports T43865, T50164 and T50452.
The code is based on the Embree code, with some extra vectorization
to speed up single-ray-to-single-triangle intersection.
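For reference, here is a hedged scalar sketch of the watertight
ray/triangle test from Woop, Benthin and Wald (2013) that the
Embree-based code follows; it illustrates the algorithm, not the
committed kernel code:

    #include <algorithm>
    #include <cmath>

    struct Vec { float v[3]; };

    bool watertight_isect(const Vec &org, const Vec &dir, float tmax,
                          const Vec &a, const Vec &b, const Vec &c,
                          float *t_out)
    {
      /* Dominant ray axis and shear so the ray maps onto +z; in real
       * code this part is precomputed once per ray, not per triangle. */
      int kz = 0;
      if (std::fabs(dir.v[1]) > std::fabs(dir.v[kz])) kz = 1;
      if (std::fabs(dir.v[2]) > std::fabs(dir.v[kz])) kz = 2;
      int kx = (kz + 1) % 3, ky = (kx + 1) % 3;
      if (dir.v[kz] < 0.0f) std::swap(kx, ky); /* Preserve winding. */
      const float Sx = dir.v[kx] / dir.v[kz];
      const float Sy = dir.v[ky] / dir.v[kz];
      const float Sz = 1.0f / dir.v[kz];

      /* Translate vertices to the ray origin and apply the shear. */
      float A[3], B[3], C[3];
      for (int i = 0; i < 3; i++) {
        A[i] = a.v[i] - org.v[i];
        B[i] = b.v[i] - org.v[i];
        C[i] = c.v[i] - org.v[i];
      }
      const float Ax = A[kx] - Sx * A[kz], Ay = A[ky] - Sy * A[kz];
      const float Bx = B[kx] - Sx * B[kz], By = B[ky] - Sy * B[kz];
      const float Cx = C[kx] - Sx * C[kz], Cy = C[ky] - Sy * C[kz];

      /* Scaled barycentric coordinates via 2D edge functions. */
      const float U = Cx * By - Cy * Bx;
      const float V = Ax * Cy - Ay * Cx;
      const float W = Bx * Ay - By * Ax;
      if ((U < 0.0f || V < 0.0f || W < 0.0f) &&
          (U > 0.0f || V > 0.0f || W > 0.0f))
        return false; /* Mixed signs: no hit. */

      const float det = U + V + W;
      if (det == 0.0f)
        return false; /* Ray parallel to (or grazing) the triangle. */

      /* Scaled hit distance, compared against [0, tmax]*det. */
      const float T = U * Sz * A[kz] + V * Sz * B[kz] + W * Sz * C[kz];
      const float sign = det < 0.0f ? -1.0f : 1.0f;
      if (sign * T < 0.0f || sign * T > std::fabs(det) * tmax)
        return false;

      *t_out = T / det;
      return true;
    }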
Unfortunately, such a fix does not come for free. There is some
slowdown on AVX2 processors, mainly due to the different vectorization
code, which causes a different number of instructions to be executed
and different instructions-per-cycle counts. On the other hand, this
commit makes pre-AVX2 platforms such as AVX and SSE4.1 a bit faster.
The performance goes as follows:
             2.78c AVX2   2.78c AVX   Patch AVX2          Patch AVX
BMW          05:21.09     06:05.34    05:32.97 (+3.5%)    05:34.97 (-8.5%)
Classroom    16:55.36     18:24.51    17:10.41 (+1.4%)    17:15.87 (-6.3%)
Fishy Cat    08:08.49     08:36.26    08:09.19 (+0.2%)    08:12.25 (-4.7%)
Koro         11:22.54     11:45.24    11:13.25 (-1.5%)    11:43.81 (-0.3%)
Barcelone    14:18.32     16:09.46    14:15.20 (-0.4%)    14:25.15 (-10.8%)
On the GPU the performance is about 1.5-2% slower in my tests on a
GTX 1080, but I'm afraid we can't do much about that as part of this
change, and should consider it a price to pay for a more proper
intersection check.
Made in collaboration with Maxym Dmytrychenko, big thanks to him!
Reviewers: brecht, juicyfruit, lukasstockner97, dingto
Differential Revision: https://developer.blender.org/D1574
Avoid construction of a temporary array and make the utility function
force-inlined. Additionally, avoid calling float4_to_float3 twice.
This brings render times back to the same values as before the current
patch series.
This is preparation work for the follow-up commit which will move the
remaining parts of the Woop intersection logic to a utility file.
Doing it as a separate commit to keep changes more atomic and easier
to bisect when/if needed.
Similar to the previous commit, this avoids the negative effect of bad
branch prediction. Gives a measurable speedup of up to ~2% in tests
here.
Once again, thanks to Maxym Dmytrychenko!
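For illustration, the kind of change involved looks roughly like the
made-up example below (not the actual kernel code): a data-dependent
branch replaced by a branchless min/max select, so mispredictions on
random data no longer stall the pipeline.

    #include <algorithm>

    float clamp_branchy(float x)
    {
      if (x < 0.0f) return 0.0f; /* Hard to predict on random data. */
      if (x > 1.0f) return 1.0f;
      return x;
    }

    float clamp_branchless(float x)
    {
      /* min/max typically compile to minss/maxss, with no branches. */
      return std::min(std::max(x, 0.0f), 1.0f);
    }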
Similar to the regular triangle intersection case. Gives about a 3%
speedup rendering an SSS object on my desktop.
Question: how do we avoid such code duplication in a nice way without
losing speed?
This commit basically vectorizes the existing code using AVX2
instructions (without modifying the algorithm itself). This gives
quite nice speedups:
BMW: -8%
Classroom: -5%
Cat: -5%
Koro: +1%
Barcelona: -8%
That's on a Linux machine; the reported performance improvement on
Windows goes up to 20%.
I'm currently not sure why Koro is somewhat slower. Since it mainly
uses curve intersection tests, it could be timing noise, or perhaps
something to do with cache utilization. In any case, the speedup in
the other scenes makes me think the current state is acceptable for an
initial implementation.
This is again inspired by Maxym Dmytrychenko.
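For illustration, here is a hedged sketch of what such vectorization
looks like (hypothetical helper, not the committed code): with a
structure-of-arrays layout, one fused multiply-subtract evaluates the
2D edge function for eight triangles at once.

    #include <immintrin.h>

    /* U = Cx*By - Cy*Bx for eight triangles in parallel. */
    static inline __m256 edge_fn8(__m256 cx, __m256 cy,
                                  __m256 bx, __m256 by)
    {
      /* _mm256_fmsub_ps(a, b, c) computes a*b - c in one instruction
       * (FMA is available on all AVX2 CPUs we target). */
      return _mm256_fmsub_ps(cx, by, _mm256_mul_ps(cy, bx));
    }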
In the triangle intersection refinement code, rays that are parallel to the triangle caused a divide by zero.
These rays might initially hit the triangle due to the watertight intersection test, but are very rare - therefore, just skipping the refinement for them works fine.
Also, a few remaining issues in the MultiGGX code are fixed that were caused by rays parallel to the surface (which happened more often there due to smooth shading).
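For illustration, a self-contained sketch of the refinement guard
(names are illustrative, not the actual Cycles functions): re-deriving
t from the plane equation divides by dot(D, Ng), which is zero for
parallel rays.

    struct V3 { float x, y, z; };
    static float dot(V3 a, V3 b)
    {
      return a.x * b.x + a.y * b.y + a.z * b.z;
    }
    static V3 at(V3 o, V3 d, float t)
    {
      return {o.x + t * d.x, o.y + t * d.y, o.z + t * d.z};
    }

    V3 refine_hit(V3 O, V3 D, float t, V3 v0, V3 Ng)
    {
      const float d_dot_n = dot(D, Ng);
      if (d_dot_n == 0.0f)
        return at(O, D, t); /* Parallel: keep the unrefined hit. */
      /* Project onto the plane: t' = dot(v0 - O, Ng) / dot(D, Ng). */
      const V3 to_v0 = {v0.x - O.x, v0.y - O.y, v0.z - O.z};
      return at(O, D, dot(to_v0, Ng) / d_dot_n);
    }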
The issue was caused by the SSS intersection code gathering all
intersections without checking for duplicates. This caused situations
where the same intersection would be recorded twice if a triangle is
shared by several BVH nodes.
Usually this is handled by checking the intersection distance after
sorting the intersections (in shadow_blocked for example), but for SSS
we don't do such sorting and we use the number of intersections to
calculate various things.
Didn't find anything smarter than to check the intersection distance
in triangle_intersect_subsurface().
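For illustration, the check amounts to something like this hedged
sketch (simplified types, not the actual
triangle_intersect_subsurface() code):

    struct Hit { float t; int prim; };

    /* Before recording a new hit, compare its distance against the
     * hits gathered so far for the same probe ray. */
    static bool already_recorded(const Hit *hits, int num_hits, float t)
    {
      for (int i = 0; i < num_hits; i++)
        if (hits[i].t == t)
          return true; /* Same triangle, reached via another BVH node. */
      return false;
    }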
This solves the render artifacts at the cost of a 1.5% slowdown in an
extreme case (an SSS object filling the whole FullHD screen).
Reviewers: brecht
Reviewed By: brecht
Differential Revision: https://developer.blender.org/D2105
There are several internal changes for this:
The first idea is to make __tri_verts behave similar to __tri_storage,
meaning the __tri_verts array now contains all vertices of all
triangles instead of just the mesh vertices. This saves a lookup when
reading triangle coordinates in functions like triangle_normal().
In order to make this efficient, the global triangle offset needs to
be stored somewhere. So now __tri_vindex.w contains a global triangle
index which can be used to read the triangle vertices.
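For illustration, a hedged sketch of the lookup this layout enables
(types simplified, the exact kernel accessors differ):

    struct Uint4 { unsigned x, y, z, w; };
    struct Float4 { float x, y, z, w; };

    /* The .w component holds the offset of the triangle's first
     * vertex inside the flat vertex array. */
    static Float4 triangle_vertex(const Float4 *tri_verts,
                                  const Uint4 *tri_vindex,
                                  int prim, int corner /* 0, 1 or 2 */)
    {
      return tri_verts[tri_vindex[prim].w + corner];
    }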
Additionally, the order of vertices in that array is aligned with the
primitives from the BVH. This is needed to keep the cache as coherent
as possible during BVH traversal. It requires some extra tricks to
fill the array in and to deal with True Displacement, but that
trickery is fully required to prevent a noticeable slowdown.
The next idea was to use this __tri_verts instead of __tri_storage in
the intersection code. Unfortunately, this is quite tricky to do
without a noticeable speed loss. This loss is mainly caused by the
extra lookup needed to access the vertex coordinates.
Fortunately, tricks here and there (i.e. some type changes to avoid
casts which don't really come for free) reduce those losses to an
acceptable level, so now they are within a couple of percent only.
On the positive side we've achieved:
- A few percent of memory saved in triangle-only scenes. The actual
  saving in this case is close to the size of all the vertices.
  In more finely subdivided scenes this benefit might become more
  obvious.
- A huge memory saving in hairy scenes. For example, on koro.blend
  there is about a 20% memory saving; a similar figure for bunny.blend.
This memory saving was the main goal of this commit, to move forward
with Hair BVH, which requires more memory per BVH node. So while this
sounds exciting, the memory optimization will be made invisible by the
upcoming Hair BVH work.
But again, on the positive side, we can add an option to NOT use Hair
BVH, and then we'll have same-ish render times as we currently get,
but with this 20% memory benefit in hairy scenes.
OpenCL seems to work fine here, and for some reason that comparison
was giving a compilation error on OpenCL. Better to compile the OpenCL
kernel than to be fully robust to weird corner cases.
Still not sure how to properly solve the issue; it needs some trickery
to get the actual optimized values out of the intersection function
(using printf() prevents some optimization and makes things render
correctly).
For the time being let's just simplify the check.
This commit introduces an SSS-oriented intersection structure which
replaces the old logic of having separate arrays for just the
intersections and the shader data, and encapsulates all the data
needed for SSS evaluation.
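For illustration, such a structure can look roughly like the sketch
below (field names and sizes are hypothetical, not the committed
layout):

    struct Vec3 { float x, y, z; };
    struct Hit { float t, u, v; int prim, object; };
    enum { MAX_SSS_HITS = 4 }; /* Made-up cap. */

    /* Everything SSS evaluation needs for the gathered hits lives in
     * one place instead of separate intersection/shader-data arrays. */
    struct SubsurfaceIntersection {
      Hit hits[MAX_SSS_HITS];    /* Gathered probe-ray hits. */
      Vec3 weight[MAX_SSS_HITS]; /* Per-hit throughput. */
      int num_hits;              /* How many slots are filled. */
    };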
This gives a huge stack memory saving on the GPU. In my own
experiments it gave a 25% memory usage reduction on a GTX 560 Ti
(722MB vs. 946MB).
Unfortunately, it also comes with a performance loss of 20%, which
only happens on the GPU. This is perhaps due to a different memory
access pattern. Will hopefully be solved in the future.
Famous saying: what you win in memory you lose in time (which is also
valid the other way around).
It was possible to miss some intersections due to a wrong barycentric
coordinate sign check.
Cases where one of the coordinates is zero and the others are negative
were not handled correctly.
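For illustration, the corrected acceptance test amounts to something
like this hedged sketch, using the scaled barycentric coordinates U, V
and W of the watertight test:

    /* Reject only when the signs are genuinely mixed, so a zero
     * coordinate next to all-negative ones (a valid hit with a
     * negative determinant) is no longer discarded. */
    static bool barycentric_signs_ok(float U, float V, float W)
    {
      const bool any_neg = (U < 0.0f || V < 0.0f || W < 0.0f);
      const bool any_pos = (U > 0.0f || V > 0.0f || W > 0.0f);
      return !(any_neg && any_pos);
    }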
The epsilon was quite arbitrary for the GPU; it is replaced with a
check for zero-sized faces. This should solve both the original report
and the new one. After the release we can look into why the GPU
doesn't produce accurate math here and get to the root of the issue.
Found a way to make AVX2 CPUs happy by reshuffling the instructions a
bit, so now there are no weird precision errors happening in there.
This solves some render speed regressions on the CPU, but
unfortunately it doesn't help for GPU rendering.
The initial idea was to optimize the calculation a bit by skipping the
computation of the actual triangle edges and using vectors from the
ray origin to the triangle instead. In practice this optimization
didn't quite work when the origin point is too close to the triangle.
Let's ship 2.76 with the slightly more complicated calculation. Still
looking into the exact reasons why the watertight intersection fails
in certain cases, but the actual fix might not be ready so soon.
This fixes wrong eyes on the lady from T46013.
The issue was caused by some numeric instability in the triangle
intersection, which was visible on AVX2 CPUs and GPUs (at least sm_20
here), but maybe on some others too.
Committing rather a workaround for now to be safe for the release;
this still needs some investigation.
In tests with the grass field from the Gooseberry project I didn't see
a measurable slowdown.
This gives quite the same problems as the experimental CUDA kernels,
and until the root cause of the problem is found we'll just explicitly
uninline the function.
The issue was caused by MSVC not being able to optimize some code out
in the same way GCC/Clang do, so those parts of the code are now
explicitly unfolded in order to help the compiler out.
This makes the speed loss much less drastic on my laptop. That's
probably as good as we can do with MSVC without investing an infinite
amount of time trying to work around the optimizer.