Commit Graph

1960 Commits

Author SHA1 Message Date
e361adbca2 Cycles: Fix compilation error of LCG RNG 2017-03-17 09:58:08 +01:00
60a344b43d Cycles: Fix handling of barriers 2017-03-17 01:54:04 -04:00
1cad64900e Cycles: Define ccl_local variables in kernel functions
Declaring ccl_local in a device function is not supported
by certain compilers.
2017-03-16 11:27:17 +01:00
1ff753baa4 Cycles: Workaround for compilation error caused by passing KernelGlobals
Pass globals as a bare pointer, same as it sued to be prior to split kernel rework.

AMD CPU platform and Intel OpenCL were complaining about this.

Perhaps we shouldn't pass globals as pointer at all, this isn't something what is
really portable and can cause issues on 32 bit perhaps.
2017-03-16 11:27:17 +01:00
26620f3f87 Cycles: Avoid some ccl_local in various kernels 2017-03-16 11:27:17 +01:00
8dd0355c21 Cycles: Try to avoid infinite loops by catching invalid ray states 2017-03-14 06:22:57 -04:00
76acaefdd7 Cycles: Cleanup, wipe obviously outdated parts of split kernel comments 2017-03-13 17:16:16 +01:00
0c72008592 fix msvc warnings about unknown opencl pragmas 2017-03-13 10:08:14 -06:00
aa36c73c33 Cycles: Add missing header in the file 2017-03-13 16:59:09 +01:00
Hristo Gueorguiev
f169ff8b88 Fix T50925: Add AO approximation to split kernel 2017-03-13 11:15:58 +01:00
8794a43b68 Cycles: Make MESA compiler more happy
While this compiler is not officially supported yet, getting it to work is
a nice thing because more and more AMD cards will fall under MESA driver.

It's also nice to use explicit comparison with NULL, which makes it more
clear whether variable is a boolean or pointer. Even Rust enforces this!

Patch by Ian Bruce with own modifications.
2017-03-13 09:57:25 +01:00
96868a3941 Fix T50888: Numeric overflow in split kernel state buffer size calculation
Overflow led to the state buffer being too small and the split kernel to
get stuck doing nothing forever.
2017-03-11 05:39:28 -05:00
59fd21296a Cycles: Cleanup, extra semicolon and space 2017-03-10 15:38:30 +01:00
4a2cde3f0e Cycles: Enable SSS and volumes for CUDA and Nvidia OpenCL split kernel 2017-03-10 02:09:41 -05:00
Hristo Gueorguiev
9de9f25b24 Cycles: add single program debug option for split kernel
Single program generally compiles kernels faster (2-3 times), loads faster,
takes less drive space (2-3 times), and reduces the number of cached kernels.
2017-03-09 17:09:37 +01:00
Hristo Gueorguiev
06c051363b Cycles: split kernel_shadow_blocked to AO & DL parts
Reduces memory allocation for split kernel.

This allows for faster rendering due to bigger global size,
specially when GPU memory is limited.

Perfromance results:

                         R9 290 total render time
                        Before    After   Change
BMW                      4:37      4:34   -1.1 %
Classroom               14:43     14:30   -1.5 %
Fishy Cat               11:20     11:04   -2.4 %
Koro                    12:11     12:04   -1.0 %
Pabellon Barcelona      22:01     20:44   -5.8 %
Pabellon Barcelona(*)   15:32     15:09   -2.5 %

(*) without glossy connected to volume
2017-03-09 17:09:37 +01:00
Hristo Gueorguiev
e8b5a5bf5b Cycles: Speedup transparent shadows in split kernel
This commit enables record-all transparent shadows rays.

Perfromance results:

               R9 290 render time (without synchronization), seconds
                        Before    After   Change
BMW                      261.5    262.5   +0.4 %
Classroom                869.6    867.3   -0.3 %
Fishy Cat                657.4    639.8   -2.7 %
Koro                    1909.8    692.8  -63.7 %
Pabellon Barcelona      1633.3   1238.0  -24.2 %
Pabellon Barcelona(*)   1158.1    903.8  -22.0 %

(*) without glossy connected to volume
2017-03-09 17:09:37 +01:00
Hristo Gueorguiev
57e26627c4 Cycles: SSS and Volume rendering in split kernel
Decoupled ray marching is not supported yet.

Transparent shadows are always enabled for volume rendering.

Changes in kernel/bvh and kernel/geom are from Sergey.
This simiplifies code significantly, and prepares it for
record-all transparent shadow function in split kernel.
2017-03-09 17:09:37 +01:00
c837bd5ea5 Cycles: Fix CUDA build error for some compilers
Needed to include `util_types.h` before using `uint`.
2017-03-08 16:44:43 -05:00
712f7c3640 Cycles: Make it possible to access KernelGlobals from split data initialization function 2017-03-08 11:02:54 +01:00
ef7c36f5ed Cycles: Cleanup, remove residue of previous split kernel data
This is all in split data state array.
2017-03-08 10:26:29 +01:00
64751552f7 Cycles: Fix indentation 2017-03-08 01:31:32 -05:00
fe7cc94dfa Cycles: Fix strict warning about unused variable 2017-03-08 01:31:32 -05:00
306034790f Cycles: Calculate size of split state buffer kernel side
By calculating the size of the state buffer in the kernel rather than the host
less code is needed and the size actually reflects the requested features.

Will also be a little faster in some cases because of larger global work size.
2017-03-08 01:31:30 -05:00
223f45818e Cycles: Initialize rng_state for split kernel
Because the split kernel can render multiple samples in parallel it is
necessary to have everything initialized before rendering of any samples
begins. The code that normally handles initialization of
`rng_state` (`kernel_path_trace_setup()`) only does so for the first sample,
which was causing artifacts in the split kernel due to uninitialized
`rng_state` for some samples.

Note that because the split kernel can render samples in parallel this
means that the split kernel is incompatible with the LCG.
2017-03-08 01:31:09 -05:00
cd7d5669d1 Cycles: Remove sum_all_radiance kernel
This was only needed for the previous implementation of parallel samples. As
we don't have that any more it can be removed.

Real reason for removal tho is this: `per_sample_output_buffers` was being
calculated too small and artifacts resulted. The tile buffer is already
the correct size and calculating the size for `per_sample_output_buffers`
is a bit difficult with the current layout of the code. As
`per_sample_output_buffers` was only needed for `sum_all_radiance`,
removing that kernel and writing output to the tile buffer directly
fixes the artifacts.
2017-03-08 01:31:07 -05:00
4cf501b835 Cycles: Split path initialization into own kernel
This makes it easier to initialize things correctly in the data_init kernel
before they are needed by path tracing.
2017-03-08 01:30:43 -05:00
817873cc83 Cycles: CUDA implementation of split kernel 2017-03-08 01:24:53 -05:00
0892352bfe Cycles: CPU implementation of split kernel 2017-03-08 00:52:41 -05:00
352ee7c3ef Cycles: Remove ccl_fetch and SOA 2017-03-08 00:52:41 -05:00
230c00d872 Cycles: OpenCL split kernel refactor
This does a few things at once:

- Refactors host side split kernel logic into a new device
  agnostic class `DeviceSplitKernel`.
- Removes tile splitting, a new work pool implementation takes its place and
  allows as many threads as will fit in memory regardless of tile size, which
  can give performance gains.
- Refactors split state buffers into one buffer, as well as reduces the
  number of arguments passed to kernels. Means there's less code to deal
  with overall.
- Moves kernel logic out of OpenCL kernel files so they can later be used by
  other device types.
- Replaced OpenCL specific APIs with new generic versions
- Tiles can now be seen updating during rendering
2017-03-08 00:52:41 -05:00
520b53364c Cycles: Add OpenCL kernel for zeroing memory buffers
Transferring memory to the device was very slow and there's really no
need when only zeroing a buffer.
2017-03-08 00:52:41 -05:00
87f236cd10 Cycles: Fix division by zero in volume code which was producing -nan 2017-02-28 17:33:06 +01:00
8c5826f59a Fix T50698: Cycles baking artifacts with transparent surfaces. 2017-02-25 03:12:53 +01:00
4e9b17da4c Cycles: Speedup by avoiding extra calculations in noise texture when unneeded
Noise texture is now faster when the color socket is unused. Potential for
speedup spotted by @nutel.

Some performance results:

                     Render Time Before    After    Difference
Gooseberry benchmark         47:51.34    45:55.57       -4%
Koro                         12:24.92    12:18.46     -0.8%
Simple cube (Color socket)      48.53       48.72     +0.3%
Simple cube (Fac socket)        48.74       32.78    -32.7%
Goethe displacement           1:21.18     1:08.47    -15.6%
Cycles brick displacement     3:02.38     2:16.76    -25.0%
Large displacement scene     23:54.12    20:09.62    -15.6%

Reviewed By: sergey

Differential Revision: https://developer.blender.org/D2513
2017-02-21 07:24:33 -05:00
fe47163a1e Cycles: Fix CUDA compilation error after recent changes 2017-02-15 15:01:08 +01:00
8b8c0d0049 Cycles: Don't calculate primitive time if BVH motion steps are not used
Solves memory regression by the default configuration.
2017-02-15 12:59:31 +01:00
6cdc954e8c Cycles: Pass special flag whether BVH motion steps are used
Doesn't currently change anything, but would need for some future
work here.

It uses existing padding in kernel BVH structure, so there is
nothing changed memory-wise.
2017-02-15 12:45:06 +01:00
dc7bbd731a Cycles: Fix wrong hair render results when using BVH motion steps
The issue here was mainly coming from minimal pixel width feature
which is quite commonly enabled in production shots.

This feature will use some probabilistic heuristic in the curve
intersection function to check whether we need to return intersection
or not. This probability is calculated for every intersection check.
Now, when we use multiple BVH nodes for curve primitives we increase
probability of that primitive to be considered a good intersection
for us. This is similar to increasing minimal width of curve.

What is worst here is that change in the intersection probability
fully depends on exact layout of BVH, meaning probability might
change differently depending on a view angle, the way how builder
binned the primitives and such. This makes it impossible to do
simple check like dividing probability by number of BVH steps.

Other solution might have been to split BVH into fully independent
trees, but that will increase memory usage of all the static
objects in the scenes, which is also not something desirable.

For now used most simple but robust approach: store BVH primitives
time and test it in curve intersection functions. This solves the
regression, but has two downsides:

- Uses more memory.

  which isn't surprising, and ANY solution to this problem will
  use more memory.

  What we still have to do is to avoid this memory increase for
  cases when we don't use BVH motion steps.

- Reduces number of maximum available textures on pre-kepler cards.

  There is not much we can do here, hardware gets old but we need
  to move forward on more modern hardware..
2017-02-15 12:45:04 +01:00
930186d3df Cycles: Optimize sorting of transparent intersections on CUDA 2017-02-13 18:24:45 +01:00
21dbfb7828 Cycles: Fix wrong transparent shadows with CUDA
Was a bug in recent optimization commit.
2017-02-13 18:22:10 +01:00
b16fd22018 Cycles: Fix regression with transparent shadows in volume 2017-02-08 14:00:48 +01:00
da31a82832 Cycles: Solve speed regression by casting opaque ray first 2017-02-08 14:00:48 +01:00
04cf1538b5 Cycles: Fix compilation error on OpenCL 2017-02-08 14:00:48 +01:00
31a025f51e Cycles: Split shadow functions to avoid some duplicated calculations 2017-02-08 14:00:48 +01:00
dde40989f3 Cycles: Store shadow intersections in the kernel globals
Seems CUDA failed to de-duplicate the array across multiple inlined
versions of the shadow_blocked(). Helped it a bit with that now.

Gives about 100MB memory improvement on a scenes after previous
commit and brings up memory "regression" to only 100MB comparing to
the master branch now.
2017-02-08 14:00:48 +01:00
7447950bc3 Cycles: Speedup transparent shadows on CUDA
This commit enables record-all behavior of transparent shadows
rays.

Render times difference goes as following:

               GTX 1080 render time
BMW                  -0.5%
Fishy Cat            -0.0%
Pabellon Barcelona   -11.6%
Classroom            +1.2%
Koro                 -58.6%

Kernel will now use some extra VRAM memory to store the intersection
array (200MB on my configuration). This we can optimize out with some
further commits.
2017-02-08 14:00:48 +01:00
9830eeb44b Cycles: Implement record-all transparent shadow function for GPU
The idea is to record all possible transparent intersections when
shooting transparent ray on GPU (similar to what we were doing  on
CPU already).

This avoids need of doing whole ray-to-scene intersections queries
for each intersection and speeds up a lot cases like transparent
hair in the cost of extra memory.

This commit is a base ground for now and this feature is kept
disabled for until some further tweaks.
2017-02-08 14:00:48 +01:00
9c3d202e56 Cycles: Use an utility function to sort intersections array 2017-02-08 14:00:48 +01:00
58a10122d0 Cycles: Make GPU version of shadow_blocked() closer to CPU
Now we break the traversal cycle and then perform volume attenuation
and check with zero throughput. Not sure it makes any measurable sense
at this moment, but in the future it might help de-duplicating some
extra logic here.
2017-02-08 14:00:48 +01:00