Commit Graph

13 Commits

Author SHA1 Message Date
e3e16cecc4 Code refactor: remove rng_state buffer and compute hash on the fly.
A little faster on some benchmark scenes, a little slower on others, seems
about performance neutral on average and saves a little memory.
2017-10-04 21:11:14 +02:00
45dcd20ca9 Cycles: CUDA split performance tweaks, still far from megakernel.
On Pabellon, 25.8s mega, 35.4s split before, 32.7s split after.
2017-08-05 14:32:59 +02:00
ea846a4dfc Cycles: Add kernel to enqueue inactive rays
The queue will be used to make reuse of inactive threads to keep
the GPU more busy.
2017-06-10 03:51:18 -04:00
Hristo Gueorguiev
6bf4115c13 Cycles: Split kernel - sort shaders
Reduce thread divergence in kernel_shader_eval.

Rays are sorted in blocks of 2048 according to shader->id.

On R9 290 Classroom is ~30% faster, and Pabellon Barcelone is ~8% faster.

No sorting for CUDA split kernel.

Reviewers: sergey, maiself

Reviewed By: maiself

Differential Revision: https://developer.blender.org/D2598
2017-05-03 15:30:45 +02:00
915766f42d Cycles: Branched path tracing for the split kernel
This implements branched path tracing for the split kernel.

General approach is to store the ray state at a branch point, trace the
branched ray as normal, then restore the state as necessary before iterating
to the next part of the path. A state machine is used to advance the indirect
loop state, which avoids the need to add any new kernels. Each iteration the
state machine recreates as much state as possible from the stored ray to keep
overall storage down.

Its kind of hard to keep all the different integration loops in sync, so this
needs lots of testing to make sure everything is working correctly. We should
probably start trying to deduplicate the integration loops more now.

Nonbranched BMW is ~2% slower, while classroom is ~2% faster, other scenes
could use more testing still.

Reviewers: sergey, nirved

Reviewed By: nirved

Subscribers: Blendify, bliblubli

Differential Revision: https://developer.blender.org/D2611
2017-05-02 14:26:46 -04:00
0579eaae1f Cycles: Make all #include statements relative to cycles source directory
The idea is to make include statements more explicit and obvious where the
file is coming from, additionally reducing chance of wrong header being
picked up.

For example, it was not obvious whether bvh.h was refferring to builder
or traversal, whenter node.h is a generic graph node or a shader node
and cases like that.

Surely this might look obvious for the active developers, but after some
time of not touching the code it becomes less obvious where file is coming
from.

This was briefly mentioned in T50824 and seems @brecht is fine with such
explicitness, but need to agree with all active developers before committing
this.

Please note that this patch is lacking changes related on GPU/OpenCL
support. This will be solved if/when we all agree this is a good idea to move
forward.

Reviewers: brecht, lukasstockner97, maiself, nirved, dingto, juicyfruit, swerner

Reviewed By: lukasstockner97, maiself, nirved, dingto

Subscribers: brecht

Differential Revision: https://developer.blender.org/D2586
2017-03-29 13:41:11 +02:00
1cad64900e Cycles: Define ccl_local variables in kernel functions
Declaring ccl_local in a device function is not supported
by certain compilers.
2017-03-16 11:27:17 +01:00
96868a3941 Fix T50888: Numeric overflow in split kernel state buffer size calculation
Overflow led to the state buffer being too small and the split kernel to
get stuck doing nothing forever.
2017-03-11 05:39:28 -05:00
4a2cde3f0e Cycles: Enable SSS and volumes for CUDA and Nvidia OpenCL split kernel 2017-03-10 02:09:41 -05:00
306034790f Cycles: Calculate size of split state buffer kernel side
By calculating the size of the state buffer in the kernel rather than the host
less code is needed and the size actually reflects the requested features.

Will also be a little faster in some cases because of larger global work size.
2017-03-08 01:31:30 -05:00
cd7d5669d1 Cycles: Remove sum_all_radiance kernel
This was only needed for the previous implementation of parallel samples. As
we don't have that any more it can be removed.

Real reason for removal tho is this: `per_sample_output_buffers` was being
calculated too small and artifacts resulted. The tile buffer is already
the correct size and calculating the size for `per_sample_output_buffers`
is a bit difficult with the current layout of the code. As
`per_sample_output_buffers` was only needed for `sum_all_radiance`,
removing that kernel and writing output to the tile buffer directly
fixes the artifacts.
2017-03-08 01:31:07 -05:00
4cf501b835 Cycles: Split path initialization into own kernel
This makes it easier to initialize things correctly in the data_init kernel
before they are needed by path tracing.
2017-03-08 01:30:43 -05:00
817873cc83 Cycles: CUDA implementation of split kernel 2017-03-08 01:24:53 -05:00