Commit Graph

295 Commits

Author SHA1 Message Date
9c91c75ea6 Merge branch 'master' into blender2.8 2018-01-11 13:24:41 +11:00
d0892a6648 Fix issue with moving CUDA memory to host and multiple devices.
This is not expected to fix all issues. Also adds some more details
to error reporting to investigate failures.
2018-01-11 00:00:48 +01:00
be40389165 Merge branch 'master' into blender2.8 2018-01-03 23:44:47 +11:00
c621832d3d Cycles: CUDA support for rendering scenes that don't fit on GPU.
In that case it can now fall back to CPU memory, at the cost of reduced
performance. For scenes that fit in GPU memory, this commit should not
cause any noticeable slowdowns.

We don't use all physical system RAM, since that can cause OS instability.
We leave at least half of system RAM or 4GB to other software, whichever
is smaller.

For image textures in host memory, performance was maybe 20-30% slower
in our tests (although this is highly hardware and scene dependent). Once
other type of data doesn't fit on the GPU, performance can be e.g. 10x
slower, and at that point it's probably better to just render on the CPU.

Differential Revision: https://developer.blender.org/D2056
2018-01-02 23:50:18 +01:00
6699454fb6 Cycles: make CUDA code a bit more robust to host/device alloc failures.
Fixes a few corner cases found while stress testing host mapped memory.
2018-01-02 23:46:19 +01:00
9f0d067c2e Merge branch 'master' into blender2.8 2017-12-21 11:17:34 +01:00
5650fe77e4 Cycles: Cleanup, indentation 2017-12-20 17:42:50 +01:00
03a5eccc94 Merge branch 'master' into blender2.8 2017-11-30 18:30:41 +11:00
fa3d50af95 Cycles: Improve denoising speed on GPUs with small tile sizes
Previously, the NLM kernels would be launched once per offset with one thread per pixel.
However, with the smaller tile sizes that are now feasible, there wasn't enough work to fully occupy GPUs which results in a significant slowdown.

Therefore, the kernels are now launched in a single call that handles all offsets at once.
This has two downsides: Memory accesses to accumulating buffers are now atomic, and more importantly, the temporary memory now has to be allocated for every shift at once, increasing the required memory.
On the other hand, of course, the smaller tiles significantly reduce the size of the memory.

The main bottleneck right now is the construction of the transformation - there is nothing to be parallelized there, one thread per pixel is the maximum.
I tried to parallelize the SVD implementation by storing the matrix in shared memory and launching one block per pixel, but that wasn't really going anywhere.

To make the new code somewhat readable, the handling of rectangular regions was cleaned up a bit and commented, it should be easier to understand what's going on now.
Also, some variables have been renamed to make the difference between buffer width and stride more apparent, in addition to some general style cleanup.
2017-11-30 07:37:08 +01:00
d992240bfa Fix unneeded legacy OpenGL call in Cycles viewport drawing. 2017-11-24 00:12:48 +01:00
Julian Eisel
7f96323cd0 Merge branch 'master' into blender2.8 2017-11-19 13:16:14 +01:00
40f528a7da Cycles: Add per-tile render time debug pass
Reviewers: sergey, brecht

Differential Revision: https://developer.blender.org/D2920
2017-11-17 16:40:24 +01:00
Dalai Felinto
1cb6cea71c Merge remote-tracking branch 'origin/master' into blender2.8 2017-11-13 11:48:48 -02:00
e568c1a975 Fix T53289: CUDA missing textures not showing pink, after recent changes. 2017-11-12 20:45:47 +01:00
7a6ad2901c Merge branch 'master' into blender2.8 2017-11-10 10:13:19 +01:00
bd4bea3e98 Cycles: avoid reallocating tile denoising memory many times during render. 2017-11-09 20:28:00 +01:00
c99481b632 Merge branch 'master' into blender2.8 2017-11-09 10:59:15 +01:00
087331c495 Cycles: Replace __MAX_CLOSURE__ build option with runtime integrator variable
Goal is to reduce OpenCL kernel recompilations.

Currently viewport renders are still set to use 64 closures as this seems to
be faster and we don't want to cause a performance regression there. Needs
to be investigated.

Reviewed By: brecht

Differential Revision: https://developer.blender.org/D2775
2017-11-09 01:04:06 -05:00
7b1d707481 Merge branch 'master' into blender2.8 2017-11-08 00:20:59 +01:00
ff34e48911 Cycles: add an extra CUDA synchronize before rendering.
It should not be needed as far as I know, but just in case it fixes any
of the recent issues like T52572.
2017-11-07 22:35:12 +01:00
91af8f2ae2 Merge branch 'master' into blender2.8
Conflicts:
	intern/cycles/device/device.cpp
	source/blender/blenkernel/intern/library.c
	source/blender/blenkernel/intern/material.c
	source/blender/editors/object/object_add.c
	source/blender/editors/object/object_relations.c
	source/blender/editors/space_outliner/outliner_draw.c
	source/blender/editors/space_outliner/outliner_edit.c
	source/blender/editors/space_view3d/drawobject.c
	source/blender/editors/util/ed_util.c
	source/blender/windowmanager/intern/wm_files_link.c
2017-11-06 18:02:46 +01:00
5801ef71e4 Code refactor: device memory cleanups, preparing for mapped host memory. 2017-11-05 15:22:04 +01:00
5475314f49 Cycles: reserve CUDA local memory ahead of time.
This way we can log the amount of memory used, and it will be important
for host mapped memory support.
2017-11-05 15:22:04 +01:00
d4fe083b35 Merge branch 'master' into blender2.8 2017-11-04 21:45:52 +11:00
33b5e8daff Code refactor: replace CUDA array with linear memory for 1D and 2D textures.
This is a prequisite for getting host memory allocation to work. There appears
to be no support for 3D textures using host memory. The original version of
this code was written by Stefan Werner for D2056.
2017-11-04 02:23:00 +01:00
6ec599c682 Fix T53247: mixed CPU + GPU render wrong texture limits. 2017-11-03 20:32:29 +01:00
f5456df095 Merge branch 'master' into blender2.8 2017-10-24 02:05:41 +02:00
070a668d04 Code refactor: move more memory allocation logic into device API.
* Remove tex_* and pixels_* functions, replace by mem_*.
* Add MEM_TEXTURE and MEM_PIXELS as memory types recognized by devices.
* No longer create device_memory and call mem_* directly, always go
  through device_only_memory, device_vector and device_pixels.
2017-10-24 01:25:19 +02:00
aa8b4c5d81 Code refactor: use device_only_memory and device_vector in more places. 2017-10-24 01:25:13 +02:00
7ad9333fad Code refactor: store device/interp/extension/type in each device_memory. 2017-10-24 01:03:59 +02:00
Julian Eisel
147f9585db Merge branch 'master' into blender2.8 2017-10-23 00:04:20 +02:00
57a0cb797d Code refactor: avoid some unnecessary device memory copying. 2017-10-21 20:58:28 +02:00
0f8a57de68 Merge branch 'master' into blender2.8 2017-10-19 13:58:01 +02:00
910dd7fb1b Cycles: Add extra logging in CUDA device detection code 2017-10-19 11:26:10 +02:00
dc95c79971 Merge branch 'master' into blender2.8 2017-10-11 13:14:16 +05:00
6ec43a765b Merge branch 'master' into blender2.8 2017-10-10 01:36:36 +11:00
e360d003ea Cycles: schedule more work for non-display and compute preemption CUDA cards.
This change affects CUDA GPUs not connected to a display or connected to a
display but supporting compute preemption so that the display does not
freeze. I couldn't find an official list, but compute preemption seems to be
only supported with GTX 1070+ and Linux (not GTX 1060- or Windows).

This helps improve small tile rendering performance further if there are
sufficient samples x number of pixels in a single tile to keep the GPU busy.
2017-10-08 21:12:16 +02:00
cdb0b3b1dc Code refactor: use DeviceInfo to enable QBVH and decoupled volume shading. 2017-10-08 13:17:33 +02:00
23098cda99 Code refactor: make texture code more consistent between devices.
* Use common TextureInfo struct for all devices, except CUDA fermi.
* Move image sampling code to kernels/*/kernel_*_image.h files.
* Use arrays for data textures on Fermi too, so device_vector<Struct> works.
2017-10-07 14:53:14 +02:00
ea606a7847 Merge branch 'master' into blender28 2017-10-06 21:25:33 +11:00
fb99ea79f8 Code refactor: split displace/background into separate kernels, remove luma. 2017-10-05 17:57:58 +02:00
49199963bf Fix incorrect CUDA remaining time estimate after previous commit. 2017-10-04 23:25:51 +02:00
6da6f8d33f Cycles: CUDA faster rendering of small tiles, using multiple samples like OpenCL.
The work size is still very conservative, and this doesn't help for progressive
refine. For that we will need to render multiple tiles at the same time. But this
should already help for denoising renders that require too much memory with big
tiles, and just generally soften the performance dropoff with small tiles.

Differential Revision: https://developer.blender.org/D2856
2017-10-04 21:58:47 +02:00
12f4538205 Code refactor: use split variance calculation for mega kernels too.
There is no significant difference in denoised benchmark scenes and
denoising ctests, so might as well make it all consistent.
2017-10-04 21:11:14 +02:00
e3e16cecc4 Code refactor: remove rng_state buffer and compute hash on the fly.
A little faster on some benchmark scenes, a little slower on others, seems
about performance neutral on average and saves a little memory.
2017-10-04 21:11:14 +02:00
5b7d6ea54b Code refactor: add WorkTile struct for passing work to kernel.
This makes sharing some code between mega/split in following commits a bit
easier, and also paves the way for rendering multiple tiles later.
2017-10-04 21:11:14 +02:00
cc8c064f11 Merge branch 'master' into blender2.8 2017-09-28 03:05:46 +10:00
88520dd5b6 Code refactor: simplify CUDA context push/pop.
Makes it possible to call a function like mem_alloc() when the context is
already active. Also fixes some missing pops in case of errors.
2017-09-27 13:43:21 +02:00
3e555d3d78 Merge branch 'master' into blender2.8 2017-08-21 15:41:03 +10:00
43a6cf1504 Cycles: attempt to recover from crashing CUDA/OpenCL drivers on Windows.
I don't know if this will actually work, needs testing. Ref T52064.
2017-08-20 23:18:25 +02:00