Commit Graph

43 Commits

Author SHA1 Message Date
Michael Jones
654e1e901b Cycles: Use local atomics for faster shader sorting (enabled on Metal)
This patch adds two new kernels: SORT_BUCKET_PASS and SORT_WRITE_PASS. These replace PREFIX_SUM and SORTED_PATHS_ARRAY on supported devices (currently implemented on Metal, but will be trivial to enable on the other backends). The new kernels exploit sort partitioning (see D15331) by sorting each partition separately using local atomics. This can give an overall render speedup of 2-3% depending on architecture. As before, we fall back to the original non-partitioned sorting when the shader count is "too high".

Reviewed By: brecht

Differential Revision: https://developer.blender.org/D16909
2023-02-06 11:18:26 +00:00
afc091c3c4 Cleanup: spelling in comments 2022-11-01 12:24:58 +11:00
8dd7b5b26b Cycles: Metal integrator state size tuning
This patch tunes the integrator state sizing for Metal (`num_concurrent_states` and `num_concurrent_busy_states`).

On all GPUs architecture, we adjust the busy:total states ratio to be 1:4 which gives better rendering performance than the previous 1:16 ratio (independent of total state count). This gives a small performance uplift (e.g. 2-3% on M1 Ultra).

Additionally for M2 architectures, we double the overall state size if there is available headroom. Inclusive of the first change, we can expect uplift of close to 10% in future, as this results in larger dispatch sizes and minimises work submission overheads. In order to make an accurate determination of available headroom, we defer the calculation of `num_concurrent_states` and `num_concurrent_busy_states` until the time of integrator state allocation (i.e. after all of the scene data has been allocated). We also refactor `alloc_integrator_soa` to calculate an *exact* single-state-size in a first pass, right before allocating the integrator SoA buffers in a second pass.

Reviewed By: brecht

Differential Revision: https://developer.blender.org/D16313
2022-10-24 17:14:33 +01:00
6a4f4810f3 Fix T100246: Cycles GPU render error when adding AO node during viewport render 2022-08-18 20:04:22 +02:00
523bbf7065 Cycles: generalize shader sorting / locality heuristic to all GPU devices
This was added for Metal, but also gives good results with CUDA and OptiX.
Also enable it for future Apple GPUs instead of only M1 and M2, since this has
been shown to help across multiple GPUs so the better bet seems to enable
rather than disable it.

Also moves some of the logic outside of the Metal device code, and always
enables the code in the kernel since other devices don't do dynamic compile.

Time per sample with OptiX + RTX A6000:
                                         new                  old
barbershop_interior                      0.0730s              0.0727s
bmw27                                    0.0047s              0.0053s
classroom                                0.0428s              0.0464s
fishy_cat                                0.0102s              0.0108s
junkshop                                 0.0366s              0.0395s
koro                                     0.0567s              0.0578s
monster                                  0.0206s              0.0223s
pabellon                                 0.0158s              0.0174s
sponza                                   0.0088s              0.0100s
spring                                   0.1267s              0.1280s
victor                                   0.0524s              0.0531s
wdas_cloud                               0.0817s              0.0816s

Ref D15331, T87836
2022-07-15 13:42:47 +02:00
4b1d315017 Cycles: Improve cache usage on Apple GPUs by chunking active indices
This patch partitions the active indices into chunks prior to sorting by material in order to tradeoff some material coherence for better locality. On Apple Silicon GPUs (particularly higher end M1-family GPUs), we observe overall render time speedups of up to 15%. The partitioning is implemented by repeating the range of `shader_sort_key` for each partition, and encoding a "locator" key which distributes the indices into sorted chunks.

Reviewed By: brecht

Differential Revision: https://developer.blender.org/D15331
2022-07-14 14:26:18 +01:00
ff1883307f Cleanup: renaming and consistency for kernel data
* Rename "texture" to "data array". This has not used textures for a long time,
  there are just global memory arrays now. (On old CUDA GPUs there was a cache
  for textures but not global memory, so we used to put all data in textures.)
* For CUDA and HIP, put globals in KernelParams struct like other devices.
* Drop __ prefix for data array names, no possibility for naming conflict now that
  these are in a struct.
2022-06-20 12:30:48 +02:00
2c1bffa286 Cleanup: add verbose logging category names instead of numbers
And use them more consistently than before.
2022-06-17 14:08:14 +02:00
f2cd7e08fe Fix Cycles MNEE not working for Metal
Move MNEE to own kernel, separate from shader ray-tracing. This does introduce
the limitation that a shader can't use both MNEE and AO/bevel, but that seems
like the better trade-off for now.

We can experiment with bigger kernel organization changes later.

Differential Revision: https://developer.blender.org/D15070
2022-05-31 17:24:43 +02:00
9cfc7967dd Cycles: use SPDX license headers
* Replace license text in headers with SPDX identifiers.
* Remove specific license info from outdated readme.txt, instead leave details
  to the source files.
* Add list of SPDX license identifiers used, and corresponding license texts.
* Update copyright dates while we're at it.

Ref D14069, T95597
2022-02-11 17:47:34 +01:00
ae28d90578 Fix T93350: Cycles renders shows black during rendering huge resolutions
The root of the issue is caused by Cycles ignoring OpenGL limitation on
the maximum resolution of textures: Cycles was allocating texture of the
final render resolution. It was exceeding limitation on certain GPUs and
driver.

The idea is simple: use multiple textures for the display, each of which
will fit into OpenGL limitations.

There is some code which allows the display driver to know when to start
the new tile. Also added some code to allow force graphics interop to be
re-created. The latter one ended up not used in the final version of the
patch, but it might be helpful for other drivers implementation.

The tile size is limited to 8K now as it is the safest size for textures
on many GPUs and OpenGL drivers.

This is an updated fix with a workaround for freezing with the NVIDIA
driver on Linux.

Differential Revision: https://developer.blender.org/D13385
2022-01-07 17:20:04 +01:00
204ae33d75 Revert "Fix T93350: Cycles renders shows black during rendering huge resolutions"
This reverts commit 5e37f70307.

It is leading to freezing of the entire desktop for a few seconds when stopping
3D viewport rendering on my Linux / NVIDIA system.
2021-12-07 20:49:34 +01:00
Sergey Sharybin
5e37f70307 Fix T93350: Cycles renders shows black during rendering huge resolutions
The root of the issue is caused by Cycles ignoring OpenGL limitation on
the maximum resolution of textures: Cycles was allocating texture of the
final render resolution. It was exceeding limitation on certain GPUs and
driver.

The idea is simple: use multiple textures for the display, each of which
will fit into OpenGL limitations.

There is some code which allows the display driver to know when to start
the new tile. Also added some code to allow force graphics interop to be
re-created. The latter one ended up not used in the final version of the
patch, but it might be helpful for other drivers implementation.

The tile size is limited to 8K now as it is the safest size for textures
on many GPUs and OpenGL drivers.

Differential Revision: https://developer.blender.org/D13385
2021-12-07 19:01:42 +01:00
2e6f914e37 Fix debug build error after recent Cycles kernel argument changes 2021-11-29 18:41:37 +01:00
98a5c924fc Cycles: Metal readiness: Specify DeviceQueue::enqueue arg types
This patch adds new arg-type parameters to `DeviceQueue::enqueue` and its overrides. This is in preparation for the Metal backend which needs this information for correct argument encoding.

Ref T92212

Reviewed By: brecht

Maniphest Tasks: T92212

Differential Revision: https://developer.blender.org/D13357
2021-11-29 14:56:06 +00:00
William Leeson
c49d2cbe92 Merge branch 'blender-v3.0-release' to bring in D13042:
Fix performance decrease with Scrambling Distance on
2021-11-25 09:41:03 +01:00
Alaska
b41c72b710 Fix performance decrease with Scrambling Distance on
With the current code in master, scrambling distance is enabled on non-hardware accelerated ray tracing devices see a measurable performance decrease when compared scrambling distance on vs off. From testing, this performance decrease comes from the large tile sizes scheduled in `tile.cpp`.

This patch attempts to address the performance decrease by using different algorithms to calculate the tile size for devices with hardware accelerated ray traversal and devices without. Large tile sizes for hardware accelerated devices and small tile sizes for others.

Most of this code is based on proposals from @brecht and @leesonw

Reviewed By: brecht, leesonw

Differential Revision: https://developer.blender.org/D13042
2021-11-25 09:32:26 +01:00
ce395c84a3 Merge branch 'blender-v3.0-release' 2021-11-11 15:29:35 +01:00
d26d3cfe19 Fix T92868: Cycles catcher with transparency crashes
The issue was caused by splitting happening twice.

Fixed by checking for split flag which is assigned to the both states
during split.

The tricky part was to write catcher data at the moment of split: the
transparency and shadow catcher sample count is to be accumulated at
that point. Now it is happening in the `intersect_closest` kernel.
The downside is that render buffer is to be passed to the kernel, but
the benefit is that extra split bounce check is not needed now.

Had to move the passes write to shadow catcher header, since include
of `film/passes.h` causes all the fun of requirement to have BSDF
data structures available.

Differential Revision: https://developer.blender.org/D13177
2021-11-11 15:21:35 +01:00
Andrii
c63e735f6b Cycles: Add sample offset option
This patch exposes the sampling offset option to Blender. It is located in the "Sampling > Advanced" panel.
For example, this can be useful to parallelize rendering and distribute different chunks of samples for each computer to render.

---

I also had to add this option to `RenderWork` and `RenderScheduler` classes so that the sample count in the status string can be calculated correctly.

Reviewed By: leesonw

Differential Revision: https://developer.blender.org/D13086
2021-11-11 09:39:25 +01:00
806521f703 Fix T92671: confusing Cycles debug logs about CPU architecture
Instead of printing debug flags listing various CPU and GPU settings that
may or may not be used, print when we are using them. This include CPU
kernel types, OptiX debugging and CUDA and HIP adaptive compilation. BVH
type was already printed.
2021-11-01 08:36:50 +01:00
William Leeson
366262bef5 Distance Scrambling for for Cycles X - Sobol version
Cycles:Distance Scrambling for Cycles Sobol Sampler

This option implements micro jittering an is based on the INRIA
research paper [[ https://hal.inria.fr/hal-01325702/document | on micro jittering ]]
and work by Lukas Stockner for implementing the scrambling distance.
It works by controlling the correlation between pixels by either using
a user supplied value or an adaptive algorithm to limit the maximum
deviation of the sample values between pixels.

This is a follow up of https://developer.blender.org/D12316

The PMJ version can be found here: https://developer.blender.org/D12511

Reviewed By: leesonw

Differential Revision: https://developer.blender.org/D12318
2021-10-26 16:11:27 +02:00
fd25e883e2 Cycles: remove prefix from source code file names
Remove prefix of filenames that is the same as the folder name. This used
to help when #includes were using individual files, but now they are always
relative to the cycles root directory and so the prefixes are redundant.

For patches and branches, git merge and rebase should be able to detect the
renames and move over code to the right file.
2021-10-26 15:37:04 +02:00
d7d40745fa Cycles: changes to source code folders structure
* Split render/ into scene/ and session/. The scene/ folder now contains the
  scene and its nodes. The session/ folder contains the render session and
  associated data structures like drivers and render buffers.
* Move top level kernel headers into new folders kernel/camera/, kernel/film/,
  kernel/light/, kernel/sample/, kernel/util/
* Move integrator related kernel headers into kernel/integrator/
* Move OSL shaders from kernel/shaders/ to kernel/osl/shaders/

For patches and branches, git merge and rebase should be able to detect the
renames and move over code to the right file.
2021-10-26 15:36:39 +02:00
b698fe1e04 Cleanup: compiler warnings 2021-10-26 15:35:31 +02:00
75704091fc Cycles: add additive AO support through Fast GI settings
Add a Fast GI Method, either Replace for the existing behavior, or Add
to add ambient occlusion like the old world settings.

This replaces the old Ambient Occlusion settings in the world properties.
2021-10-26 14:56:43 +02:00
Brecht Van Lommel
df00463764 Cycles: add shadow path compaction for GPU rendering
Similar to main path compaction that happens before adding work tiles, this
compacts shadow paths before launching kernels that may add shadow paths.

Only do it when more than 50% of space is wasted.

It's not a clear win in all scenes, some are up to 1.5% slower. Likely caused
by different order of scheduling kernels having an unpredictable performance
impact. Still feels like compaction is just the right thing to avoid cases
where a few shadow paths can hold up a lot of main paths.

Differential Revision: https://developer.blender.org/D12944
2021-10-21 15:38:03 +02:00
0c52eed863 Cycles: more accurately count main paths for adding work tiles
Easy now thanks to the main and shadow path decoupling. Doesn't help
in an benchmark scene except Spring, where it reduces render time by
maybe 2-3%.

Ref T87836
2021-10-20 17:50:31 +02:00
52c5300214 Cleanup: some renaming to better distinguish main and shadow paths 2021-10-20 17:50:31 +02:00
cccfa597ba Cycles: make ambient occlusion pass take into account transparency again
Taking advantage of the new decoupled main and shadow paths. For CPU we
just store two nested structs in the integrator state, one for direct light
shadows and one for AO. For the GPU we restrict the number of shade surface
states to be executed based on available space in the shadow paths queue.

This also helps improve performance in benchmark scenes with an AO pass,
since it is no longer needed to use the shader raytracing kernel there,
which has worse performance.

Differential Revision: https://developer.blender.org/D12900
2021-10-20 17:50:31 +02:00
Brecht Van Lommel
943e73b07e Cycles: decouple shadow paths from main path on GPU
The motivation for this is twofold. It improves performance (5-10% on most
benchmark scenes), and will help  to bring back transparency support for the
ambient occlusion pass.

* Duplicate some members from the main path state in the shadow path state.
* Add shadow paths incrementally to the array similar to what we do for
  the shadow catchers.
* For the scheduling, allow running shade surface and shade volume kernels
  as long as there is enough space in the shadow paths array. If not, execute
  shadow kernels until it is empty.

* Add IntegratorShadowState and ConstIntegratorShadowState typedefs that
  can be different between CPU and GPU. For GPU both main and shadow paths
  juse have an integer for SoA access. Bt with CPU it's a different pointer
  type so we get type safety checks in code shared between CPU and GPU.
* For CPU, add a separate IntegratorShadowStateCPU struct embedded in
  IntegratorShadowState.
* Update various functions to take the shadow state, and make SVM take either
  type of state using templates.

Differential Revision: https://developer.blender.org/D12889
2021-10-19 15:09:29 +02:00
c5a13ffcb4 Cleanup: spelling in comments 2021-10-18 12:13:10 +11:00
aa46459543 Fix shadow catcher behind transparent object on GPU
The assumption about absent shadow path was wrong.

The rest of the changes are to ensure shadow paths are finished prior
to the split, so that they write to the proper passes.

The issue was caught by running regression tests on OptiX.

Differential Revision: https://developer.blender.org/D12857
2021-10-14 09:39:38 +02:00
cc04399937 Fix missing Cycles volume stack re-allocation
Need to check allocation size, as the features do not change with
volume stack depth detection.
2021-10-12 11:55:23 +02:00
719c319055 Fix Cycles long start on scene without volumes
The state template iteration had difficult time dealing with 0-sized
arrays, causing iteration for until integer overflows.
2021-10-07 15:54:56 +02:00
c6275da852 Fix T91922: Cycles artifacts with high volume nested level
Make volume stack allocated conditionally, potentially based on the
actual nested level of objects in the scene.

Currently the nested level is estimated by number of volume objects.
This is a non-expensive check which is probably enough in practice
to get almost perfect memory usage and performance.

The conditional allocation is a bit tricky.

For the CPU we declare and define maximum possible volume stack,
because there are only that many integrator states on the CPU.

On the GPU we declare outer SoA to have all volume stack elements,
but only allocate actually needed ones. The actually used volume
stack size is passed as a pre-processor, which seems to be easiest
and fastest for the GPU state copy.

There seems to be no speed regression in the demo files on RTX6000.

Note that scenes with high nested level of volume will now be slower
but correct.

Differential Revision: https://developer.blender.org/D12759
2021-10-06 15:46:32 +02:00
6e268a749f Fix adaptive sampling artifacts on tile boundaries
Implement an overscan support for tiles, so that adaptive sampling can
rely on the pixels neighbourhood.

Differential Revision: https://developer.blender.org/D12599
2021-10-05 16:19:14 +02:00
9a0850c8c2 Cycles: Fix wrong GPU state calculation
Currently was only used for logging, but better to fix the size so
that it matches reality.

The issue was caused by decoupling number of shadow intersections
and using much higher number for CPU. This caused the total state
on GPU to be logged as 10s of gigabytes instead of 100s of megabytes.

Differential Revision: https://developer.blender.org/D12755
2021-10-05 16:09:31 +02:00
a754e35198 Cycles: refactor API for GPU display
* Split GPUDisplay into two classes. PathTraceDisplay to implement the Cycles side,
  and DisplayDriver to implement the host application side. The DisplayDriver is now
  a fully abstract base class, embedded in the PathTraceDisplay.
* Move copy_pixels_to_texture implementation out of the host side into the Cycles side,
  since it can be implemented in terms of the texture buffer mapping.
* Move definition of DeviceGraphicsInteropDestination into display driver header, so
  that we do not need to expose private device headers in the public API.
* Add more detailed comments about how the DisplayDriver should be implemented.

The "driver" terminology might not be obvious, but is also used in other renderers.

Differential Revision: https://developer.blender.org/D12626
2021-09-30 20:48:08 +02:00
4d4113adc2 Cycles: record large number of transparent shadow intersections on CPU
So we can do fewer intersection calls, only on the GPU do we need to save
memory and do this in small steps.

Ref T87836
2021-09-29 16:37:32 +02:00
731325a022 Cycles: Make sure GPU transfer is finished prior display update
Noticed while looking into flickering issues in viewport.

Doesn't seem to solve the flicker issue for me, but is something
what is supposed to be happening anyway.

Differential Revision: https://developer.blender.org/D12673
2021-09-29 14:05:51 +02:00
4d66cbd140 Cleanup: spelling in comments 2021-09-22 14:54:01 +10:00
0803119725 Cycles: merge of cycles-x branch, a major update to the renderer
This includes much improved GPU rendering performance, viewport interactivity,
new shadow catcher, revamped sampling settings, subsurface scattering anisotropy,
new GPU volume sampling, improved PMJ sampling pattern, and more.

Some features have also been removed or changed, breaking backwards compatibility.
Including the removal of the OpenCL backend, for which alternatives are under
development.

Release notes and code docs:
https://wiki.blender.org/wiki/Reference/Release_Notes/3.0/Cycles
https://wiki.blender.org/wiki/Source/Render/Cycles

Credits:
* Sergey Sharybin
* Brecht Van Lommel
* Patrick Mours (OptiX backend)
* Christophe Hery (subsurface scattering anisotropy)
* William Leeson (PMJ sampling pattern)
* Alaska (various fixes and tweaks)
* Thomas Dinges (various fixes)

For the full commit history, see the cycles-x branch. This squashes together
all the changes since intermediate changes would often fail building or tests.

Ref T87839, T87837, T87836
Fixes T90734, T89353, T80267, T80267, T77185, T69800
2021-09-21 14:55:54 +02:00