We should actually be using CL_DEVICE_MEM_BASE_ADDR_ALIGN for sub buffers,
previous change in this code was incorrect. Renamed the function now to
make the specific purpose of this alignment clear, it's not required for
data types in general.
Previously, the NLM kernels would be launched once per offset with one thread per pixel.
However, with the smaller tile sizes that are now feasible, there wasn't enough work to fully occupy GPUs which results in a significant slowdown.
Therefore, the kernels are now launched in a single call that handles all offsets at once.
This has two downsides: Memory accesses to accumulating buffers are now atomic, and more importantly, the temporary memory now has to be allocated for every shift at once, increasing the required memory.
On the other hand, of course, the smaller tiles significantly reduce the size of the memory.
The main bottleneck right now is the construction of the transformation - there is nothing to be parallelized there, one thread per pixel is the maximum.
I tried to parallelize the SVD implementation by storing the matrix in shared memory and launching one block per pixel, but that wasn't really going anywhere.
To make the new code somewhat readable, the handling of rectangular regions was cleaned up a bit and commented, it should be easier to understand what's going on now.
Also, some variables have been renamed to make the difference between buffer width and stride more apparent, in addition to some general style cleanup.
Some drivers may report very large allocation sizes, which could cause
unnecessary memory usage. This is now limited to 2gb which should
still be enough to get the needed performance benefits without waste.
* Remove tex_* and pixels_* functions, replace by mem_*.
* Add MEM_TEXTURE and MEM_PIXELS as memory types recognized by devices.
* No longer create device_memory and call mem_* directly, always go
through device_only_memory, device_vector and device_pixels.
CPU rendering will be restricted to a BVH2, which is not ideal for raytracing
performance but can be shared with the GPU. Decoupled volume shading will be
disabled to match GPU volume sampling.
The number of CPU rendering threads is reduced to leave one core dedicated to
each GPU. Viewport rendering will also only use GPU rendering still. So along
with the BVH2 usage, perfect scaling should not be expected.
Go to User Preferences > System to enable the CPU to render alongside the GPU.
Differential Revision: https://developer.blender.org/D2873
* Use common TextureInfo struct for all devices, except CUDA fermi.
* Move image sampling code to kernels/*/kernel_*_image.h files.
* Use arrays for data textures on Fermi too, so device_vector<Struct> works.
This is a bit confusing, especially when one mixes OpenCL code where ulong equals
to uint64_t with CPU side code where ulong is expected to be something else from
the naming.
This commit makes it so we use explicit name, common on all platforms.
Problem was that some code checks to see if device_pointer is null or
not and the new allocator wasn't even setting the pointer to anything
as it tracks memory location separately. Setting the pointer to non
null keeps all users of device_pointer happy.
- Apparently MSVC does not support compound literals
in C++ (at least by the looks of it).
- Not sure how opencl_device_assert was managing to
set protected property of the Device class.
Common folks, nobody considered master a C++11 only branch. Such decision is to
be done officially and will involve changes in quite a few infrastructure related
areas.
Image textures were being packed into a single buffer for OpenCL, which
limited the amount of memory available for images to the size of one
buffer (usually 4gb on AMD hardware). By packing textures into multiple
buffers that limit is removed, while simultaneously reducing the number
of buffers that need to be passed to each kernel.
Benchmarks were within 2%.
Fixes T51554.
Differential Revision: https://developer.blender.org/D2745
This is something which was reported to work fine by Mai, Benjamin and
confirmed by myself. Disabling this workaround gains us some speedup:
Before Now
bmw27 04:28.42 04:07.79
classroom 09:26.48 08:54.53
fishy_cat 08:44.01 08:18.70
koro 09:17.98 08:57.18
pavillon_barcelone 12:26.64 11:52.81
Test environment is:
- Ubuntu 16.04, with all updates installed
- AMD RX 480 GPU
- amdgpu pro driver version 17.10-450821
Some of the functions might have been inlined, but others i don't see
how that was possible (don't think virtual functions can be inlined here).
In any case, better be explicitly optimal in the code.
Technically not passing all buffers used by a kernel is undefined
behavior. We haven't had any issues with this so far on AMD or
Nvidia, but it's known to be a problem with Intel and we received
a report from AMD that this is a problem on newer hardware, so we
need to make this change at some point.
Unfortunately there a cost to being correct, about 5% for the
benchmark scenes. For low sample counts it's even worse, I've
seen up to 50% slowdown. For the latter case I think adjusting
tile updating logic can help, but not sure what that would look
like yet (it would be just a few lines change however).
Due to various driver issues with AMD GCN 1 cards we can no longer support
these GPUs. This patch makes them unavailable to select for Cycles rendering.
GCN cards 2 and higher are still supported. Please use the most recent
drivers available to ensure proper functionality.
See here for a list to check which GPUs are supported:
https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units
The previous outlier heuristic only checked whether the pixel is more than
twice as bright compared to the 75% quantile of the 5x5 neighborhood.
While this detected fireflies robustly, it also incorrectly marked a lot of
legitimate small highlights as outliers and filtered them away.
This commit adds an additional condition for marking a pixel as a firefly:
In addition to being above the reference brightness, the lower end of the
3-sigma confidence interval has to be below it.
Since the lower end approximates how low the true value of the pixel might be,
this test separates pixels that are supposed to be very bright from pixels that
are very bright due to random fireflies.
Also, since there is now a reliable outlier filter as a preprocessing step,
the additional confidence interval test in the reconstruction kernel is no
longer needed.
- Some arguments were inapproriatry tagged as unused
using (void)foo semantic.
Only use such semantic in tricky casses, when something
needs to be ignored in release builds or something is
dependent on tricky ifndef policy.
For rest of the cases just use void foo(int /bar*/)
semantic, which ensures variable is not used. Solves
confusion and code running out of sync with later
development.
- Used proper unused semantic to some arguments.
- Added braces to make code easier to follow, tricky
indentation with ifdef, uh.
Extremely bright pixels in the rendered image cause the denoising algorithm
to produce extremely noticable artifacts. Therefore, a heuristic is needed
to exclude these pixels from the filtering process.
The new approach calculates the 75% percentile of the 5x5 neighborhood of
each pixel and flags the pixel if it is more than twice as bright.
During the reconstruction process, flagged pixels are skipped. Therefore,
they don't cause any problems for neighboring pixels, and the outlier pixels
themselves are replaced by a prediction of their actual value based on their
feature pass values and the neighboring pixels.
Therefore, the denoiser now also works as a smarter despeckling filter that
uses a more accurate prediction of the pixel instead of a simple average.
This can be used even if denoising isn't wanted by setting the denoising
radius to 1.
This commit contains the first part of the new Cycles denoising option,
which filters the resulting image using information gathered during rendering
to get rid of noise while preserving visual features as well as possible.
To use the option, enable it in the render layer options. The default settings
fit a wide range of scenes, but the user can tweak individual settings to
control the tradeoff between a noise-free image, image details, and calculation
time.
Note that the denoiser may still change in the future and that some features
are not implemented yet. The most important missing feature is animation
denoising, which uses information from multiple frames at once to produce a
flicker-free and smoother result. These features will be added in the future.
Finally, thanks to all the people who supported this project:
- Google (through the GSoC) and Theory Studios for sponsoring the development
- The authors of the papers I used for implementing the denoiser (more details
on them will be included in the technical docs)
- The other Cycles devs for feedback on the code, especially Sergey for
mentoring the GSoC project and Brecht for the code review!
- And of course the users who helped with testing, reported bugs and things
that could and/or should work better!
Using -cl-fast-relaxed-math assumes no NaN/Inf values in any expression.
This causes problems on overflow, division by zero, square root of negative number.
Comparisons with NaN or infinite value are affected as well.
This patch causes <2% slowdown on benchmark scenes.
Fix T50985: Rendering volume scatter with GPU OpenCL comes to an halt after a few seconds