OpenImageDenoise GPU acceleration #115045
Reference: blender/blender#115045
Tasks for 4.1
- d16d2bbd3a
- fd8bb41224
- 31d55e87f9
- bc886857f3
- Use `cuDevicePrimaryCtxRetain` for improved CUDA and HIP performance

Future Tasks
- Denoising device selection in preferences (not needed, just using the same devices seems OK)
- Add option to force CPU denoising in preferences (not needed, already on the scene) (#117734)

When to use the GPU
We want to use the GPU as much as possible for performance, but there are two situations where we do not.

1. When using a desktop or render farm designed for CPU-only usage. This should be controlled by the preferences: if no GPU is selected for rendering, denoising simply would not use a GPU either. It's unclear whether it's worth adding an option to use the GPU for denoising only; keeping the preferences simple seems preferable.
2. When there is not sufficient GPU memory available, we want to use CPU denoising instead. This varies from scene to scene. Ideally we can catch OIDN out-of-memory errors and automatically fall back to CPU denoising. Still, we may also want an option to manually force it to use the CPU. This option could be placed in the "Performance > Memory" panel along with the tiling settings.
Before I start working on CUDA support, I wanted to check with regards to licensing. OIDN and one of its dependencies, cutlass (https://github.com/NVIDIA/cutlass), both are written against the CUDA runtime API, not the driver API (cuew). That requires linking against one of the binaries included in the CUDA SDK. Would the Blender Foundation consider the libraries in the CUDA SDK as incompatible with the GPL or are they viewed as "system libraries"?
I will check, it's not obvious to me. Ideally we could somehow use the driver API.
I agree that ideally we should use the driver API but it's not really up to us. CUTLASS uses the runtime, so we would have to fork it and modify it somehow, which could be a lot of extra work.
What about the HIP runtime? That is shipped with the AMD drivers. Is it OK to link against that DLL/SO statically?
Does the fact that OIDN loads its CUDA and HIP backends with dlopen() make any difference from a licensing perspective?
The potential issue is with distributing GPL and proprietary cudart binaries in a single package. Not so much static vs. dynamic library linking.
So this means that HIP support should be fine since it doesn't require shipping any runtime/non-free dependency?
If the required HIP runtime is shipped with the AMD drivers, it's fine. But to be clear, I would not call that linking "statically", but rather dynamic linking without a manual `dlopen()`.

A possibly more precise term would be "implicit linking", but there seems to be little consensus on this, and `dlopen()` is platform-specific. The bottom line is that it seems we could move forward with HIP support. Please let us know whether distributing CUDART would be acceptable.
What's the deadline for enabling HIP, CUDA (if possible) and Metal support (will ship soon in OIDN 2.2) in Blender 4.1? Do these need to be added in Bcon1 or Bcon2 is fine too?
Bcon1 is preferable, but Bcon2 is possible if needed.
Also, great to hear Metal is coming!
For the cudart license, it's not looking good. We most likely won't be able to distribute it.
What could the alternative be?
Not seeing any straightforward solutions so far. CC @pmoursnv.
A minimal open source CUDART shim seems like the most promising alternative to me. There are only a few functions that need to be implemented for OIDN, and this doesn't seem too complicated. I would much prefer this over modifying CUTLASS and maintaining a fork of it or switching to some less performant library. I'll look into this approach in more detail to see whether it's indeed a viable solution.
I managed to implement a minimal CUDART on top of the driver API, and so far it works great. We could replace the proprietary CUDART with this shim starting with the next OIDN release. So this means CUDA support could be also enabled in Blender 4.1, right?
There's a minor issue in Cycles though, regarding how a CUDA context is created:
fd3629b80a/intern/cycles/device/cuda/device_impl.cpp (L107)
It's not typically recommended to create custom CUDA contexts this way because other CUDA components like OIDN will use a separate context for the same device, causing performance issues. The CUDA documentation recommends to use the primary context instead, using cuDevicePrimaryCtxRetain. This way both Cycles and OIDN would use the primary context. Could you switch to this?
@aafra many thanks for solving CUDART problem.
I will look into using `cuDevicePrimaryCtxRetain`. There can be multiple render instances running at the same time, like a 3D viewport render and a small material preview render. So I will need to check what the effect on that is, and whether we can do everything safely in one primary context.

@aafra @Stefan_Werner it was pointed out that the GTX 1650 and 1660 cards have compute capability 7.5 but no tensor cores. As far as I can tell, OIDN only checks compute capability. So I'm wondering if that is correct, or if perhaps these GPUs should be blacklisted?
https://devtalk.blender.org/t/2024-01-23-render-cycles-meeting/33059/4
I'm aware of these GPUs, and I tried to look into this a while ago but I couldn't find any mention of checking for tensor core support in the CUDA documentation. I also couldn't find any mention in the docs that some CC 7.5 or later GPUs may not support tensor instructions. We're using tensor cores through CUTLASS, and CUTLASS reports that it's supported on Volta and later, so it's not really up to OIDN to properly handle these special cases. We claim support for whatever CUTLASS claims to support. I didn't find any special checks in CUTLASS either. So there seems to be some disconnect between marketing specs and CUDA docs. My best guess is that maybe these cards do support tensor instructions (i.e. do not crash, etc.) but are emulated. This seems to be confirmed by the following comment from an NVIDIA employee: https://stackoverflow.com/questions/66621753/how-am-i-able-to-run-tensor-core-instructions-without-actually-having-tensor-cor
I have a GTX 1660 to test on, I'll test it when we actually do the lib update.
Thanks both, sounds good.
If it fails, we can always blacklist it on the Cycles side too.
@LazyDodo There's an easy way to test whether OIDN works on the GTX 1660. Could you please download the latest OIDN binaries (https://github.com/OpenImageDenoise/oidn/releases/tag/v2.1.0) and run `oidnBenchmark -d cuda -v 1`?

Thanks! It looks good except for that one failed test, but that's an out-of-memory error (probably because the GPU has less memory than what we typically use for testing), so I think it's fine.
OIDN 2.2 release candidate is out:
9db3b38d50
This includes support for Meteor Lake, Metal, and ARM64 on Windows/Linux too, and it switched to the CUDA driver API. Please let me know if you encounter any issues.
These two tasks have been marked as completed. But I can't find commits for either of them. And looking at the code, the OptiX denoiser is still being used by default. I haven't checked on the automatic fallback. Am I missing something here?
Indeed, they should not have been marked resolved yet.
I will take care of the NVIDIA default denoiser.
Probably too late for automatic fallback to CPU, though at least the code is trying to reduce GPU memory usage when allocation fails.
@aafra hey, just wanted to let you know that I've tested OIDN (GPU) on my GTX 1660 (CUDA) and it works great. It's even a little faster than OptiX while providing much better quality (especially in scenes with a lot of emissive materials). Viewport performance is also great: the same as OptiX, or even better in some scenes.
PS. To clarify: In the viewport I set passes to 'Albedo and Normal', prefilter to 'Fast' or 'None' and start sample to 1.
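For reference, those viewport settings can also be applied via Blender's Python API. This is a settings fragment to run inside Blender, not standalone; the property names below are assumed from the Blender 4.x Python API and should be verified against your build.

```python
import bpy  # only available inside Blender

cycles = bpy.context.scene.cycles
cycles.use_preview_denoising = True
cycles.preview_denoiser = 'OPENIMAGEDENOISE'
cycles.preview_denoising_input_passes = 'RGB_ALBEDO_NORMAL'  # "Albedo and Normal"
cycles.preview_denoising_prefilter = 'FAST'  # or 'NONE'
cycles.preview_denoising_start_sample = 1
```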
I see no performance improvements in the viewport on a 1650, maybe even a slight regression, and #118020.
Docs say the 16XX series is supported, but maybe it's only the 166X models?
A GPU being supported doesn't necessarily mean that it would be faster than any CPU. In your case, the GPU doesn't have tensor cores, so it's emulated, which means that performance could be actually worse than for some CPUs.
GPU-accelerated OIDN has been disabled in the 4.1 release for AMD GPUs, despite it being available in the beta. Is it possible for the feature to be enabled via the Experimental section in Blender's settings?

Not in 4.1. The way OIDN currently works, making it available at all risks crashes even when not using it.
@brecht Couldn't you perhaps add an environment variable (or use OIDN's already existing `OIDN_DEVICE_HIP`) which could enable OIDN on HIP? In most cases there shouldn't be a crash (it has been fixed for all known cases so far), and this way interested users could give it a try.

Has all code bringing support for GPU OIDN on AMD GPUs been removed from Blender until 4.2? If someone wishes to use that feature, should they use the latest 4.1 beta instead?
I've made it possible now to set the `OIDN_DEVICE_HIP` environment variable for 4.1. To use this, you either have to use 4.1 with this environment variable set, or 4.2 alpha without it.
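A minimal usage sketch, assuming the variable name from this thread and a Unix-like shell (on Windows, set the variable in the environment before launching Blender):

```shell
# Opt in to HIP-accelerated OIDN denoising in Blender 4.1.
export OIDN_DEVICE_HIP=1
blender
```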