Viewport performance slows down when using OptiX Denoiser on Windows machine with dual GPUs. #95836

Open
opened 2022-02-17 15:01:54 +01:00 by Hyesung · 23 comments

System Information
Operating system:
Windows 11 (Windows 10 has the same bug)
Graphics card:
2x Geforce RTX 3090
Graphics driver:
Nvidia Driver 511.65 / 511.79 / 472.12 ...

Blender Version
Broken:
Blender 3.1.0 05697470abd0d / All versions of 3.0, 3.1, and 3.2 behave the same.

Worked:
Blender 2.90

Short description of error
Viewport rendering slows down when using the OptiX viewport denoiser with two GPUs enabled for OptiX.
If only one GPU is checked, it works at normal speed.

As far as I know, this is a bug on Windows 10 and Windows 11; it probably works normally on Linux.

Exact steps for others to reproduce the error

  1. Run Blender on Windows.
  2. Check both GPUs under OptiX in Preferences.
  3. Turn on the viewport denoiser (OptiX).
  4. Rotate the default cube. It renders slowly.
  5. Uncheck one GPU. It works at normal speed again.

[1.mp4](https://archive.blender.org/developer/F12871326/1.mp4)

Author

Added subscriber: @Hyesung

#93178 was marked as duplicate of this issue

Hyesung changed title from Viewport performance slows down when using OptiX Denoiser on Window machine with dual GPUs. to Viewport performance slows down when using OptiX Denoiser on Windows machine with dual GPUs. 2022-02-17 15:23:14 +01:00

Added subscriber: @mano-wii

It's possible that this is not a bug.
My guess is that since both GPUs are being heavily used for rendering, there isn't much left to handle the Denoiser.
But perhaps the use of GPU resources could be better managed.

EDIT: However, if that were the case, PCs with a single GPU would also have the same problem.

Author

I heard from a friend that this is not reproducible on Linux.
There's something wrong with the Windows version of Blender.

And it's normal in Blender 2.90, so I guess it's a bug in Cycles X.

Added subscriber: @iss

Changed status from 'Needs Triage' to: 'Confirmed'

Can reproduce when checking GPU + CPU as well.

Added subscribers: @pmoursnv, @brecht

@pmoursnv, is this something you can look into?

Note that slower performance with CPU + GPU is not necessarily unexpected here. But if it's slower than 2.90 and only happening on Windows, that is unexpected.

Added subscribers: @Hongyu, @Blendify, @Harti

Member

Added subscriber: @Sergey

Member

This is not Windows-specific; it is related to how tiling and multi-GPU rendering are currently implemented in Cycles X.

Before 3.0 there was quite a bit of code that ensured denoising of individual tiles in the viewport would run on the same GPUs as rendering, and as such no copies needed to be performed (see e.g. https://developer.blender.org/diffusion/B/browse/blender-v2.93-release/intern/cycles/device/device.cpp$703 and https://developer.blender.org/diffusion/B/browse/blender-v2.93-release/intern/cycles/device/device_multi.cpp$664). A lot of that went away as part of Cycles X, due to changes in how Cycles X handles tiling and because the denoising pipeline was rewritten. As of right now this means Cycles X does not perform denoising on the same devices as rendering, but instead always performs denoising on the first OptiX device. So if you are rendering with two or more GPUs, all the results from the second and further GPUs first have to be copied to the first GPU (see https://developer.blender.org/diffusion/B/browse/master/intern/cycles/integrator/path_trace.cpp$525), then denoised, and then copied back to the individual GPUs. And those copies do a roundtrip through the CPU, which is really slow.

Unfortunately, I don't see a quick fix for this without again fundamentally changing how denoising is handled in Cycles X (to be able to denoise the individual tiles each GPU is working on directly on those devices). In its current form it could maybe be improved a bit by changing the copy to be a GPU-to-GPU copy rather than GPU-to-CPU-to-GPU, but having a copy at all would still add overhead. Also tagging @Sergey.

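For illustration of what such a GPU-to-GPU copy could look like outside of Cycles, here is a minimal standalone sketch using the CUDA runtime API; the buffer names and sizes are hypothetical and this is not the actual Cycles device code:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
  /* Hypothetical RGBA float render buffer for a 1920x1080 viewport. */
  const size_t buffer_bytes = size_t(1920) * 1080 * 4 * sizeof(float);

  /* Source buffer on the second GPU, destination on the denoiser GPU. */
  float *src = nullptr, *dst = nullptr;
  cudaSetDevice(1);
  cudaMalloc(&src, buffer_bytes);
  cudaSetDevice(0);
  cudaMalloc(&dst, buffer_bytes);

  /* Enable peer access so the copy can go directly over PCIe/NVLink. */
  int can_access = 0;
  cudaDeviceCanAccessPeer(&can_access, /*device=*/0, /*peerDevice=*/1);
  if (can_access) {
    cudaDeviceEnablePeerAccess(/*peerDevice=*/1, 0);
  }

  /* With peer access enabled this is a direct GPU-to-GPU transfer; without
   * it, the driver stages the copy through host memory, which is the slow
   * path described above. */
  cudaMemcpyPeer(dst, /*dstDevice=*/0, src, /*srcDevice=*/1, buffer_bytes);
  cudaDeviceSynchronize();

  printf("peer access available: %d\n", can_access);

  cudaFree(dst);
  cudaSetDevice(1);
  cudaFree(src);
  return 0;
}
```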

I wonder how much speedup we can get from a GPU-to-GPU copy, and from optimizations around it? The reason being that each device denoising its own render buffer is not going to work if we implement [D14014: Cycles multi-GPU distribution using slices [WIP]](https://archive.blender.org/developer/D14014).

Without denoising we also copy everything to a single OpenGL texture, and I guess that has acceptable performance, but that half-float RGBA buffer is smaller of course. Maybe we could do something like running the `filter_guiding_preprocess` kernel on each device, writing out a half-float buffer to be copied to a single GPU for denoising?

At least for interactive rendering, there should also be no need to copy the denoised results back to each device.

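As a rough sketch of that idea (this is not the actual `filter_guiding_preprocess` kernel; the RGBA layout and names are assumptions), packing to half floats halves the bytes that have to cross to the denoiser GPU, roughly 16.6 MB instead of 33.2 MB for a single 1920x1080 RGBA pass:

```cpp
#include <cuda_fp16.h>

/* Hypothetical packing kernel: convert an RGBA float buffer on the rendering
 * GPU into half floats before it is copied to the denoiser GPU. */
__global__ void pack_rgba_to_half(const float4 *in, __half2 *out, int num_pixels)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= num_pixels) {
    return;
  }
  const float4 p = in[i];
  /* Two __half2 values per pixel: half the transfer size of a float4. */
  out[2 * i + 0] = __floats2half2_rn(p.x, p.y);
  out[2 * i + 1] = __floats2half2_rn(p.z, p.w);
}
```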
Member

Good idea, though it would need a bit of refactoring. Probably something like this (pseudocode):

```
foreach (device in path_trace_works_) {
    device->render_samples()
}

if (denoising) {
    denoiser_device = get_denoiser_device()

    if (path_trace_works_.size() > 1 || path_trace_works_[0] != denoiser_device) {
        foreach (device in path_trace_works_) {
            guiding_buffer = device->filter_guiding_preprocess()
            denoiser_device->copy_from(guiding_buffer) // Direct GPU to GPU copy if possible
        }
    }
    else {
        denoiser_device->filter_guiding_preprocess() // Perform in-place
    }

    denoiser_device->denoise_buffer()
    denoiser_device->copy_to_display()
}
else {
    foreach (device in path_trace_works_) {
        device->copy_to_display()
    }
}
```

Currently there are three GPU-to-CPU-to-GPU roundtrips per device (the first to get data from the devices to the denoiser device, the second to get the denoised data from the denoiser device back to the devices, and the third to get data from the devices to the display, which does not use interop when there are multiple devices as far as I can see); with the two GPUs from this report, that is six roundtrips.
This would reduce that to one GPU-to-GPU copy per device plus a single GPU-to-CPU-to-GPU roundtrip to get data to the display. Ideally this could be improved further by detecting whether the denoiser device is the same device used by OpenGL and using interop to do a direct GPU-to-GPU copy for the last part as well. That still leaves some copies, but considerably fewer than before, and less data to transfer.

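For reference, the interop path for that last copy could look roughly like the following; this is a standalone sketch with hypothetical names (`display_pbo`, `denoised`), not the existing Cycles display driver code, and it assumes the denoiser device is the same device the GL context runs on:

```cpp
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

/* Sketch: write the denoised result into an OpenGL pixel buffer object with a
 * device-to-device copy, avoiding the GPU-to-CPU-to-GPU roundtrip to display. */
static void copy_denoised_to_display(GLuint display_pbo, const void *denoised, size_t bytes)
{
  /* In real code the registration would be done once, not per frame. */
  cudaGraphicsResource_t resource = nullptr;
  cudaGraphicsGLRegisterBuffer(&resource, display_pbo, cudaGraphicsRegisterFlagsWriteDiscard);

  cudaGraphicsMapResources(1, &resource, /*stream=*/0);

  void *mapped = nullptr;
  size_t mapped_bytes = 0;
  cudaGraphicsResourceGetMappedPointer(&mapped, &mapped_bytes, resource);

  /* Device-to-device copy into the GL buffer; no host staging. */
  cudaMemcpy(mapped, denoised, bytes, cudaMemcpyDeviceToDevice);

  cudaGraphicsUnmapResources(1, &resource, /*stream=*/0);
  cudaGraphicsUnregisterResource(resource);
}
```

Per the comment above, Cycles already appears to use an interop path like this when a single rendering device matches the display device; the suggestion is to detect and use it in the multi-device case as well.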

Yes, that pseudo-code looks right.

Contributor

Added subscriber: @Raimund58

Removed subscriber: @Harti

Author

Is it possible to solve this problem by connecting the GPUs with NVLink?

This issue was referenced by 79787bf8e1e1d766e34dc6f8c5eda2efcceaa6cc
Philipp Oeser removed the Interest: Render & Cycles label 2023-02-09 14:03:19 +01:00

![image](/attachments/f6d7f945-0bf2-4d4d-af98-74bf59ec35d7)

I'd like to add that when using two GPUs and scaling the camera view really small, it shows this white artifact.

@Hyesung Please submit a bug report about it, filling in all the requested bits of information in the form template.
