Viewport performance slows down when using OptiX Denoiser on Windows machine with dual GPUs. #95836

Open
opened 2022-02-17 15:01:54 +01:00 by Hyesung · 23 comments

System Information
Operating system:
Windows 11 (Windows 10 has the same bug)
Graphics card:
2x Geforce RTX 3090
Graphics driver:
Nvidia Driver 511.65 / 511.79 / 472.12 ...

Blender Version
Broken:
Blender 3.1.0 05697470abd0d / All versions of 3.0, 3.1, and 3.2 behave the same.

Worked:
Blender 2.90

Short description of error
Viewport rendering slows down when using the OptiX viewport denoiser with two GPUs enabled for OptiX.
If only one GPU is checked, it works at normal speed.

As far as I know, this is a bug on Windows 10 and Windows 11; it probably works normally on Linux.

Exact steps for others to reproduce the error

  1. Run Blender on Windows.
  2. Check both GPUs under OptiX in Preferences.
  3. Turn on the viewport denoiser (OptiX).
  4. Rotate the default cube. It renders slowly.
  5. Uncheck one GPU. It works at normal speed again.

[1.mp4](https://archive.blender.org/developer/F12871326/1.mp4)

Author

Added subscriber: @Hyesung

#93178 was marked as duplicate of this issue

Hyesung changed title from Viewport performance slows down when using OptiX Denoiser on Window machine with dual GPUs. to Viewport performance slows down when using OptiX Denoiser on Windows machine with dual GPUs. 2022-02-17 15:23:14 +01:00

Added subscriber: @mano-wii

It's possible that this is not a bug.
My guess is that since both GPUs are being heavily used for rendering, there isn't much left to handle the Denoiser.
But perhaps the use of GPU resources could be better managed.

EDIT: However, if that were the case, PCs with a single GPU would also have the same problem.

Author

I heard from a friend that this is not reproducible on Linux.
There's something wrong with the Windows version of Blender.

And it's normal in Blender 2.90, so I guess it's a bug in Cycles X.

Added subscriber: @iss

Changed status from 'Needs Triage' to: 'Confirmed'

Can reproduce when checking GPU + CPU as well.

Added subscribers: @pmoursnv, @brecht

@pmoursnv, is this something you can look into?

Note that slower performance with CPU + GPU is not necessarily unexpected here. But if it's slower than 2.90 and only happening on Windows, that is unexpected.

Added subscribers: @Hongyu, @Blendify, @Harti

Member

Added subscriber: @Sergey

Member

This is not Windows-specific; it is related to how tiling and multi-GPU rendering are currently implemented in Cycles X.

Before 3.0 there was quite a bit of code that ensured denoising of individual tiles in the viewport would run on the same GPUs as rendering, and as such no copies needed to be performed (see e.g. https://developer.blender.org/diffusion/B/browse/blender-v2.93-release/intern/cycles/device/device.cpp$703 and https://developer.blender.org/diffusion/B/browse/blender-v2.93-release/intern/cycles/device/device_multi.cpp$664). A lot of that went away as part of Cycles X, due to changes in how Cycles X handles tiling and because the denoising pipeline was rewritten. As of right now this means Cycles X does not perform denoising on the same devices as rendering, but instead always performs denoising on the first OptiX device. So if you are rendering with two or more GPUs, all the results from the second and further GPUs first have to be copied to the first GPU (see https://developer.blender.org/diffusion/B/browse/master/intern/cycles/integrator/path_trace.cpp$525), then denoised, and then copied back to the individual GPUs. And those copies do a roundtrip through the CPU, which is really slow.

Unfortunately, I don't see a quick fix for this without again fundamentally changing how denoising is handled in Cycles X (to be able to denoise the individual tiles each GPU is working on directly on those devices). In its current form it could maybe be improved a bit by changing the copy to be a GPU-to-GPU copy rather than GPU-to-CPU-to-GPU, but having a copy at all would still add overhead. Also tagging @Sergey.

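For illustration of what such a GPU-to-GPU copy could look like outside of Cycles, here is a minimal standalone sketch using the CUDA runtime API; the buffer names and sizes are hypothetical and this is not the actual Cycles device code:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
  /* Hypothetical RGBA float render buffer for a 1920x1080 viewport. */
  const size_t buffer_bytes = size_t(1920) * 1080 * 4 * sizeof(float);

  /* Source buffer on the second GPU, destination on the denoiser GPU. */
  float *src = nullptr, *dst = nullptr;
  cudaSetDevice(1);
  cudaMalloc(&src, buffer_bytes);
  cudaSetDevice(0);
  cudaMalloc(&dst, buffer_bytes);

  /* Enable peer access so the copy can go directly over PCIe/NVLink. */
  int can_access = 0;
  cudaDeviceCanAccessPeer(&can_access, /*device=*/0, /*peerDevice=*/1);
  if (can_access) {
    cudaDeviceEnablePeerAccess(/*peerDevice=*/1, 0);
  }

  /* With peer access enabled this is a direct GPU-to-GPU transfer; without
   * it, the driver stages the copy through host memory, which is the slow
   * path described above. */
  cudaMemcpyPeer(dst, /*dstDevice=*/0, src, /*srcDevice=*/1, buffer_bytes);
  cudaDeviceSynchronize();

  printf("peer access available: %d\n", can_access);

  cudaFree(dst);
  cudaSetDevice(1);
  cudaFree(src);
  return 0;
}
```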

I wonder how much speedup we can get from a GPU-to-GPU copy, and from optimizations around it? The reason being that each device denoising its own render buffer is not going to work if we implement [D14014: Cycles multi-GPU distribution using slices [WIP]](https://archive.blender.org/developer/D14014).

Without denoising we also copy everything to a single OpenGL texture, and I guess that has acceptable performance, but that half-float RGBA buffer is smaller of course. Maybe we could do something like running the `filter_guiding_preprocess` kernel on each device, writing out a half-float buffer to be copied to a single GPU for denoising?

At least for interactive rendering, there should also be no need to copy the denoised results back to each device.

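As a rough sketch of that idea (this is not the actual `filter_guiding_preprocess` kernel; the RGBA layout and names are assumptions), packing to half floats halves the bytes that have to cross to the denoiser GPU, roughly 16.6 MB instead of 33.2 MB for a single 1920x1080 RGBA pass:

```cpp
#include <cuda_fp16.h>

/* Hypothetical packing kernel: convert an RGBA float buffer on the rendering
 * GPU into half floats before it is copied to the denoiser GPU. */
__global__ void pack_rgba_to_half(const float4 *in, __half2 *out, int num_pixels)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= num_pixels) {
    return;
  }
  const float4 p = in[i];
  /* Two __half2 values per pixel: half the transfer size of a float4. */
  out[2 * i + 0] = __floats2half2_rn(p.x, p.y);
  out[2 * i + 1] = __floats2half2_rn(p.z, p.w);
}
```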
Member

Good idea, though it would need a bit of refactoring. Probably something like this (pseudocode):

```
foreach (device in path_trace_works_) {
    device->render_samples()
}

if (denoising) {
    denoiser_device = get_denoiser_device()

    if (path_trace_works_.size() > 1 || path_trace_works_[0] != denoiser_device) {
        foreach (device in path_trace_works_) {
            guiding_buffer = device->filter_guiding_preprocess()
            denoiser_device->copy_from(guiding_buffer) // Direct GPU to GPU copy if possible
        }
    }
    else {
        denoiser_device->filter_guiding_preprocess() // Perform in-place
    }

    denoiser_device->denoise_buffer()
    denoiser_device->copy_to_display()
}
else {
    foreach (device in path_trace_works_) {
        device->copy_to_display()
    }
}
```

Currently there are three GPU-to-CPU-to-GPU roundtrips per device (the first to get data from the devices to the denoiser device, the second to get the denoised data from the denoiser device back to the devices, and the third to get data from the devices to the display, which does not use interop when there are multiple devices as far as I can see); with the two GPUs from this report, that is six roundtrips.
This would reduce that to one GPU-to-GPU copy per device plus a single GPU-to-CPU-to-GPU roundtrip to get data to the display. Ideally this could be improved further by detecting whether the denoiser device is the same device used by OpenGL and using interop to do a direct GPU-to-GPU copy for the last part as well. That still leaves some copies, but considerably fewer than before, and less data to transfer.

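For reference, the interop path for that last copy could look roughly like the following; this is a standalone sketch with hypothetical names (`display_pbo`, `denoised`), not the existing Cycles display driver code, and it assumes the denoiser device is the same device the GL context runs on:

```cpp
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

/* Sketch: write the denoised result into an OpenGL pixel buffer object with a
 * device-to-device copy, avoiding the GPU-to-CPU-to-GPU roundtrip to display. */
static void copy_denoised_to_display(GLuint display_pbo, const void *denoised, size_t bytes)
{
  /* In real code the registration would be done once, not per frame. */
  cudaGraphicsResource_t resource = nullptr;
  cudaGraphicsGLRegisterBuffer(&resource, display_pbo, cudaGraphicsRegisterFlagsWriteDiscard);

  cudaGraphicsMapResources(1, &resource, /*stream=*/0);

  void *mapped = nullptr;
  size_t mapped_bytes = 0;
  cudaGraphicsResourceGetMappedPointer(&mapped, &mapped_bytes, resource);

  /* Device-to-device copy into the GL buffer; no host staging. */
  cudaMemcpy(mapped, denoised, bytes, cudaMemcpyDeviceToDevice);

  cudaGraphicsUnmapResources(1, &resource, /*stream=*/0);
  cudaGraphicsUnregisterResource(resource);
}
```

Per the comment above, Cycles already appears to use an interop path like this when a single rendering device matches the display device; the suggestion is to detect and use it in the multi-device case as well.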

Yes, that pseudo-code looks right.

Contributor

Added subscriber: @Raimund58

Removed subscriber: @Harti

Author

Is it possible to solve this problem by connecting the GPUs with NVLink?

This issue was referenced by 79787bf8e1e1d766e34dc6f8c5eda2efcceaa6cc
Philipp Oeser removed the Interest: Render & Cycles label 2023-02-09 14:03:19 +01:00

![image](/attachments/f6d7f945-0bf2-4d4d-af98-74bf59ec35d7)

I'd like to add that when using two GPUs and scaling the camera view really small, it shows this white artifact.

@Hyesung Please submit a bug report about it, filling in all the requested bits of information in the form template.
