"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!" with nVidia GPU and AMD iGPU #119444
Reference: blender/blender#119444
System Information
Operating system: Windows 10 Pro 22H2 19045.4046
Graphics card: nVidia RTX 3060, AMD Radeon Graphics in Ryzen 7 5700G
Blender Version
Broken: 4.1 Release Candidate, 3e8ed795cb
Worked: 4.0.2
Short description of error
On a system with an NVIDIA GPU and an AMD iGPU, Blender crashes when you switch to Cycles, with the error:
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
HIP is not even selected in Preferences > System. Selecting CUDA, OptiX, or None does not help.
Exact steps for others to reproduce the error
Based on the default startup file, switch the render engine to Cycles.
I wanted to attach the crash log, but AppData/Local/Temp is empty. I tried emptying the temp folder and crashing again, and still no files appeared. It seems Blender is not creating a crash log.
@brecht committed a fix for an issue like this a few days ago (c388ed1e53), and this fix is in the version of Blender you're using, so it's odd that you're experiencing issues. I will note that the previous user who experienced this issue was able to resolve it by updating their AMD iGPU drivers. This may help you too. Can you give it a try?
@aafra, @salipour, I'll check if I can find some way to fix this tomorrow. I guess OIDN is initializing HIP when we query info about NVIDIA devices.
If we can't fix it this week, I will probably disable HIP support for OIDN in 4.1. It's quite bad to make Cycles GPU rendering crash entirely because of an old AMD driver for the integrated GPU, when we don't even want to use that GPU, but rather an NVIDIA card.
Thank you! Updating the driver fixed everything.
@brecht I will add these old APU targets to the OIDN patch release coming out today. It’s an ugly solution, but I think it’s still good enough for now. We now have confirmation that the issue happens only with older drivers. If I add all these old targets, OIDN shouldn’t crash anymore (we previously fixed such a crash for newer APUs this way). We don’t need to worry about any newer targets which don’t exist in HIP yet, because those GPUs would require a new driver to work anyway. So this should be a definitive fix. Would this be good enough for Blender?
I could look into a more elegant solution as well, but the problem is that AFAIK this issue was reproduced only with different kinds of APUs, and I don’t have access to such a machine. I cannot reproduce this issue with a dGPU (by removing the targets for it). But I think adding the APU targets should also work reliably.
@aafra and @brecht I have access to a Ryzen 5 5600G, along with AMD, Intel, and NVIDIA dGPUs.
If you need any testing done, I can do it on that computer. However, that computer is not properly set up for testing at the moment, and I'd need to do some deconstructing and rebuilding of computers to get it into a testable state. I estimate it would take me about an hour or more to set it up.
@lichtwerk has a laptop equipped with a Ryzen 9 5900HX and an NVIDIA RTX 3080. They may also be able to help out with testing here.
I don't trust this enough, especially not a few days (or even weeks) before the release. If we miss some architecture or other factor, it means using Cycles in Blender 4.1 could crash for a large number of users. I don't want to take that risk.
In #119448 I'm trying to patch OIDN to not load the HIP module until we explicitly ask it to. I think that's a more reliable solution, but I haven't tested it yet.
@brecht We cannot miss any architecture because the list of possible HIP targets is well known, but I get your point. But this is the best I can do on the OIDN side at the moment, especially at such extremely short notice.
@brecht The main drawback of your solution is that if there is an AMD integrated GPU in the system, OIDN HIP support will be disabled for discrete AMD GPUs too, even if the driver version is up to date. I think this is too limiting.
A potentially better and safer solution would be to simply check the HIP driver/runtime version and load the OIDN HIP module only if it's recent enough to include the fix. This way, we wouldn't need to care about unsupported HIP devices at all. With recent drivers, OIDN would work with discrete AMD GPUs too, even if there is an unsupported iGPU. I think this should be very easy to add to Cycles, but I'm also trying to add this directly to OIDN. It's a lot trickier in OIDN because I don't know exactly when the crash happens. I just managed to get indirect access to a machine, so hopefully I can figure this out today.
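A minimal sketch of the version-gated loading idea, written in Python for brevity. It assumes the HIP driver version has already been obtained by some safe means (which, as the following comments show, is the difficult part). should_load_hip_module, modules_to_load, and the module names are hypothetical, not OIDN or Cycles functions.

```python
# Hypothetical sketch: load the HIP device module only when the driver is new
# enough to contain the fix. None of these names are OIDN or Cycles APIs.
from typing import List, Optional, Tuple

Version = Tuple[int, ...]


def should_load_hip_module(driver_version: Optional[Version],
                           min_version: Version) -> bool:
    """Return True only for a known driver version at or above min_version."""
    if driver_version is None:
        # Version could not be determined: skip HIP rather than risk a crash.
        return False
    # Tuples compare element by element, e.g. (24, 1, 1) < (24, 2, 0).
    return driver_version >= min_version


def modules_to_load(driver_version: Optional[Version],
                    min_version: Version) -> List[str]:
    """Other backends are unaffected; only HIP depends on the driver check."""
    modules = ["cpu", "cuda", "sycl"]
    if should_load_hip_module(driver_version, min_version):
        modules.append("hip")
    return modules
```

The decision depends only on the driver version, not on which AMD GPUs are present, so with a current driver a discrete AMD GPU keeps HIP denoising even when an unsupported iGPU is also installed.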
After discussing with Sergey here, I think we should do the following:
To check the driver version on the Cycles side, we have hipewHasOldDriver, though this code is for Windows only. It checks the version of the DLL file, which should be quite safe compared to actually initializing HIP, something we want to avoid unless a user has actually enabled it in the preferences. This bug also existed in the Linux drivers; I'm not sure if there is an equivalent safe way of checking the version there, or whether a user is likely to have such an old driver.
Either way, I still want to get more testing for such a fix and not include it in 4.1.0 immediately.
@brecht Why not check the driver or runtime version with hipDriverGetVersion or hipRuntimeGetVersion? These are available on both Windows and Linux.

Because we would then have to load the HIP shared library and call its functions, which we don't want to do until a user has explicitly enabled HIP in the preferences. We've had crashes doing that for CUDA, OpenCL and HIP in the past, and at least on Linux + HIP I know it's still possible for this to happen now.
I'm confused now. This crash with the integrated AMD GPUs is specific to OIDN, right? So because of this you don't even want to load the HIP runtime at all for older driver versions, even though you did before without OIDN?
Without OIDN, Cycles does not load the HIP runtime until a user explicitly enables HIP in the preferences.
OIDN always initializes all the device types. Ideally Cycles could tell it to load just the ones that we want. We can do that with the environment variables, but this OIDN initialization happens only once, so it can't be updated when a user is editing the preferences.
So really, adding OIDN also adds some risk for CUDA, since ideally we also do not initialize that until a user asks for it. But it's been a long time since I saw issues with that, so it's probably ok. And there isn't really an equivalent situation with an integrated NVIDIA GPU.
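The following Python sketch models why the environment-variable approach can't follow preference changes: the library reads its device configuration once, on first initialization, and ignores later changes. DenoiserLibrary is hypothetical, and the OIDN_DEVICE_<TYPE> variable names are an assumption modeled on OIDN's per-device-type switches; check the OIDN documentation for the exact names and semantics.

```python
# Hypothetical model of a library that reads device settings from the
# environment exactly once. DenoiserLibrary is illustrative, not OIDN code;
# the OIDN_DEVICE_<TYPE> names are an assumption.
import os
from typing import List, Mapping, Optional, Set


class DenoiserLibrary:
    DEVICE_TYPES = ("cpu", "cuda", "hip")

    def __init__(self, env: Optional[Mapping[str, str]] = None):
        self._env = os.environ if env is None else env
        self._enabled: Optional[Set[str]] = None  # frozen after first init

    def _init_once(self) -> Set[str]:
        if self._enabled is None:
            # "0" disables a device type; anything else (or unset) enables it.
            self._enabled = {
                t for t in self.DEVICE_TYPES
                if self._env.get(f"OIDN_DEVICE_{t.upper()}", "1") != "0"
            }
        return self._enabled

    def enabled_device_types(self) -> List[str]:
        return sorted(self._init_once())
```

If HIP is disabled before the first call and the user later enables HIP in the preferences, enabled_device_types() keeps returning ['cpu', 'cuda'], because the set was frozen at initialization.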
@Alaska I'd like to ask for your help with debugging this on the machine you have. Could you please do the following?
In the bin directory of the OIDN package, run: oidnBenchmark -ld
What's the output of this command or does any error/crash happen?
Thanks!
@brecht What you describe seems to be a somewhat different matter: a change in behavior in OIDN. You can disable loading some device modules with environment variables, but once OIDN is initialized, you can't load any more modules. In Blender the user can change the device at runtime, so this doesn't seem like a feasible solution. So what exactly do you need from OIDN? To lazily load device modules? That is possible only if Cycles never queries the available devices in OIDN and instead directly creates the kind of device it needs (e.g. CUDA). Also, it would not be possible to unload any already loaded device modules. In any case, such a feature cannot be added to OIDN before the next major release.
Edit: Blender currently does query all available OIDN devices to find a device by PCI ID. As long as Blender uses this logic, lazily loading only selected device types isn't possible unless OIDN introduces some new API functions. In any case, this would be a major change in OIDN.
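A toy Python model of the problem described in the edit above: if the only way to find a device by PCI address is to enumerate every physical device (loosely analogous to iterating over oidnGetNumPhysicalDevices() in OIDN 2.x), then every backend module gets loaded, including HIP, even when only a CUDA device is wanted. LazyModules and its methods are hypothetical.

```python
# Toy model: each backend module is "loaded" the first time it is touched.
# LazyModules is hypothetical and only illustrates the loading order.
from typing import Dict, List, Optional, Tuple


class LazyModules:
    def __init__(self, devices_by_type: Dict[str, List[str]]):
        self._devices_by_type = devices_by_type  # type -> PCI addresses
        self.loaded: List[str] = []              # modules loaded so far

    def _devices(self, device_type: str) -> List[str]:
        if device_type not in self.loaded:
            # Stands in for loading and initializing the backend module.
            self.loaded.append(device_type)
        return self._devices_by_type[device_type]

    def all_devices(self) -> List[Tuple[str, str]]:
        # Enumerating every physical device touches every module.
        return [(t, pci) for t in self._devices_by_type
                for pci in self._devices(t)]

    def find_by_pci(self, pci_address: str) -> Optional[str]:
        for device_type, pci in self.all_devices():
            if pci == pci_address:
                return device_type
        return None
```

Looking up an NVIDIA card this way still "loads" the hip entry. In this issue, it is the real equivalent of that step, initializing HIP, that crashed with the old iGPU driver.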
@aafra I'll test this tomorrow unless someone else tests it first.
Indeed we'd want to lazily load device modules. It's not obvious what the right API for that would be. For Cycles something simple like this would work, if it prevents other functions from automatically initializing all device types:
In general, you may need a bigger change to avoid race conditions, like passing an optional OIDNDeviceType to all the functions that query device information, to limit them to a single device type and initialize only that one on demand. Of course, something sneakier with environment variables is possible too, as I was doing in #119448.
Unloading devices is not important, we don't do that in Cycles either. It's just to avoid crashing.
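The optional-device-type idea can be sketched the same way. FilteredModules and physical_devices are hypothetical names, not proposed OIDN signatures: passing a type restricts both the query and the module initialization to that backend, while passing None keeps the enumerate-everything behavior.

```python
# Hypothetical sketch: an optional device type limits which backend modules a
# query initializes. Not an OIDN API; names are illustrative only.
from typing import Dict, List, Optional, Set, Tuple


class FilteredModules:
    def __init__(self, devices_by_type: Dict[str, List[str]]):
        self._devices_by_type = devices_by_type  # type -> PCI addresses
        self.loaded: Set[str] = set()            # modules initialized so far

    def _load(self, device_type: str) -> List[str]:
        self.loaded.add(device_type)  # stands in for loading the module
        return self._devices_by_type.get(device_type, [])

    def physical_devices(
            self, device_type: Optional[str] = None) -> List[Tuple[str, str]]:
        # None means "all types", which initializes every module.
        if device_type is not None:
            types = [device_type]
        else:
            types = list(self._devices_by_type)
        return [(t, pci) for t in types for pci in self._load(t)]
```

A caller that only cares about CUDA would call physical_devices("cuda") and never touch the HIP module.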
The PR to disable OIDN HIP is #119476.
@Alaska it would be convenient if you could test that this does indeed disable HIP support for OIDN. But I will check it here too.
@brecht I will consider lazy device module loading but I really don’t want to make major API changes because of this, especially not adding a device type parameter to all device query functions. The only reason why this would be needed is because Cycles iterates over all OIDN devices. Why is this necessary? Cycles itself decides what devices to use, so it should be able to create an OIDN device on the specific device it wants. If that device isn’t supported by OIDN, device creation fails. It’s unclear to me why matching by PCI address is needed for this.
This was added by @Stefan_Werner in #115854. I think it was done because we want a way to check if a given device is supported by OIDN, without actually creating it.
We want to communicate in the user interface if OIDN is supported on the device, but without the risk and overhead of actually creating an OIDN device on Blender startup. Maybe an API function to check that could be added? Or is it already possible?
I guess if that existed, it could do lazy module loading behind the scenes.
The overhead of just checking whether a device is supported wouldn’t be much lower than actually trying to create the device, and it wouldn’t be any less risky. You’re already taking a bigger risk because when you iterate over the devices, the same checks are done by OIDN for all of them, not just the ones that are selected in Blender. There’s no other way to get that list which Cycles iterates over.
Device creation doesn’t do much more than checking for support. So unless there is solid proof that trying to create a device is too costly, I don’t think adding new API functions would be justified.
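Under this approach, "is this device supported?" becomes "can a device be created for it?". In the real C API that would presumably mean creating and committing a device for the specific target (for example via oidnNewDeviceByPCIAddress and oidnCommitDevice), checking oidnGetDeviceError, and releasing it. The Python sketch below only models that control flow, with a hypothetical MockDevice and made-up target names.

```python
# Hypothetical sketch: use device creation itself as the support check.
# MockDevice stands in for a real device handle; targets are made up.
from typing import FrozenSet


class UnsupportedDeviceError(Exception):
    pass


class MockDevice:
    SUPPORTED: FrozenSet[str] = frozenset({"dgpu-a", "dgpu-b"})

    def __init__(self, target: str):
        # Creation performs the same checks a separate support query would.
        if target not in self.SUPPORTED:
            raise UnsupportedDeviceError(target)
        self.target = target
        self.released = False

    def release(self) -> None:
        self.released = True


def is_supported(target: str) -> bool:
    """Try to create a device for exactly one target, then release it."""
    try:
        device = MockDevice(target)
    except UnsupportedDeviceError:
        return False
    device.release()
    return True
```

Unlike enumerating every device, only the target Cycles actually asks about is checked.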
Ok, looking at the implementation it does look cheap enough to run on startup. I can make the change for Blender 4.2.
Great! For the next OIDN version I'll try to switch to lazy device module loading, when using only API functions specific to a particular device type.
Perhaps I could still add some kind of device support check to OIDN, but in an easier way than I initially thought. Instead of adding several new API functions for each device type, I could add just one: oidnIsDeviceSupported(OIDNDevice). You would still need to create the device object, but that doesn't really do anything yet. The actual work, including checking for support, happens only in oidnCommitDevice. If you don't want an initialized device, you could call oidnIsDeviceSupported instead of oidnCommitDevice, and then just release this uninitialized device object. This would do strictly the necessary checks, minimizing the risk as much as possible, and it would be easy enough to add to be worthwhile.

@brecht With Blender 4.1 and an RX 7800 XT, I can confirm that the GPU is not being used for denoising, and the Use GPU denoising button is greyed out.

@aafra Should I still do these tests?
@Alaska Yes, please do the tests.
Using an AMD Ryzen 5 5600G (early 2023 GPU drivers) with an Intel Arc A750, I can reproduce the crash, with the same error message in the logs, in Blender 4.1 3e8ed795cb (what the original reporter of this bug was using).

Using Blender 4.1 1640121a6313 (which includes Brecht's initial attempt at fixing this issue), it still crashes (currently expected).

Using Blender 4.1 335ff6efab67 (Brecht disabled HIP in OIDN), there are no crashes, and when I select my Intel GPU, it is used for denoising when enabled. This was just to reconfirm that everything works with one of the broken setups.

@aafra Running oidnBenchmark -ld with the v2.2.1 binaries from https://github.com/OpenImageDenoise/oidn/releases/tag/v2.2.1 just gives the error "hipErrorNoBinaryForGPU: Unable to find code object for all current devices!". If I run the benchmark on my Intel GPU or my CPU (oidnBenchmark -d sycl or oidnBenchmark -d cpu), I get the same error and OIDN crashes (I assume this is expected with the current code?). If there are any more tests you'd like me to run in the next few days, please let me know.
@Alaska Thanks a lot! Could you perhaps try to find out where exactly OIDN crashes, specifically in which HIP API call?
@aafra I assume this just means compiling a debug build of OIDN and stepping through the code to find the issue. If not, could you guide me through the process? If it's easier, we can talk on Blender chat or devtalk.
Blender chat: https://blender.chat/channel/render-cycles-module/members-list/Alaska
Devtalk: https://devtalk.blender.org/u/alaska/summary
@aafra Throughout the day I've been trying to compile OIDN, but I haven't managed to build a version with GPU support, or even one that works properly on the CPU. I'd need some assistance with this if you want me to test it for you.
I have asked for help on Blender-chat in the meantime just in case someone can offer some quick help. https://blender.chat/channel/blender-builds?msg=dSLy4yceTSgRKfjMh
Thanks @Alaska! I'm currently finishing up the new OIDN release. I could help with the compilation after the release. But before that, let's try the new OIDN 2.2.2 binaries on this machine. Hopefully, the bug will be fixed.
@Alaska Could you please try to run oidnBenchmark -ld using the fresh OIDN 2.2.2 binaries? https://github.com/OpenImageDenoise/oidn/releases/tag/v2.2.2

With the new 2.2.2 binaries, the crash doesn't occur.
With help from Attila Áfra in Blender chat, we managed to figure out which HIP calls were causing issues. In OIDN, it was hipGetDeviceCount. Attila wanted to find out whether we could detect the broken drivers with hipRuntimeGetVersion, but that also crashed.

@brecht With the help of @Alaska and @MarkFreeDev, we can conclude that the HIP error/crash doesn't happen with the following minimum driver versions (older versions may also work, but this is what has been tested so far):
Windows: Adrenalin Edition 24.1.1
Linux: ROCm 5.7.0
I also confirmed this Windows driver version on a different machine with a newer AMD APU.
So I think checking the driver version would be a robust solution. I'll look into how best to implement this in OIDN. Meanwhile, the workaround in OIDN 2.2.2 also seems to work, but it may be riskier.
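A minimal sketch of such a check, using the minimum versions reported above (Windows: Adrenalin Edition 24.1.1, Linux: ROCm 5.7.0). It assumes the installed version is already available as a plain dotted string in the same numbering scheme as these thresholds, obtained without initializing HIP; mapping a DLL file version to an Adrenalin release number is not shown. MIN_DRIVER, parse_version, and hip_driver_ok are hypothetical names.

```python
# Hypothetical per-platform check using the minimum driver versions tested
# in this thread. Older drivers might also work; these are conservative.
from typing import Dict, Tuple

MIN_DRIVER: Dict[str, Tuple[int, ...]] = {
    "windows": (24, 1, 1),  # Adrenalin Edition
    "linux": (5, 7, 0),     # ROCm
}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a purely numeric dotted string such as '24.1.1' into a tuple."""
    return tuple(int(part) for part in text.strip().split("."))


def hip_driver_ok(platform: str, version_text: str) -> bool:
    minimum = MIN_DRIVER.get(platform)
    if minimum is None:
        return False  # no tested minimum for this platform: don't load HIP
    return parse_version(version_text) >= minimum
```

Returning False for unknown platforms mirrors the cautious stance in this thread: when in doubt, skip HIP rather than risk crashing Cycles.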