Cycles HIP issues on Debian #102018
Reference: blender/blender#102018
System Information
Operating system: Linux-6.0.0-1-rt-amd64-x86_64-with-glibc2.35
Devuan GNU/Linux 5 (daedalus/ceres) x86_64 (aside from init system differences this distribution can be treated as Debian sid)
Graphics card: Vega 20 [Radeon VII]
system-info.txt
dmesg
lshw
Blender Version
Broken: 3.4.0 Alpha, branch: master, commit date: 2022-10-21 11:11, hash: 26f181c6b7
Short description of error
Debian recently started packaging ROCm drivers for AMD GPUs, and version 5.2.3-1 just landed in unstable (sid). According to the official AMD ROCm release notes, this version is equivalent to 22.20.1.
One of the differences between AMD's official version and Debian's is the ROCm installation path: /opt/rocm/ versus /usr/lib/x86_64-linux-gnu/ respectively.
When ROCm is installed from Debian's repositories, Blender does not detect the AMD GPU even though the ROCm platform is validated with AMD's tools.
https://docs.amd.com/bundle/ROCm_Installation_Guidev5.0/page/How_To_Install_ROCm.html#_Installation_Methods
rocminfo and hipconfig outputs with both GPUs detected:
rocminfo
hipconfig
And how it looks in package manager and Blender:
I suspect this is caused by ROCm paths being different between Blender libraries and HIP installed from Debian's packages, but I could be very much wrong.
Please consider this report as void if this is a mistake on my side. I'm reporting this here only because this installation differs from the official one and ROCm is a very new package in Debian.
Exact steps for others to reproduce the error
On Debian unstable (sid) Linux distribution:
rocminfo and hipconfig
Preferences -> System -> HIP
Additional info:
ROCm packaging is done by Debian AI team:
https://lists.debian.org/debian-ai/2022/10/
https://salsa.debian.org/rocm-team/community/team-project
Added subscriber: @silex
#102158 was marked as duplicate of this issue
Added subscribers: @brecht, @OmarEmaraDev
Changed status from 'Needs Triage' to: 'Needs Developer To Reproduce'
Looks like Blender only searched for /opt/rocm/hip/lib/libamdhip64.so in hipewHipInit? Not sure how this works.
@brecht Is this expected?
Added subscribers: @Sayak-Biswas, @BrianSavery
CC @Sayak-Biswas @bsavery.
I guess it should also look for libamdhip64.so anywhere in the library path instead of just a fixed location? Something like:
Changed status from 'Needs Developer To Reproduce' to: 'Confirmed'
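The library-path lookup suggested above can be sketched in a few lines. This is only an illustration of the idea, not the patch that was committed: the helper name `load_first` and the candidate list are assumptions made for this example, and the only paths used are ones quoted in this thread. Passing a bare file name to the loader lets the dynamic linker search the system library path, which is where Debian installs the library.

```python
import ctypes

# Candidates tried in order. The bare name is resolved by the dynamic
# loader through the system library path (for example
# /usr/lib/x86_64-linux-gnu/ on Debian); the second entry is the fixed
# location from AMD's own packages quoted in this thread.
HIP_CANDIDATES = [
    "libamdhip64.so",
    "/opt/rocm/hip/lib/libamdhip64.so",
]


def load_first(candidates, loader=ctypes.CDLL):
    """Return (path, handle) for the first candidate that loads.

    `loader` is injectable so the search order can be exercised without
    a ROCm installation. Returns (None, None) if nothing loads.
    """
    for path in candidates:
        try:
            return path, loader(path)
        except OSError:
            # ctypes.CDLL raises OSError when the library cannot be opened.
            continue
    return None, None
```

Calling `load_first(HIP_CANDIDATES)` on a machine without ROCm returns `(None, None)`; with the Debian packages installed, the bare name would be expected to resolve first.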
Added subscribers: @jmcelroy, @deadpin
From the merged-in bug: openSUSE ROCm binaries are being installed into both /opt/rocm-5.3.0/hip/lib/ and /opt/rocm-5.3.0/lib/. Hopefully, once this is fixed, those locations will be OK too.
This issue was referenced by af7dd99588
This issue was referenced by f66236a827
Changed status from 'Confirmed' to: 'Resolved'
I'm not able to test this myself, so if it doesn't work please let me know.
I think the libraries should be found now in /usr/lib/x86_64-linux-gnu/ as in the original report.
@brecht Thanks for working on this.
With the patch Blender crashes in mesa after navigating to Edit -> Preferences -> System -> HIP.
There is no crash log file unfortunately, but debugging shows that HIP is initialized:
Changed status from 'Resolved' to: 'Needs User Info'
I suspect that's a conflict between LLVM in HIP and LLVM in Mesa. We've had this issue before where there's a conflict between the LLVM in Blender with the one in Mesa, and for that we hide the LLVM symbols on the Blender side.
The way HIP is built on Debian might not hide those symbols. If that's the case it would be an issue for any application using HIP + OpenGL and would need to be solved by either Debian or ROCm developers. I don't see anything we can immediately do on the Blender side.
Thanks again. I'll report this on the ROCm GitHub or will try to forward this to the Debian AI team.
I reported this problem on the Debian AI mailing list and got a very quick response with some findings that might be helpful for diagnosing the problem: https://lists.debian.org/debian-ai/2022/11/msg00008.html
That looks like a different issue with a custom Blender build, something related to jemalloc and HIP.
It's not clear to me what they did to verify that the LLVM symbols are properly hidden. I think "Option 'h' registered more than once!" clearly points to a conflict between multiple LLVM versions.
Note that many LLVM symbols do not have a prefix to easily identify them, see for example this symbol blacklist we used to use for Blender (we switched to a whitelist since):
https://developer.blender.org/diffusion/B/browse/master/source/creator/blender.map;v3.2.2$76
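One rough way to look for visible LLVM symbols in a shared library is to list its defined dynamic symbols with `nm -D --defined-only` and filter them. The sketch below is a heuristic only, and the function names are made up for this example. As noted above, many LLVM symbols carry no identifying prefix, so an empty result does not prove the symbols are hidden; a non-empty result does show that at least some are exported.

```python
import subprocess


def exported_llvm_symbols(nm_output):
    """Return defined dynamic symbols whose name mentions "llvm".

    Expects text in the default `nm` format, where a defined symbol is
    printed as "<address> <type> <name>".
    """
    hits = []
    for line in nm_output.splitlines():
        fields = line.split()
        # Undefined symbols have no address column and are skipped here.
        if len(fields) == 3 and "llvm" in fields[2]:
            hits.append(fields[2])
    return hits


def check_library(path):
    """Run nm on the shared library at `path` and filter its output."""
    result = subprocess.run(
        ["nm", "-D", "--defined-only", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return exported_llvm_symbols(result.stdout)
```

`check_library` takes the full path to whichever library is under suspicion; it needs binutils installed and raises `subprocess.CalledProcessError` if `nm` fails.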
Added subscriber: @cgbloor
I rebuilt the ROCm stack with debug symbols and got a better backtrace. Blender dies during the static initialization of comgr while creating the -h option for comgr-objdump - x. It's not very clear to me why any command-line options are being registered once (let alone twice).
I have so many questions. What's the other component that has registered -h? What protects against double-registration when using AMD's binary packages? How might double-registration behaviour relate to a conflict between LLVM versions?
Registering those command line options is part of the static initialization of LLVM. If there are multiple LLVM libraries in memory with visible symbols, rather than each LLVM library initializing their own variables, the variables of one of the instances will be initialized twice. And then you get that error.
I don't know how the AMD binary package is built exactly, but presumably the LLVM symbols are hidden or there is some other mechanism to keep the LLVM symbols separate from the mesa LLVM symbols.
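The double-initialization mechanism described above can be modelled with a toy registry. This is a simplified analogy rather than LLVM code: `OptionRegistry` and `static_init` are hypothetical names, and the real behaviour depends on how the dynamic linker resolves symbols. The point it illustrates is that two library copies sharing one table collide, while two copies with private tables do not.

```python
class OptionRegistry:
    """Toy stand-in for a process-wide table of command-line options."""

    def __init__(self):
        self.options = set()

    def register(self, name):
        if name in self.options:
            raise RuntimeError(f"Option '{name}' registered more than once!")
        self.options.add(name)


def static_init(registry):
    """Model of one LLVM copy's load-time initialization: register -h."""
    registry.register("h")


# Hidden symbols: each copy resolves to its own registry, so both succeed.
mesa_registry, hip_registry = OptionRegistry(), OptionRegistry()
static_init(mesa_registry)
static_init(hip_registry)

# Visible symbols: both copies resolve to the same registry, so the second
# initialization hits the entry left by the first.
shared_registry = OptionRegistry()
static_init(shared_registry)
try:
    static_init(shared_registry)
except RuntimeError as err:
    print(err)
```

In this model, hiding the symbols corresponds to giving each copy its own `OptionRegistry`, which is the effect Blender gets by hiding its own LLVM symbols.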
Added subscriber: @Lendo
Changed status from 'Needs User Info' to: 'Resolved'
Changed status from 'Resolved' to: 'Confirmed'
Got accidentally closed by backporting to 3.3.
I believe this is still not something we can fix on the Blender side, but rather in Linux distribution packaging, so I will mark it as a known issue.
Added subscriber: @rherilier
Why use hypothetical paths to access libamdhip64.so and hipcc?
Here is a minimal CMakeLists.txt to find them at configure time:
As the CMake macro blender_add_lib(...) does not seem to allow extra compilation flags like "-DXXX=YYY", using a config.h.in should do the trick.
We build Blender binaries to work on multiple Linux distributions.
If some Linux distribution wants to build and package Blender with a specific path to the library I guess we could support that, but not sure why they would not have it in a system library path.
This has worked fine for CUDA so far, and I imagine distributions want to package HIP the same way.
You have a point.
brecht's solution should solve the problem: as a packaged hipcc depends on libamdhip64's development package (it's the case under Debian), there is no need to add extra names to search for like "libamdhip64.so.5".
Changed title from 'Blender does not detect Radeon VII GPUs with 5.2.3 ROCm drivers installed' to 'Cycles HIP issues on Debian'
I've renamed the task to reflect that this is actually no longer about detection; the issue remaining now is the Mesa LLVM vs. HIP LLVM conflict.
Added subscriber: @Grant520
The LLVM command-line issue was temporarily fixed in a recent ROCm update in Debian.
So far Blender can detect both GPUs, and rendering also works. As far as I'm concerned this task can be closed.
The problem still awaits a permanent fix upstream:
https://github.com/RadeonOpenCompute/ROCm-CompilerSupport/issues/52
Did anyone else test this yet on Debian? I'm curious what library versions and hardware were used. I'm still seeing this in Blender 3.4.1 when navigating to the HIP config dialog.
Here is the crash log:
I believe this might be due to the ROCm libraries that come with Debian testing still being too old for newer AMD chips:
I have an AMD Ryzen 9 7950X which has onboard video that has a device ID that is not recognized by 5.2.x versions of ROCm. The "unrecognized id" part seems to crash Blender, even though the following devices are valid.
LLVM Command Line issue was fixed in ROCm:
https://github.com/RadeonOpenCompute/ROCm-CompilerSupport/issues/52#issuecomment-1385754897
Since this is fixed in ROCm, I don't think there is anything we can do further on the Blender side besides hope all distros have updated or will do so soon, so closing.