Crash during render with AMD HIP on Linux #112084

Open
opened 2023-09-07 15:01:48 +02:00 by Cyril · 31 comments

System Information
Operating system: Linux Mint 21.2 (kernel 6.2.0-31-generic x86_64)
Graphics card: AMD Radeon RX 6600

Blender Version
Broken: 3.6.3 Release Candidate

I already wrote about this in another issue (https://projects.blender.org/blender/blender/issues/109431#issuecomment-1015443) but was told to create a new one. It has probably been reported already, but I could not determine which of the other reported issues are related to this one.

The following is copy-pasted from my original comment, in case it gets deleted and to make it easier to read.

When running from a terminal, an error message appears before the crash: "Memory access fault by GPU node-1 (Agent handle: 0x7feed4dc1000) on address 0x7fed00007000. Reason: Page not present or supervisor privilege.".

The problem occurred with only one .blend file. I tried a couple of others and rendering works without issues. The problem file is the Blender 3.2 splash screen (https://cloud.blender.org/p/gallery/629f23f908e12d4ff15241d3). After cleaning up the scene, I narrowed down the cause. The crash happens when there are two objects in the scene, one with Cast Shadow Caustics enabled (located in Object Properties > Shading > Caustics) and another with Receive Shadow Caustics enabled, plus a light source with Shadow Caustics enabled (located in Object Data Properties > Light). The objects must be in the camera view.

There is one thing I have not been able to figure out. In the splash screen scene, there is no crash if only the object with Cast Shadow Caustics enabled (water) is visible to the camera, but Blender crashes when the object with Receive Shadow Caustics enabled (ship) is visible. In the default Blender scene, however, after duplicating the cube and setting those parameters, Blender crashed regardless of which object was visible to the camera, even when only the cube with Cast Shadow Caustics was visible, a configuration that does not crash in the splash screen scene.

I uploaded both scenes. To reduce file size, I replaced the boat mesh in the cleaned-up splash screen scene with a cube (I replaced it in Edit Mode to preserve other parameters).
Cyril added the Priority: Normal, Type: Report, Status: Needs Triage labels 2023-09-07 15:01:49 +02:00
Brian Savery (AMD) self-assigned this 2023-09-08 01:54:17 +02:00

Does the 4.0.3 RC Blender daily build fix this issue for you? See #116697 (comment): https://projects.blender.org/blender/blender/issues/116697#issuecomment-1097599
Author

> Does the 4.0.3 RC Blender daily build fix this issue for you? #116697 (comment)

Both scenes that I uploaded have rendered without crashing since at least 4.0.1, but the full "Blender 3.2 splash screen" scene still crashes in both 4.0.1 and 4.0.3 RC.
Author

I modified the splash screen scene. It crashes in the Blender 4.1.0 release candidate, but not in the older Blender 3.6.2. So it looks like the original problem was fixed, but the fix introduced another one.

The crash error is the same as before.
Member

Hi, which version of the ROCm package do you have? Does it work with 5.7?
Can you attach crash logs? https://docs.blender.org/manual/en/dev/troubleshooting/crash.html#linux
Pratik Borhade added the Status: Needs Information from User label and removed the Status: Needs Triage label 2024-03-14 09:59:24 +01:00
Author

> Hi, which version of the ROCm package do you have? Does it work with 5.7?
> Can you attach crash logs? https://docs.blender.org/manual/en/dev/troubleshooting/crash.html#linux

I had ROCm 5.5.3 installed. Now I have ROCm 6.0.2. Blender 4.1.0 crashes with both. Blender 3.6.2 does not crash.

ROCm was installed using amdgpu-install (I mostly followed the installation instructions at https://amdgpu-install.readthedocs.io/en/latest/). I don't know if it's important, I just thought I'd mention it.

I now also use a different kernel version: 6.5.0-25-generic x86_64.

The crash log does not seem to contain anything unusual besides the same "Memory access fault" right at the end, as in the original problem.

I can render other scenes; e.g. the Blender 3.1 splash screen scene renders with both Blender versions without issues with HIP. The Blender 3.2 splash screen scene also renders in both Blender versions without HIP.

The scene I uploaded in the previous post is the Blender 3.2 splash screen scene, but I deleted as much as possible from it and left only the minimum set of objects that still causes the crash.

We currently have no real solution for the "Memory access fault" problem. I have suspicions it's an LLVM compiler bug introduced somewhere between ROCm 5.7 and ROCm 6.0, but haven't got the time to bisect through.

@Tolei Are you running Blender 4.1.0 from your distribution's packager, or are you using the official binaries? Past experiences have shown that the official binaries don't have this "memory access fault" problem.

Author

> We currently have no real solution for the "Memory access fault" problem. I have suspicions it's an LLVM compiler bug introduced somewhere between ROCm 5.7 and ROCm 6.0, but haven't got the time to bisect through.

The problem already existed with ROCm 5.5.3.

> @Tolei Are you running Blender 4.1.0 from your distribution's packager, or are you using the official binaries? Past experiences have shown that the official binaries don't have this "memory access fault" problem.

I use the official binaries.

Whatever was done between Blender 3.6.2 and 4.1.0 fixed the original problem but seems to have introduced another one. The same scene that I first noticed crashing continues to crash, but the crash happens for different reasons in those two Blender versions. Originally, I managed to determine that the problem was caused by shadow caustics, but I don't know what causes the crash in 4.1.0.

Just to summarize:
The two scenes that I uploaded before were both derived from the Blender 3.2 splash screen scene:

  • Blender-3.2-Splash-Screen.blend (https://projects.blender.org/attachments/e3bb5fd2-4aaf-463f-aeef-efd20478a4c5) - crashes in Blender 3.6.2 but not in 4.1.0
  • derived-crash-4.1.blend (https://projects.blender.org/attachments/139fef0f-1abe-4315-a7f1-af74f3a95bc0) - crashes in 4.1.0 but not in 3.6.2

The full Blender 3.2 Splash Screen scene (https://cloud.blender.org/p/gallery/629f23f908e12d4ff15241d3) crashes in both versions.
I've also tested the Blender 3.1 splash screen scene, and it does not crash in either version.

Update:
I checked 4.0.2. Both scenes render in 4.0.2, but the full scene still crashes. The original problem seems to have been only partially fixed, because the full scene still crashed because of shadow caustics. The new problem was introduced after 4.0.2.

Update 2:
I've tried various daily builds. The full Blender 3.2 splash screen scene renders in Blender 4.1.0 Alpha b5a9c98e04fa (https://projects.blender.org/blender/blender/commit/b5a9c98e04fa), but crashes in the next daily build 99f9084bee58 (https://projects.blender.org/blender/blender/commit/99f9084bee58).
Member

> Whatever was done between Blender 3.6.2 and 4.1.0 fixed original problem but seems to have introduced another one.

Maybe ROCm 6 support?

> The same scene, that I first noticed crashing, continues to crash, but the crash happen for different reasons on those two Blender versions

But is it still the Memory access fault by GPU node-1?
This has already been reported to ROCm and in the Arch repo:

  • https://github.com/ROCm/ROCm/issues/2930
  • https://gitlab.archlinux.org/archlinux/packaging/packages/blender/-/issues/6
Author

> But is it still the Memory access fault by GPU node-1?
> This has been reported already at ROCm and Arch repo:

I actually found working Blender builds in which the scene renders without issues, but those were several Blender 4.1.0 Alpha daily builds. Then another bug was introduced that makes the render of the same scene crash again. So whatever was causing the crash originally was fixed, but starting from the daily Blender 4.1.0 Alpha build 99f9084bee58 it started crashing again, with the same "Memory access fault by GPU node-1" error.

Also note that commit 99f9084bee58 is not necessarily the one that caused the regression, since I only tested daily builds. The actual commit should be somewhere after b5a9c98e04fa and up to 99f9084bee58.
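The window described above (after b5a9c98e04fa, up to and including 99f9084bee58) maps directly onto a git revision range. The following is a minimal sketch, assuming it is run inside a clone of the Blender repository that contains both commits; the helper name `rev_range` is made up for illustration.

```shell
#!/bin/sh
# Print the git revision range that covers every commit after GOOD
# up to and including BAD (git's "GOOD..BAD" notation).
rev_range() {
    printf '%s..%s\n' "$1" "$2"
}

GOOD=b5a9c98e04fa   # last daily build that rendered the full scene
BAD=99f9084bee58    # first daily build that crashed

# Inside a Blender checkout, list the candidate commits, oldest first.
# Anywhere else, just print the range so it can be copied.
if git rev-parse --verify --quiet "$BAD^{commit}" >/dev/null 2>&1; then
    git log --oneline --reverse "$(rev_range "$GOOD" "$BAD")"
else
    echo "Run inside a Blender clone; range: $(rev_range "$GOOD" "$BAD")"
fi
```

The same two hashes could also seed `git bisect start 99f9084bee58 b5a9c98e04fa`, although each bisect step would then need its own Blender build.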


My preliminary investigations show that this may be a compiler bug. I tried two rounds of compilation of Blender 4.0.2 (with the ROCm 6 patch applied), where the only difference is:

  • compile HIP code with LLVM 16 (corresponds to around ROCm 5.5 ~ 5.7)
  • compile HIP code with AMD's LLVM fork that's released with ROCm 6.0

Both are running on kernel 6.6.22-281.current and with ROCm 6.0 runtime. The former doesn't crash when rendering the classroom example but the latter crashes.


> But starting from daily Blender 4.1.0 Alpha release 99f9084bee58 it start crashing again, with the same "Memory access fault by GPU node-1" error.

Did Blender bump the ROCm version it is compiled with? ROCm 6.0 is known to cause the memory access fault, but ROCm 5 doesn't. IIRC the Blender CI had previously been compiling the daily builds against ROCm 5.5 or 5.7, which don't have the memory access fault.

> Both are running on kernel 6.6.22-281.current and with ROCm 6.0 runtime. The former doesn't crash when rendering the classroom example but the latter crashes.

Update: the full Blender 3.2 splash screen scene that @Tolei posted is in the same situation. With AMD's LLVM fork that comes with ROCm 6.0, Blender crashes with the memory access fault error, but with LLVM 16 it renders successfully.

I had to test with Blender 4.0.2 instead of 4.1 because my distro is being held at 4.0.2 due to some dependency issues. Once that gets resolved I'll test with Blender 4.1. I just wanted to post here so that other interested folks can do similar testing and confirm that this is a compiler bug introduced somewhere after LLVM 16.
Member

> Did Blender bump the ROCm version it is compiling with?

d2e91fb0d72fe565e6fcab9a1c071dce83aca0db: that is, Blender could be compiled with either ROCm 6 or 5 and run on either.

@GZGavinZhao ^
I'm not aware of the current situation though. @BrianSavery @brecht may know more about this.

> d2e91fb0d7 : That is Blender could be compiled with either ROCM 6 or 5 and run on either

@PratikPB2123 Sorry, I meant the CI environment that builds the release binaries and the Blender daily builds. Do you know which ROCm version is used there?
Member

Unfortunately, no :/


> I have to test with Blender 4.0.2 instead of 4.1 because my distro is being held to 4.0.2 due to some dependency issues. Once that gets resolved I'll test with Blender 4.1. Just wanted to post here so that other interested folks may want to do some similar testing and confirm that this is a compiler bug that got introduced somewhere after LLVM 16.

Just tested with Blender 4.1. Same behavior: when the HIP fatbins are compiled with LLVM 16, Blender renders fine, but when they are compiled with AMD's fork of LLVM for ROCm 6.0, I get the usual Memory access fault by GPU node-1 failure.
Member

Is this issue unique to RX 6600?


Reproduces for me on RX 6950 XT too

Member

what about 7000 series?


I have reproduced this behavior on gfx1032 (RX6600M), gfx90c (the iGPU of Ryzen 7 5800H), and gfx900 (Vega 10).


I'm going to bisect the rocm-6.0.x branch of AMD's LLVM fork to see if I can find where it went wrong.

I've also attached the fatbins compiled with LLVM 16, in case anyone is running a Blender version that crashes with the same "memory access fault" message and wants to drop them into the 4.1/scripts/addons/cycles/lib/ directory and test. With these fatbins you should no longer get errors. They're compiled for gfx803;gfx900;gfx902;gfx904;gfx906;gfx908;gfx90a;gfx90c;gfx1010;gfx1011;gfx1012;gfx1013;gfx1030;gfx1031;gfx1032;gfx1033;gfx1034;gfx1035;gfx1036;gfx1100;gfx1101;gfx1102;gfx1103.
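For anyone trying the drop-in, the copy step could be sketched as below. This is not an official procedure: the function name `install_fatbins`, the `.orig` backup suffix and the example paths are made up for illustration, and it only assumes that the replacement files follow the `kernel_<arch>.fatbin` naming used in this thread.

```shell
#!/bin/sh
# install_fatbins SRC_DIR LIB_DIR
# For every *.fatbin in SRC_DIR: keep a one-time backup of the file it is
# about to overwrite in LIB_DIR (saved as <name>.orig), then copy it in.
install_fatbins() {
    src=$1
    lib=$2
    for f in "$src"/*.fatbin; do
        [ -e "$f" ] || continue            # no matches: the glob stays literal
        name=$(basename "$f")
        if [ -e "$lib/$name" ] && [ ! -e "$lib/$name.orig" ]; then
            cp -p "$lib/$name" "$lib/$name.orig"
        fi
        cp "$f" "$lib/$name"
    done
}

# Example (hypothetical paths):
# install_fatbins ~/Downloads/fatbins ~/blender-4.1.0-linux-x64/4.1/scripts/addons/cycles/lib
```

Keeping the `.orig` copies makes it easy to switch back to the shipped kernels when comparing crash behavior or render times.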


PS. Crashes look like this (using the .blends from the OP):

Memory access fault by GPU node-1 (Agent handle: 0x7f1db8337e00) on address 0x7f1bf177e000. Reason: Page not present or supervisor privilege.

Or:

Memory access fault by GPU node-2 (Agent handle: 0x7f8d6c138400) on address 0x7f8ba597e000. Reason: Page not present or supervisor privilege.

Out of 8 permutations of 3 devices, only these two do not crash (screenshots: /attachments/c35573ed-2517-47d7-947d-45e36a3f9ae7 and /attachments/0ac90fac-4d59-4f91-bf66-a11c70344bd9).
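When comparing many runs or device combinations, it can help to reduce each fault line to the GPU node and the faulting address. A minimal sketch follows, assuming the message format is exactly the one quoted in this thread; `parse_faults` is a made-up helper name.

```shell
#!/bin/sh
# Read log text on stdin and print "<node> <address>" for every
# "Memory access fault by GPU node-N ... on address 0x..." line.
parse_faults() {
    sed -n 's/^Memory access fault by GPU node-\([0-9][0-9]*\) .* on address \(0x[0-9a-f]*\)\..*$/\1 \2/p'
}

# Example, using the headless invocation mentioned elsewhere in this thread:
# blender -b scene.blend -f 0 -- --cycles-device HIP 2>&1 | parse_faults
```

Lines that do not match the fault pattern are dropped, so the output is empty for a run that did not hit the fault.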
Member

@GZGavinZhao can you make sure the ROCm 6 runtime is loaded in Blender? I don't know about Linux, but on Windows the DLL names differ between pre-ROCm 6 and ROCm 6 (amdhip64.dll vs. amdhip64_6.dll, see https://projects.blender.org/blender/blender/src/commit/65bfae2258722c3aa5cd10cef7be908bffa6059b/extern/hipew/src/hipew.c#L236).
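On Linux, one way to check which HIP runtime a running Blender process has loaded is to look at its memory map. The sketch below assumes the runtime's file name starts with `libamdhip64.so` followed by a version suffix; that naming is an assumption about the ROCm packages and is not confirmed anywhere in this thread. `hip_libs_from_maps` is a made-up helper name.

```shell
#!/bin/sh
# Read the contents of /proc/<pid>/maps on stdin and print each distinct
# mapped file whose name starts with libamdhip64.so (assumed runtime name).
hip_libs_from_maps() {
    awk '{ print $NF }' | grep '/libamdhip64\.so' | sort -u
}

# Usage while Blender is rendering with HIP (pgrep -n picks the newest match):
# hip_libs_from_maps < "/proc/$(pgrep -n blender)/maps"
```

The version suffix in the printed path should indicate whether a ROCm 5 or a ROCm 6 runtime was picked up.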

I'd gladly test the custom libs, but after the recent XZ vuln, I'm a bit scared 😅
Do we have docs on how to set up a safe test environment with VMs/Docker/other in which GPU devices work as intended?

> I'm going to bisect the rocm-6.0.x branch of AMD's LLVM fork to see if I can find where went wrong.
>
> Also attached the fatbins compiled with LLVM16, in case anyone is running a Blender version that crashes with the same "memory access fault" message and wants to drop them into the 4.1/scripts/addons/cycles/lib/ directory and test. With these fatbins you should no longer get errors. They're compiled against gfx803;gfx900;gfx902;gfx904;gfx906;gfx908;gfx90a;gfx90c;gfx1010;gfx1011;gfx1012;gfx1013;gfx1030;gfx1031;gfx1032;gfx1033;gfx1034;gfx1035;gfx1036;gfx1100;gfx1101;gfx1102;gfx1103.

After placing the fatbin for gfx908 (4x Instinct MI100s) in the folder, standard renders no longer crash. The issue remains with shared memory turned on though, which I assume is due to the XGMI bridge being too spooky.

I also noticed that before, on the occasions when it did run, I would get insanely low render times but the image wouldn't denoise completely. Now with the fatbin it denoises and looks the same as on my NVIDIA card, but takes over twice as long (6.5s -> 14s).

@TNT3530 Glad to hear that standard renders no longer crash! LLVM 16 is a relatively old release compared to ROCm 6.0's LLVM, and I would assume many optimizations in AMD's fork are not upstreamed, so the performance degradation is sort of expected. I'm not sure about the shared memory issue though; I'm not too familiar with this area.

I bisected to commit https://github.com/ROCm/llvm-project/commit/30a3adf50e2d49dfc97c1b614d9b93638eba672d, which is weird because theoretically the order of optimization passes shouldn't affect the output.

I reverted this commit in the `rocm-6.0.x` branch, and when I use that to compile the HIP fatbins, Blender no longer crashes. I will file a bug in ROCm's LLVM repository.

Here's a quick way to reproduce, assuming you already have the ROCm development libraries installed:

  1. Clone AMD's LLVM fork. To reduce clone time, I've pushed the `rocm-6.0.x` branch along with the revert commit to my fork so you only need to clone with depth 2: `git clone https://github.com/GZGavinZhao/llvm-project --depth 2 --branch blender-rocm-6.0-repro`
  2. Enter `llvm-project`, then run the CMake setup and build:

```bash
cmake -G Ninja -B build \
     -DCMAKE_INSTALL_PREFIX=/usr \
     -S llvm \
     -DCMAKE_BUILD_TYPE=Debug \
     -DCMAKE_SKIP_RPATH=ON \
     -DLLVM_INCLUDE_TESTS=OFF \
     -DLLVM_ENABLE_PROJECTS="clang;lld;llvm" \
     -DLLVM_ENABLE_RUNTIMES="compiler-rt" \
     -DLLVM_TARGETS_TO_BUILD="X86;AMDGPU" \
     -DLLVM_BUILD_DOCS=OFF

cmake --build build
```

  3. Run `DESTDIR=$PWD/destdir ninja -C build install` to create a directory with a layout that `hipcc` can recognize.
  4. Clone Blender.
  5. In the Blender directory, check out the `v4.1.0` tag: `git switch --detach v4.1.0`
  6. Run `HIP_ROCCLR_PATH=<path-to-llvm-project>/destdir/usr HIP_CLANG_PATH=$HIP_ROCCLR_PATH/bin hipcc --offload-arch=<your_arch> --genco intern/cycles/kernel/device/hip/kernel.cpp -D CCL_NAMESPACE_BEGIN= -D CCL_NAMESPACE_END= -D HIPCC -I intern/cycles/kernel/.. -I intern/cycles/kernel/device/hip -ffast-math -o kernel_<your_arch>.fatbin -mcode-object-version=4 -Xclang -mcode-object-version=4`. Replace `<your_arch>` with the architecture of your GPU, e.g. `gfx1030`. If you want to run on multiple architectures, do steps 6 and 7 for each architecture separately.
  7. You should now have the fatbin for your GPU, `kernel_<your_arch>.fatbin`. Copy this file to the `4.1/scripts/addons/cycles/lib/` directory and test.
  8. You don't have to open Blender to test; running headless Blender is enough. After determining the desired device id using `rocm-smi`, you can run `HIP_VISIBLE_DEVICES=<device-id> blender -b <path-to-.blend-file> -f 0 -- --cycles-device HIP`. If you want to test multi-device rendering though, I think you have to do it with the GUI.

If none of this crashes, now undo the revert (i.e. go back to the current state of the branch) by running `git switch --detach HEAD~1` in `llvm-project`, and start again from step 3. This time Blender should crash with the "memory access fault" message.

Please let me know if you can reproduce this or not.
Issue filed at ROCm: https://github.com/ROCm/llvm-project/issues/58
Author

I tried replacing the fatbins with the ones provided by @GZGavinZhao. The full Blender 3.2 splash screen scene now renders without issues. Thanks.

I also tried building LLVM and Blender following @GZGavinZhao's instructions, but I ran into errors that I could not resolve and didn't manage to build LLVM, so I could not test this. But since just replacing the fatbins worked, I guess I don't need to.

If you're having trouble building LLVM due to compiler errors or insufficient RAM, try using the Clang compiler instead of GCC. To do this, use this CMake setup instead:

```
cmake -G Ninja -B build \
    -DCMAKE_C_COMPILER=clang \
    -DCMAKE_CXX_COMPILER=clang++ \
    -DCMAKE_INSTALL_PREFIX=/usr \
    -S llvm \
    -DCMAKE_BUILD_TYPE=Debug \
    -DCMAKE_SKIP_RPATH=ON \
    -DLLVM_INCLUDE_TESTS=OFF \
    -DLLVM_ENABLE_PROJECTS="clang;lld;llvm" \
    -DLLVM_ENABLE_RUNTIMES="compiler-rt" \
    -DLLVM_TARGETS_TO_BUILD="X86;AMDGPU" \
    -DLLVM_BUILD_DOCS=OFF
```

You can adjust `clang`/`clang++` to the path of your desired clang compiler, e.g. `clang-16`.
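
If it's unclear which clang binaries are installed, something like the following could pick one and print the two compiler flags for the CMake setup above. It is a rough sketch: the candidate names and the helpers `pick_clang` and `cxx_for` are made up for illustration, it only handles bare command names found on `PATH` (not full paths), and it assumes a versioned C++ driver is named `clang++-N`, as in Debian/Ubuntu-style packages.

```shell
#!/usr/bin/env bash
# Sketch: choose an installed clang and print the matching CMake compiler flags.

# Print the first candidate command that exists on PATH; fail if none does.
pick_clang() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Map a C compiler name to its C++ counterpart:
# clang -> clang++, clang-16 -> clang++-16 (assumed naming scheme).
cxx_for() {
    case "$1" in
        clang-*) printf 'clang++-%s\n' "${1#clang-}" ;;
        *)       printf '%s++\n' "$1" ;;
    esac
}

# The candidate list is illustrative; adjust it to whatever your distro ships.
if cc=$(pick_clang clang clang-17 clang-16 clang-15); then
    printf '%s\n' "-DCMAKE_C_COMPILER=$cc -DCMAKE_CXX_COMPILER=$(cxx_for "$cc")"
else
    echo "no clang found on PATH" >&2
fi
```

The printed flags replace the `-DCMAKE_C_COMPILER` and `-DCMAKE_CXX_COMPILER` lines in the command above; the rest of the setup stays the same.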

Just popped by to add my two cents, as I started having this issue today. Before today I couldn't use HIP at all because Blender didn't recognize my GPU as compatible. I powered on today to find many new updates to install, mostly AMD and HIP packages. After installing them I was able to enable HIP. However, Blender randomly crashes when I enable the Cycles rendered viewport: it renders a few passes and then crashes with the "Memory access fault by GPU node..." error. I was using the Blender ocean test file, if that matters. Also, turning off experimental features in the render settings seems to have stopped the crashing.

I have since tried to recreate the issue in a blend file of my own that doesn't use any "experimental" features (tbh, the adaptive subdivision modifier is the only one I know of), and I have not had any crashes, whether experimental is enabled or not.

Kernel: 6.5.0-35-generic x86_64 bits: 64 compiler: N/A
Desktop: Cinnamon 6.0.4 tk: GTK 3.24.33 wm: muffin vt: 7 dm: LightDM 1.30.0
Distro: Linux Mint 21.3 Virginia base: Ubuntu 22.04 jammy
Graphics:
Device-1: AMD Navi 21 [Radeon RX 6900 XT] vendor: XFX driver: amdgpu
v: 6.7.0 pcie: speed: 16 GT/s lanes: 16 ports: active: DP-2,HDMI-A-1
empty: DP-1,DP-3,Writeback-1 bus-ID: 0c:00.0 chip-ID: 1002:73af
class-ID: 0300

blender-4.1.1-linux-x64

Reference: blender/blender#112084