Segfault after F12 render with multi-GPU Intel ARC A770 and experimental oneAPI packages on Debian #110504

Open
opened 2023-07-26 18:52:59 +02:00 by Jakub Jaszewski · 40 comments

System Information
Operating system: Linux-6.4.0-1-amd64-x86_64-with-glibc2.37 (Debian sid)
Graphics card: two identical Intel ARC A770 Limited Edition 16GB

Blender Version
Broken: 3.6.1, 4.0.0 Alpha, branch: main, commit date: 2023-07-25 19:23, hash: aebc743bf1c7
Worked: 3.4, 3.4.1, 3.5.1

Short description of error
Segfault when rendering the Victor benchmark scene with F12 on the second Intel ARC A770. This happens only when the second GPU is chosen as the rendering device in Cycles. If both GPUs are selected, the first one seems to work correctly, but the second one causes a crash/error.
Embree on GPU does not seem to matter.

Both F12 and viewport renders with the second GPU give:
oneAPI kernel "shader_eval_curve_shadow_transparency" execution error: got runtime exception "Native API failed. Native API returns: -50 (PI_ERROR_INVALID_ARG_VALUE) -50 (PI_ERROR_INVALID_ARG_VALUE)"

Relevant info:
I'm using the following packages from experimental sources (I'm impatient, I know :) ), so this crash might very well be caused by their incompatibility with Debian sid:
libigc1 - 1.0.13822.1-1
libigdfcl1 - 1.0.13822.1-1
intel-opencl-icd - 23.13.26032.7-1
The rest of the packages, including all dependencies, are from Debian sid.
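(A quick sketch for anyone reproducing: confirming the installed versions with standard apt tooling:)

```
# Confirm the installed versions of the experimental oneAPI packages.
apt policy libigc1 libigdfcl1 intel-opencl-icd
```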

Full backtrace:

(gdb) backtrace full
#0  0x0000000001cfee77 in ?? ()
No symbol table info available.
#1  0x0000000001cffab9 in ?? ()
No symbol table info available.
#2  0x0000000001b0fa9c in ?? ()
No symbol table info available.
#3  0x0000000000f376a3 in ?? ()
No symbol table info available.
#4  0x00007fffebaa63ec in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
        ret = <optimized out>
        pd = <optimized out>
        out = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737147199792, -8317621660586406251, -286296, 11, 140737488344416, 140734520795136, 8317937541230038677, 8317594907036290709}, mask_was_saved = 0}}, priv = {
            pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#5  0x00007fffebb26a1c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
No locals.
(gdb)
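(For reference, a minimal sketch of capturing such a backtrace non-interactively - it assumes the blender binary is in the current directory, and the log file name is arbitrary:)

```
# Run Blender under gdb, reproduce the crash, and dump a full backtrace.
gdb -q -batch -ex run -ex 'backtrace full' ./blender 2>&1 | tee blender-crash.log
```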

Exact steps for others to reproduce the error

  1. On Debian sid, install the experimental packages mentioned above.
  2. Set the second ARC GPU as the rendering device in Blender.
  3. Open the Victor benchmark scene.
  4. Hit F12 and wait for the crash.
Jakub Jaszewski added the
Status
Needs Triage
Type
Report
Severity
Normal
labels 2023-07-26 18:52:59 +02:00
YimingWu added the
Platform
Linux
Interest
Render & Cycles
labels 2023-07-27 03:37:06 +02:00
Member

Not quite the same as #109195, but this one involves both Arc A770 GPUs.

@xavierh Could you take a look?

YimingWu added
Status
Needs Info from Developers
and removed
Status
Needs Triage
labels 2023-07-27 03:38:50 +02:00
Member

Thanks for the report. It's not like anything I've seen before, so I'll need a bit more debug info from your end.

Your system-info seems to list your GPU twice; is that also the case in the Preferences pane? What does sycl-ls from https://github.com/intel/llvm/releases/download/2022-12/dpcpp-compiler.tar.gz show? That's just a guess, but if Blender tries to use the same GPU twice, you may be running out of memory on it with the Victor scene.

If that's not what's happening:

  1. Is the issue happening only with Victor? Is it happening with and without Embree enabled in the system preferences? With 1 or 2 GPUs selected?
  2. Does your GPU have 8GB or 16GB?
  3. Can you run and share logs with the ZE_ENABLE_VALIDATION_LAYER=1 ZE_ENABLE_PARAMETER_VALIDATION=1 SYCL_PI_TRACE=-1 env vars as well as the --debug-cycles --verbose 4 options (see the sketch after this list)?
  4. Can you share dmesg output after the crash?
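(For item 3, a minimal sketch of such an invocation, assuming the blender binary is in the current directory; the log file name is arbitrary:)

```
# Launch Blender with Level Zero validation and SYCL PI tracing enabled,
# capturing the verbose Cycles debug output to a file.
ZE_ENABLE_VALIDATION_LAYER=1 ZE_ENABLE_PARAMETER_VALIDATION=1 SYCL_PI_TRACE=-1 \
  ./blender --debug-cycles --verbose 4 2>&1 | tee cycles-debug.log
```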
Xavier Hallade added
Status
Needs Information from User
and removed
Status
Needs Info from Developers
labels 2023-07-27 10:33:42 +02:00

Thanks for the reply.

I forgot to mention in the ticket - I'm running two A770s. And they are present in the Blender preferences panel. I can render with both of them in Blender 3.4 just fine (with non-experimental oneAPI packages, that is).

  1. No. It is happening with all scenes. Embree does not seem to make a difference; I tried on/off. I noticed that with smaller scenes (like the default cube) there is no hard crash and I can switch back to Workbench render and continue working in Blender. With more complicated scenes Blender can crash with either viewport or F12 render.

  2. 16GB (both).

3 & 4 - no problem. Both logs are from the Victor scene, Embree on GPU is off. I tried rendering on both GPUs.

Regarding sycl-ls, unfortunately I'm gonna need some handholding. After unpacking and running it straight with ./sycl-ls I get some missing libraries: ./sycl-ls: error while loading shared libraries: libsycl.so.6: cannot open shared object file: No such file or directory.
Maybe this is some missing -dev package that is available in the Debian repos? I generally don't install those unless necessary. I'm not gonna lie - reading the GitHub DPC++ compiler install/setup instructions feels a lil daunting :)

=============================================================================================

ED: This seems like a multi-GPU issue. I just checked and I can render Victor with one GPU, with and without Embree.

ED2: The crash happens when the second GPU device is chosen in the Cycles preferences. It does not matter whether the first one is checked or not. Rendering the default cube with both GPUs leaves a half-rendered image (attached screenshot), which makes sense if the first one is OK and the second is crashing. I will try to swap the GPUs on the motherboard to check if this changes anything. In case this is not in the logs - both GPUs are connected to PCIe x16 slots.

Jakub Jaszewski changed title from Segfault after F12 render with Intel ARC A770 and experimental oneAPI packages on Debian to Segfault after F12 render with 2x Intel ARC A770 and experimental oneAPI packages on Debian 2023-07-27 12:21:58 +02:00
Member

Thanks for the additional testing, we're getting closer to the potential root causes.

I can render with both of them with Blender 3.4 just fine (with non-experimental oneAPI packages that is).

Can you still render with both of them in Blender 3.4 with the experimental oneAPI packages?

To make sycl-ls work, add the lib path from the same package: LD_LIBRARY_PATH=../lib/ ./sycl-ls


Can you still render with both of them in Blender 3.4 with the experimental oneAPI packages?

Yes. I can render with the second GPU (and multi-GPU) in Blender versions 3.4, 3.4.1 and 3.5.1 too.
The crash/error only happens in versions 3.6.1 and newer.

Adding the lib path helped with libsycl, but now the executable complains about a lack of libtinfo.so.5.
Searching my system, there is only libtinfo.so.6.

Member

You can symlink libtinfo.so.6 to libtinfo.so.5 for the sake of this test; that should work. But no need to bother with this - you actually have two GPUs and both can be used by Blender 3.5, so there should be nothing else to learn from sycl-ls.

The second GPU works in Blender 3.5.1 with the exact same stack you've shared (libigc1 - 1.0.13822.1-1 / libigdfcl1 - 1.0.13822.1-1 /
intel-opencl-icd - 23.13.26032.7-1)?

That sounds like a SYCL/L0 runtime issue in that case. We did update it from 20221019 to 2022-12 for 3.6. There have been lots of other changes, but I don't see them as likely to cause this kind of issue. Can you give these two custom builds a try?

  • Blender 3.6.1 with embree3: https://ph0b.com/uploads/blender-3.6.1-git20230727.embree3dpcpp202212-x86_64.tar.gz
  • Blender 3.6.1 with embree3 and sycl 20221019: https://ph0b.com/uploads/blender-3.6.1-git20230727.embree3dpcpp20221019-x86_64.tar.gz

The second GPU works in Blender 3.5.1 with the exact same stack you've shared (libigc1 - 1.0.13822.1-1 / libigdfcl1 - 1.0.13822.1-1 / intel-opencl-icd - 23.13.26032.7-1)?

Correct.

Blender 3.6.1 with embree3 & SYCL 2022-12:

  • GPU1 - renders OK (viewport)
  • GPU2 - crash/error (viewport)
  • both - crash/error (viewport)

Blender 3.6.1 with embree3 & SYCL 20221019 (loading the render kernels takes a really long time here):

  • GPU1 - renders OK (viewport)
  • GPU2 - renders OK (viewport)
  • both - renders OK (viewport & F12)
    With this version I found some artifacts on Victor's trousers so this might be worth looking into later if it's not fixed already.
Member

Good, so it's really the SYCL compiler/runtime version that makes it break or not. I'll make some new builds with more recent versions to try, tomorrow or Monday.

Jakub Jaszewski changed title from Segfault after F12 render with 2x Intel ARC A770 and experimental oneAPI packages on Debian to Segfault after F12 render with multi-GPU Intel ARC A770 and experimental oneAPI packages on Debian 2023-07-28 11:58:29 +02:00
Member

Here is a build with a newer version of SYCL (close to 2023-WW27): https://ph0b.com/uploads/blender-3.6.1-git20230728.embree3dpcpp20230707-x86_64.tar.gz - can you let me know how it goes?


This build apparently requires libtinfo.so.5, and it throws the same complaint as sycl-ls did.
I tried making a symlink in the lib directory that points to /lib/x86_64-linux-gnu/libtinfo.so.6, but that didn't change anything, and I'm still getting the same error while loading shared libraries ....
I also tried symlinks to the different libtinfo versions from Steam and PyTorch that I have on my system, but the result is the same.

Member

Weird - when I experienced this on Rocky Linux, ln -s /usr/lib64/libtinfo.so.6 /usr/lib64/libtinfo.so.5 worked.
Anyway, I've checked a bit further: https://packages.debian.org/sid/libtinfo5 exists, so it'll be cleaner to just install it.
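(A sketch of that install, assuming standard apt tooling on Debian sid:)

```
# Install the compatibility package instead of symlinking.
sudo apt install libtinfo5
```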

Why didn't I think of that in the first place...
Anyway, after installing that lib and running renders I got a similar but different error with the second GPU:

oneAPI test kernel execution: got a runtime exception "Native API failed. Native API returns: -30 (PI_ERROR_INVALID_VALUE) -30 (PI_ERROR_INVALID_VALUE)"

-30 instead of -50, and no ARG.

Member

Now it's failing even with the simple test kernel; that will make debugging much easier... if it's the same bug :).
I need a bit more debug info to progress.

Please set these environment variables to get useful logging information:

export PrintDebugMessages=1
export NEOReadDebugKeys=1
export ZE_DEBUG=-1
export SYCL_PI_TRACE=-1

Then try the vector-add program from the attached archive for the 3 DPCPP versions, like this, so it directly targets your second GPU:
ONEAPI_DEVICE_SELECTOR=level_zero:1 LD_LIBRARY_PATH=./ ./vector-add

If these runs fail, then with the logs I should have enough to report the bug internally. If they all run fine, then, still with the same environment variables, I'll need the output logs for the two failing custom Blender builds, i.e. the one with DPCPP 2022-12 and the one with 20230707.

Thanks in advance!

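(A sketch of the full sequence, assuming the archive unpacks into one directory per DPCPP version - the directory names here are hypothetical:)

```
# Hypothetical per-version directories; adjust names to match the unpacked archive.
export PrintDebugMessages=1
export NEOReadDebugKeys=1
export ZE_DEBUG=-1
export SYCL_PI_TRACE=-1

for ver in 20221019 2022-12 20230707; do
  ( cd "vector-add-$ver" && \
    ONEAPI_DEVICE_SELECTOR=level_zero:1 LD_LIBRARY_PATH=./ ./vector-add \
      > "../vector-add-$ver.log" 2>&1 )
done
```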

I've run the test and it seems that 20230707 works correctly.

In the meantime I've done a couple more tests with the 20230707 build you linked before. In short, depending on the order in which the devices are turned on/off, I got different results. I'll post them in a moment.

20221019:

An exception is caught while adding two vectors. terminate called after throwing an instance of 'sycl::_V1::runtime_error' what(): No device of requested type available. -1 (PI_ERROR_DEVICE_NOT_FOUND) Aborted

2022-12:

Running on device: Intel(R) Arc(TM) A770 Graphics Vector size: 10000 [0]: 0 + 0 = 0 [1]: 1 + 1 = 2 [2]: 2 + 2 = 4 ... [9999]: 9999 + 9999 = 19998 Vector add successfully completed on device.

20230707:

Running on device: Intel(R) Arc(TM) A770 Graphics Vector size: 10000 [0]: 0 + 0 = 0 [1]: 1 + 1 = 2 [2]: 2 + 2 = 4 ... [9999]: 9999 + 9999 = 19998 Vector add successfully completed on device.


In the 20230707 build I found that, depending on the order in which the A770 devices are turned on/off in Preferences -> System -> Cycles Devices, I can trigger different results:

first render with both GPUs = oneAPI test kernel execution: got runtime exception ""
second render with GPU1 = oneAPI test kernel execution: got a runtime exception "Native API failed. Native API returns: -30 (PI_ERROR_INVALID_VALUE) -30 (PI_ERROR_INVALID_VALUE)"

first render with both GPUs = oneAPI test kernel execution: got runtime exception ""
second render with GPU2 = successful render
third render with GPU1 = oneAPI test kernel execution: got a runtime exception "Native API failed. Native API returns: -30 (PI_ERROR_INVALID_VALUE) -30 (PI_ERROR_INVALID_VALUE)"

first render with GPU1 = successful render
second render with GPU2 = oneAPI test kernel execution: got a runtime exception "Native API failed. Native API returns: -30 (PI_ERROR_INVALID_VALUE) -30 (PI_ERROR_INVALID_VALUE)"

first render with GPU2 = successful render
second render with GPU1 = instant segfault

first render with GPU1 = successful render
second render with both GPUs = oneAPI test kernel execution: got a runtime exception "Native API failed. Native API returns: -30 (PI_ERROR_INVALID_VALUE) -30 (PI_ERROR_INVALID_VALUE)"

first render with GPU2 = successful render
second render with both GPUs = instant segfault

With the -30 exceptions Blender is still responsive and I can switch to a different device or viewport shading, etc.

Member

That's... interesting. So with 20230707 you can render on each GPU once, but not on both concurrently, nor when switching for a next render.
Can you get the backtrace for the segfault below?

first render with GPU2 = successful render
second render with GPU1 = instant segfault

The logs for the 2022-12 build can still be helpful as well.


Sorry for the delay. I took the weekend off.

2022-12 doesn't segfault and leaves nothing in dmesg. All I could get was the Blender debug log. It's from rendering on both GPUs. I checked, and running only on GPU2 leaves identical debug output. I cannot get this build to crash, so no backtrace from gdb.

20230707:

you can render on each GPU once, but not on both concurrently, nor when switching for a next render.

Looks like it.
This case is really strange. I checked it again, and in this combination GPU2 works every time you set it, no matter what devices were chosen before:

first render with both GPUs = oneAPI test kernel execution: got runtime exception ""
second render with GPU2 = successful render
third render with GPU1 = ...-30 ...
fourth render with GPU2 = successful render

Can you get the backtrace for the segfault below?

No problem. It looks like that crash happens in the Level Zero loader. Just for the record - I have version 1.12.0-1 installed. As far as I can see, this Debian package doesn't have a newer version.
There is nothing in the dmesg output after this crash either - I double-checked it after power cycling the machine.

Hope this helps.

Member

The behavior with 20230707 is maybe not that much related to the issue with 2022-12. The crash is in the Level Zero runtime, not the loader, and according to your backtrace it happens during the first memory transfers. I have some idea of what's going on with this one (and maybe it's already fixed in a newer version), but I really need more logging from the 2022-12 build.
I understand it doesn't crash, but the logs you've provided are without the environment variables:

export PrintDebugMessages=1
export NEOReadDebugKeys=1
export ZE_DEBUG=-1
export SYCL_PI_TRACE=-1

Can you share logs with these? I'm specifically expecting logging of all lower-level calls to SYCL/Level Zero, like so:

---> piDeviceGetInfo(
	<unknown> : 000001BDA97B85A0
	<unknown> : 4162
	<unknown> : 8
	<unknown> : 000001BDA9768AE0
	<nullptr>
) ---> pi_result : PI_SUCCESS
[out]<unknown> ** : 000001BDA9768AE0[ 0000000000000000 ... ]

That will help in understanding what's happening right before the issue.


Can you share logs with these?

No problem. I thought I had enabled those env vars before and didn't need to set them again. My bad.

Member

Thanks, it's getting us closer to what went wrong. The test kernel ran fine with a similar command, but the next kernel failed:

---> piextKernelSetArgPointer(
	<unknown> : 0x7f2e08082280
	<unknown> : 1
	<unknown> : 8
	<unknown> : 0x7f2d9b13fb38
PI ---> piKernelSetArg(Kernel, ArgIndex, ArgSize, ArgValue)
ioctl(PRIME_FD_TO_HANDLE) failed with -1. errno=95(Operation not supported)
ZE ---> zeKernelSetArgumentValue(pi_cast<ze_kernel_handle_t>(Kernel->ZeKernel), pi_cast<uint32_t>(ArgIndex), pi_cast<size_t>(ArgSize), pi_cast<const void *>(ArgValue))
Error (ZE_RESULT_ERROR_INVALID_ARGUMENT) in zeKernelSetArgumentValue
) ---> pi_result : -50
[out]void * : 0x7f2d9b13fb38

The debug string comes from https://github.com/intel/compute-runtime/blob/02aa4b6accbef98c4291a0175fe61bf6bbbcb60a/shared/source/os_interface/linux/drm_memory_manager.cpp#L807
or
https://github.com/intel/compute-runtime/blob/02aa4b6accbef98c4291a0175fe61bf6bbbcb60a/shared/source/os_interface/linux/drm_memory_manager.cpp#L957

Going to consult with the compute-runtime devs.

Nikita Sirgienko was assigned by Xavier Hallade 2023-08-02 16:54:14 +02:00
Xavier Hallade self-assigned this 2023-08-02 16:54:14 +02:00
Xavier Hallade added
Status
Confirmed
and removed
Status
Needs Information from User
labels 2023-08-02 16:54:31 +02:00
Jesse Yurkovich added
Module
Render & Cycles
and removed
Interest
Render & Cycles
labels 2023-08-03 23:08:44 +02:00
Member

If you have a graphical environment, do you know if it's taking memory on GPU 2 specifically?
Can you monitor the graphics memory usage while trying to render on both with the 2022-12 sycl build?
xpu-smi from https://github.com/intel/xpumanager can output it every second from both devices:
xpu-smi dump -d 0,1 -m 1,5,18


XPU-SMI installation seems to be quite complicated on a Debian system, and there is actually an easier way to get information about used GPU memory - via the DRM sysfs files in /sys/class/drm/cardX. For memory we are interested in lmem_avail_bytes and lmem_total_bytes, which are the free GPU memory and the total GPU memory respectively.
I have made a little command which prints the amount of available device memory for a particular GPU every 0.1 seconds:
while true; do echo [$(date '+TIME:%T.%N')] GPU free memory: $(bc <<<"$(cat /sys/class/drm/card0/lmem_avail_bytes)*100/$(cat /sys/class/drm/card0/lmem_total_bytes)")%; sleep 0.1; done
I have only one Intel GPU in my system, which is why card0 is used here, but in the case of two GPUs I suppose there will be card0 and card1 in the /sys/class/drm/ directory.

@silex, in case you aren't able to install xpu-smi due to its dependency complexity, can you please check that you have card0 and card1 in /sys/class/drm/, and if so, run the command below while trying to render on both GPUs with the 2022-12 sycl build? (A more readable multi-line form is sketched right after this comment.)
The command: while true; do echo [$(date '+TIME:%T.%N')] GPU 1 free memory: $(bc <<<"$(cat /sys/class/drm/card0/lmem_avail_bytes)*100/$(cat /sys/class/drm/card0/lmem_total_bytes)")%, GPU 2 free memory: $(bc <<<"$(cat /sys/class/drm/card1/lmem_avail_bytes)*100/$(cat /sys/class/drm/card1/lmem_total_bytes)")%; sleep 0.1; done

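(The same two-GPU loop as a multi-line sketch for readability - it assumes the same lmem_* sysfs files exist, and uses shell arithmetic in place of bc:)

```
# Poll free VRAM on both cards every 0.1 s via the DRM sysfs files.
while true; do
  a0=$(cat /sys/class/drm/card0/lmem_avail_bytes)
  t0=$(cat /sys/class/drm/card0/lmem_total_bytes)
  a1=$(cat /sys/class/drm/card1/lmem_avail_bytes)
  t1=$(cat /sys/class/drm/card1/lmem_total_bytes)
  echo "[$(date '+TIME:%T.%N')] GPU 1 free memory: $((a0 * 100 / t0))%, GPU 2 free memory: $((a1 * 100 / t1))%"
  sleep 0.1
done
```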

Hi Nikita, thanks for the answer!

Indeed XPU-SMI on Debian can cause problems with dependencies.

I looked into /sys/class/drm/cardX. I have both card0 and card1 directories, but unfortunately inside there is no trace of lmem_avail_bytes or lmem_total_bytes. I also used locate and find, and there is no sign of any file in the system named that way.
I attached a screenshot of what the card0 directory contains. card1 is basically the same.

The first script you provided gives me:

cat: /sys/class/drm/card0/lmem_avail_bytes: No such file or directory
cat: /sys/class/drm/card0/lmem_total_bytes: No such file or directory
(standard_in) 1: syntax error

Interesting. Can you please try to install the intel-gpu-tools package from Debian sid and recheck whether the DRM files appear?


It's even weirder, because I have it installed (version 1.27.1-1).
System is up to date.


Weird, but maybe the DRM controls have changed for some reason in recent kernels like Linux 6.4.0 (I am using a special Linux 5.15 kernel with Intel GPU support embedded). Can you please check whether you can see the "/sys/class/drm/device" and "/sys/class/drm/card0-DP-1/device" directories?

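(A quick sketch of that check, assuming a standard sysfs layout:)

```
# Check whether the DRM device directories exist and where the symlinks point.
ls -ld /sys/class/drm/device /sys/class/drm/card0-DP-1/device
```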

There is no /sys/class/drm/device directory.

/sys/class/drm/card0-DP-1/device exists, but it's a symlink to /sys/class/drm/card0.

I also double-checked whether anything related to GPU memory is in /sys/class/hwmon, but I haven't found anything.


I see - it seems this functionality is for some reason not exposed in Linux 6.4.0 with Intel GPUs. I will try to take a look at the Debian installation process for xpu-smi in the next few days; maybe I will figure out some easy way to get it on a Debian-based setup.


Unfortunately, I haven't found any reasonably easy way to install the software on Debian due to all its dependencies.
So probably the easiest way here is to check whether this information is even exposed by the driver/kernel, via a simple program like this:

#include <iostream>
#include <sycl/sycl.hpp>
#include <chrono>
#include <thread>
#include <cstring>
#include <vector>
#include <cstdarg>
#include <cassert> /* for assert() in string_printf */
#include <ctime>

#ifdef _WIN32
#  ifndef vsnprintf
#    define vsnprintf _vsnprintf
#  endif
#endif /* _WIN32 */

void print_date_and_time()
{
  time_t now = time(0);

  char str[126];
  struct tm *p = localtime(&now);
  strftime(str, sizeof str, "%H:%M:%S", p);

  std::cout << str;
}

std::string string_printf(const char *format, ...)
{
  std::vector<char> str(128, 0);

  while (1) {
    va_list args;
    int result;

    va_start(args, format);
    result = vsnprintf(&str[0], str.size(), format, args);
    va_end(args);

    if (result == -1) {
      /* not enough space or formatting error */
      if (str.size() > 65536) {
        assert(0);
        return std::string("");
      }

      str.resize(str.size() * 2, 0);
      continue;
    }
    else if (result >= (int)str.size()) {
      /* not enough space */
      str.resize(result + 1, 0);
      continue;
    }

    return std::string(&str[0]);
  }
}

std::string string_human_readable_size(size_t size)
{
  static const char suffixes[] = "BKMGTPEZY";

  const char *suffix = suffixes;
  size_t r = 0;

  while (size >= 1024) {
    r = size % 1024;
    size /= 1024;
    suffix++;
  }

  if (*suffix != 'B')
    return string_printf("%.2f%c", double(size * 1024 + r) / 1024.0, *suffix);
  else
    return string_printf("%zu", size);
}


int main()
{
  const std::vector<sycl::platform> &oneapi_platforms = sycl::platform::get_platforms();

  std::vector<sycl::device> available_devices;
  for (const sycl::platform &platform : oneapi_platforms) {
    if (platform.get_backend() == sycl::backend::opencl) {
      continue;
    }

    const std::vector<sycl::device> &oneapi_devices = platform.get_devices(sycl::info::device_type::gpu);

    for (const sycl::device &device : oneapi_devices) {
      bool filter_out = false;
      if (true) {
        /* For now we support all Intel(R) Arc(TM) devices and likely any future GPU,
         * assuming they have either more than 96 Execution Units or not 7 threads per EU.
         * Official support can be broaden to older and smaller GPUs once ready. */
        if (!device.is_gpu() || platform.get_backend() != sycl::backend::ext_oneapi_level_zero) {
          filter_out = true;
        }
        else {
          /* Filtered-out defaults in-case these values aren't available. */
          int number_of_eus = 96;
          int threads_per_eu = 7;
          if (device.has(sycl::aspect::ext_intel_gpu_eu_count)) {
            number_of_eus = device.get_info<sycl::ext::intel::info::device::gpu_eu_count>();
          }
          if (device.has(sycl::aspect::ext_intel_gpu_hw_threads_per_eu)) {
            threads_per_eu =
                device.get_info<sycl::ext::intel::info::device::gpu_hw_threads_per_eu>();
          }
          /* This filters out all Level-Zero supported GPUs from older generation than Arc. */
          if (number_of_eus <= 96 && threads_per_eu == 7) {
            filter_out = true;
          }
        }
      }
      if (!filter_out) {
        available_devices.push_back(device);
      }
    }
  }

  while (true) {
    std::cout << "[";
    print_date_and_time();
    std::cout << "] ";
    for (const sycl::device &device : available_devices) {
      const std::string &name = device.get_info<sycl::info::device::name>();
      std::cout << name << ": ";
      if (device.has(sycl::aspect::ext_intel_free_memory))
        std::cout << string_human_readable_size(
            device.get_info<sycl::ext::intel::info::device::free_memory>());
      else
        std::cout << "unknown bytes";
      std::cout << "; ";
    }
    std::cout << std::endl;
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }

  return 0;
}

@silex, can you please try to compile and run this small program with these steps/commands?

$ <get the source code above into a file named main.cpp>
$ svn checkout https://svn.blender.org/svnroot/bf-blender/trunk/lib/linux_x86_64_glibc_228/dpcpp
$ export PATH=$PWD/dpcpp/bin:$PATH
$ export LD_LIBRARY_PATH=$PWD/dpcpp/lib/:$LD_LIBRARY_PATH
$ clang++ -fsycl main.cpp -o main && ./main

If the kernel/driver exposes this information, then this application will be able to get it through the Level Zero software layer and you will see the amount of free memory in the program output. If that is not the case, you will see "unknown bytes" messages.


Unfortunately I'm getting unknown bytes. I assume we will need to wait for a new Linux kernel with this function patched?

Member

It's possible something else is missing at a higher level: can you re-run the binary with the ZES_ENABLE_SYSMAN=1 environment variable set and see if that helps getting a value?

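(A sketch of the re-run, assuming the build steps from the earlier comment:)

```
# Re-run the monitor with Level Zero Sysman enabled so free-memory queries work.
ZES_ENABLE_SYSMAN=1 LD_LIBRARY_PATH=$PWD/dpcpp/lib/ ./main
```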

That worked. Now I'm getting: [12:42:46] Intel(R) Arc(TM) A770 Graphics: 15.91G; Intel(R) Arc(TM) A770 Graphics: 15.91G;

Member

Great! By running this tool in parallel, can you then share how much GPU memory is left on both GPUs during the failure?


I've run the VRAM monitoring script from a separate directory than the Blender builds, and no matter what scene or load I test, the VRAM counter sits at 15.91G.
I tried the benchmark scene, heavy production scenes, and also opening lots of YT videos.

Member

Hi! We've been able to reproduce it internally, and it doesn't seem related to high VRAM usage - it's in the queue to get debugged further.


Just a heads up - the graphics compiler (libigc1) has been upgraded in Debian unstable. Unfortunately there is a conflict, and the upgrade removes intel-opencl-icd and makes it uninstallable: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1051189
I've put libigc1 and libigdfcl1 on hold for the time being.

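(For reference, a sketch of pinning packages this way with standard apt tooling:)

```
# Keep the graphics compiler packages at their current versions.
sudo apt-mark hold libigc1 libigdfcl1
```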

New packages were just uploaded to Debian unstable, and this bug, https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1051189, is fixed:

libigc1 - 1.0.15136.3-1
libigdfcl1 - 1.0.15136.3-1
intel-opencl-icd - 23.35~git20230926-1+b1
libze1 - 1.13.5-1
libze-intel-gpu1 - 23.35~git20230926-1+b1

After the upgrade I cannot see any difference when it comes to the main bug with multi-GPU rendering. So I guess it is still being worked on.


Some time ago I was checking Intel's repo for any signs of movement on this bug, and I found the following commit that might be connected: https://github.com/intel/compute-runtime/commit/f8eefbd02095c0245e78a74cebfa0802b23e7c9f

Member

This bug is still in the queue; I've asked for it to get a bit more priority.
You're very welcome to try new updates when they land, in case they happen to fix this bug. If the submitted issue gets explicitly worked on and fixed, I'll be notified and will give an update here.


Thanks!

Brecht Van Lommel added
Type
Bug
and removed
Type
Report
labels 2024-06-14 16:00:29 +02:00