Overlay drawing issues on Intel HD4400/4600 (Only?) #111162

Closed
opened 2023-08-16 05:57:14 +02:00 by YimingWu · 50 comments
Member

This report tries to categorize recent bugs around weird overlay drawings for camera, light, empty (including axis) objects.

They all seem to happen with integrated graphics or virtualized GPU drivers.

  • #111161 #106445 #104706 #103038 #111497 Camera draws like a square box. (linux Mesa Intel(R) HD Graphics 4400/4600 (HSW GT2))
    • Mesa 23.1.4 resolved (some of) the issue for Blender 4.0 (on 4600). For Blender 3.6.x the issue persists
  • #111025 Object axis display broken, all aligned. (linux Mesa Intel(R) HD Graphics 4600 (HSW GT2))
    • Switching nvidia-prime to the NVIDIA GPU solved the issue; the iGPU is still broken.
  • #109780 Overlays not displayed for UV seams, freestyle edges and sharp edges (Mesa Intel(R) HD Graphics 4600 (HSW GT2) Intel 4.6 (Core Profile) Mesa 23.0.4)
  • #110636 Camera focal length and light size gizmo draws on mirrored coordinates (as if the viewport is inverted in Y direction in screen space). (D3d12 Software GL, WSL) Resolved using llvmpipe driver under WSL.

UPDATE:

Comment https://projects.blender.org/blender/blender/issues/111162#issuecomment-1004676 seems to confirm that this artefact is related to the introduction of OneAPI, or at least involves the WITH_CYCLES_DEVICE_ONEAPI option.


HD Graphics 4400/4600 (HSW GT2) seems particularly troublesome. Mesa does have a bug report (https://gitlab.freedesktop.org/mesa/mesa/-/issues/9290), but that report doesn't really pinpoint the cause. Some change in Blender probably triggered a corner case in the Mesa driver.

Likely broken between 3.5 and 3.6, since 3.5 worked fine on those supposedly "broken" Mesa drivers.

YimingWu changed title from Overlay drawing issues on iGPUs and some software OpenGL drivers. to Overlay drawing issues on Intel HD4400/4600 (Only?) 2023-08-20 15:34:47 +02:00
Author
Member

I reported this problem to Mesa (https://gitlab.freedesktop.org/mesa/mesa/-/issues/9656), but I'm not sure it will help given the very limited information available.


Just to say that I observe all the above-mentioned artefacts on my hardware: Mesa Intel(R) HD Graphics 4600 (HSW GT2)/Mesa 23.0.4 and Mesa 23.1.5, Manjaro Linux.

Also, compiling Blender on this hardware solves all the issues for me - the broken camera, light and axis gizmos and the missing overlays (one may have to clear the Mesa shader cache).


And also playing with apitrace:

  • I get a trace for Blender 3.6.0 (and the issue is there).
  • I clear the Mesa shader cache.
  • I use apitrace dump-images to "replay" the trace; the resulting images don't show the issue.
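The three steps above can be scripted so that others can repeat them. This is only a minimal sketch: it assumes `apitrace` is on the PATH and that the Mesa shader cache lives in `~/.cache/mesa_shader_cache`, and the file names `blender.trace` and `frame-` are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def trace_commands(blender_bin, trace_file="blender.trace", image_prefix="frame-"):
    """Return the two apitrace invocations: record a trace, then dump images from it."""
    record = ["apitrace", "trace", "--api", "gl", "--output", trace_file, blender_bin]
    replay = ["apitrace", "dump-images", "--output", image_prefix, trace_file]
    return record, replay


def clear_shader_cache(cache_dir=Path.home() / ".cache" / "mesa_shader_cache"):
    """Delete the cache directory (assumed location) so shaders get recompiled."""
    shutil.rmtree(cache_dir, ignore_errors=True)


def record_and_replay(blender_bin):
    """Record a session, drop the shader cache, then replay the trace to images."""
    record, replay = trace_commands(blender_bin)
    subprocess.run(record, check=True)  # reproduce the artefact, then quit Blender
    clear_shader_cache()
    subprocess.run(replay, check=True)  # writes snapshot images using the prefix
```

Calling `record_and_replay("./blender")` from the extracted Blender folder would then leave the replayed images next to the trace for comparison with what the live viewport showed.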

Does that mean that something somewhere is affecting the way shaders are compiled, but it's not GL code?

Sorry if my comment is useless, I don't know enough to be sure it isn't.

Author
Member

> Also, compiling Blender on this hardware solves all the issues for me - the broken camera, light and axis gizmos and the missing overlays (one may have to clear the Mesa shader cache).

That does sound very interesting.

> I use apitrace dump-images to "replay" the trace; the resulting images don't show the issue.

This must mean that apitrace compiles the shaders differently than Blender? (Or is it using software rendering? I had problems with RenderDoc a while back: my program showed nothing because of OpenGL errors, while RenderDoc displayed it normally, since RenderDoc replays the API calls with a software rasterizer or something like that.)

@aperitero if you can compile Blender, see if you can add a bit of code to get glDebugMessageCallback working with it. Details are at https://www.khronos.org/opengl/wiki/Debug_Output, but since your own compiled version renders correctly, I don't know whether you will find anything useful. Ideally we would add some debug output to the buildbot builds, but then it would be hard for you to set breakpoints 🤔


> This must mean that apitrace compiles the shaders differently than Blender? (Or is it using software rendering? I had problems with RenderDoc a while back: my program showed nothing because of OpenGL errors, while RenderDoc displayed it normally, since RenderDoc replays the API calls with a software rasterizer or something like that.)

I also thought it could be software rendering, but it seems that apitrace uses the same GPU backend when replaying the trace (at least that's what LIBGL_DEBUG=verbose shows).


Another observation:

  • my self-compiled version of Blender (which does not show the bugs) was compiled with WITH_CYCLES_DEVICE_ONEAPI=OFF and WITH_CYCLES_ONEAPI_BINARIES=OFF because I got an error with oneAPI that I couldn't fix, so I simply disabled it.
  • the bugs first appeared in Blender 3.4.0. Comparing the official 3.3.0 and 3.4.0 releases, I see that 3.4.0 is the first one to ship with oneAPI (there is "libcycles_kernel_oneapi_aot.so" in the lib folder).

So is it possible that oneapi somehow affects the shader compilation stack?
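For anyone who wants to check which of their extracted Blender releases ship the oneAPI kernel library, here is a small sketch. The glob pattern is an assumption based on the file name quoted above, and the folder names in the usage note are hypothetical.

```python
from pathlib import Path


def oneapi_kernel_libs(blender_dir):
    """List Cycles oneAPI kernel libraries found in <blender_dir>/lib, sorted by name."""
    lib_dir = Path(blender_dir) / "lib"
    if not lib_dir.is_dir():
        return []
    # Pattern is an assumption derived from "libcycles_kernel_oneapi_aot.so".
    return sorted(p.name for p in lib_dir.glob("libcycles_kernel_oneapi*.so*"))
```

Running it over two extracted archives, for example `oneapi_kernel_libs("blender-3.3.0-linux-x64")` and `oneapi_kernel_libs("blender-3.4.0-linux-x64")`, should show at a glance whether the library is present in each.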

Author
Member

@Sergey Is it possible to ask buildbot to build with custom configuration like WITH_CYCLES_DEVICE_ONEAPI=OFF?


@ChengduLittleA You can create a PR and modify build_files/config/pipeline_config.yaml, where you set cmake -> default -> overrides to something like overrides: { "WITH_CYCLES_DEVICE_ONEAPI": "OFF" } (I'm just typing it, so you may need to check that this is valid YAML syntax).
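On the YAML validity question: a flow mapping with double-quoted keys and values is also valid JSON, so Python's standard `json` module can produce the value with the quoting intact. The quoting matters because YAML 1.1 parsers read a bare OFF as a boolean rather than as the string CMake expects. A minimal sketch (the helper name `overrides_line` is made up, and where exactly the line goes in pipeline_config.yaml is not checked here):

```python
import json


def overrides_line(options):
    """Render CMake overrides as a YAML flow mapping with every value quoted."""
    return "overrides: " + json.dumps(options)


print(overrides_line({"WITH_CYCLES_DEVICE_ONEAPI": "OFF"}))
```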


@ChengduLittleA Thanks for the build! I've just tested it and I confirm that the bugs are absent.

Author
Member

@aperitero That's some really interesting info. At least we now know it has something to do with OneAPI.

It possibly involves some #ifdefs in the code. I'm not sure whether any of them affect the OpenGL side, but I still suspect that a different code path for matrices and the like is what causes these artefacts.

Member

Blender 3.3 already had oneAPI support, but it was linked differently if I remember correctly. The bug sounds more like a runtime symbol conflict, with the graphics compiler maybe loading earlier in a oneAPI build?

Can you run a regular build with this environment variable:
export ONEAPI_DEVICE_SELECTOR="!*:cpu"

and another time with these environment variables:

export NEOReadDebugKeys=1
export DisableDeepBind=1

?


@xavierh

> Can you run a regular build with this environment variable:

Sorry, it's not clear to me. Do you mean run blender (from a regular build) with these variables? Or compile it with them?

If you meant run blender from a regular build, I tried both options and I see no difference.

Member

Yes, I meant running Blender from a regular build. These commands set environment variables in the terminal you're using, so you then need to launch Blender from that same terminal. Can you confirm they make no difference?
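The same idea can be expressed without relying on a particular terminal session: start Blender as a child process with a modified copy of the environment. A minimal sketch, where the variable values are the ones suggested earlier in this thread and the helper names are made up:

```python
import os
import subprocess

# The two variable sets suggested in this thread.
CPU_ONLY = {"ONEAPI_DEVICE_SELECTOR": "!*:cpu"}
NO_DEEP_BIND = {"NEOReadDebugKeys": "1", "DisableDeepBind": "1"}


def env_with(overrides):
    """Return a copy of the current environment with the given variables added."""
    env = dict(os.environ)
    env.update(overrides)
    return env


def run_blender(blender_bin, overrides):
    """Start Blender as a child process that inherits the modified environment."""
    return subprocess.run([blender_bin], env=env_with(overrides)).returncode
```

For example, `run_blender("./blender", CPU_ONLY)` followed by `run_blender("./blender", NO_DEEP_BIND)` covers the two runs requested above, and the variables never leak into the parent shell.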


Yes, I confirm, unfortunately they make no difference.


OK, I managed to self-compile a version of Blender that has the bug (so it includes oneAPI). But I have no idea what I could try in order to understand where exactly the problem comes from. If you have any suggestions, please let me know; I'll try my best.

Member

Your previous test gives no proof that it's a symbol conflict, but I'm struggling to find another explanation at the moment. Still, HD4000 is too old to have Level Zero support, so whether or not Blender tries to use oneAPI shouldn't even make a difference at runtime.

Can you remove libpi_level_zero.so from a regular Blender build and try again?
If there is still no change, can you get logs from a run of a regular build and of the same version without oneAPI, with the LD_DEBUG=bindings environment variable set in both cases, and share/compare the output?


Removing libpi_level_zero.so made no difference.

However, LD_DEBUG seems to give interesting results (see attached diff): if I understand correctly, there are some conflicting symbols between libOpenColorIO and libcycles_kernel_oneapi_aot.so.
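A plain text diff of two LD_DEBUG=bindings logs is noisy because process IDs and load order differ between runs. A small sketch that instead reports only the symbols whose providing library changed; the regular expression assumes the usual glibc "binding file A [0] to B [0]: normal symbol `name'" line format and library paths without spaces:

```python
import re

# Matches glibc's LD_DEBUG=bindings lines (format assumed, paths without spaces).
BINDING = re.compile(
    r"binding file (?P<user>\S+) \[\d+\] to (?P<provider>\S+) \[\d+\]: "
    r"\w+ symbol `(?P<symbol>[^']+)'"
)


def providers(log_lines):
    """Map (requesting library, symbol) to the library the linker bound it to."""
    table = {}
    for line in log_lines:
        match = BINDING.search(line)
        if match:
            table[(match["user"], match["symbol"])] = match["provider"]
    return table


def changed_bindings(good_log, bad_log):
    """Yield bindings present in both logs whose provider differs."""
    good, bad = providers(good_log), providers(bad_log)
    for key in sorted(good.keys() & bad.keys()):
        if good[key] != bad[key]:
            yield key, good[key], bad[key]
```

Both arguments can be open file objects. Setting `LD_DEBUG_OUTPUT` to a file prefix alongside `LD_DEBUG=bindings` makes the dynamic linker write each run to its own file (with the process ID appended), which keeps the logs separate from Blender's own console output.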

Member

There are symbols that previously came from libOpenColorIO and now come from libcycles_kernel_oneapi_aot. It's hard to say yet whether they conflict, but it at least seems worth digging further in that direction!

What's the behavior (and LD_DEBUG diff) with each of these additional environment variables?

LD_PRELOAD=/usr/lib/libLLVM-15.so
LD_PRELOAD=/usr/lib/libLLVM-15.so:blender-4.0.0-alpha/lib/libOpenColorIO.so.2.2

Hi Xavier, so I tried both, but with no results - same visual bugs…

I attached the diff:

  • oneapi vs no oneapi, same ld_preload
  • oneapi with ld_preload vs no ld_preload.

I don't know if it's useful: when I replace "libcycles_kernel_oneapi_aot.so" with an empty library file, it also makes the bugs disappear. Here is attached the diff of running regular blender build with and without this library.


@aperitero

> I don't know if it's useful: when I replace "libcycles_kernel_oneapi_aot.so" with an empty library file, it also makes the bugs disappear. Here is attached the diff of running regular blender build with and without this library.

Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
HD Intel® 4400
Extended renderer info (GLX_MESA_query_renderer):
Vendor: Intel (0x8086)
Device: Mesa Intel(R) HD Graphics 4400 (HSW GT2) (0xa16)
Version: 22.2.5
Accelerated: yes
Video memory: 1536MB
Unified memory: yes
Preferred profile: core (0x1)
Max core profile version: 4.6
Max compat profile version: 4.6
Max GLES1 profile version: 1.1
Max GLES[23] profile version: 3.2
OpenGL vendor string: Intel
OpenGL renderer string: Mesa Intel(R) HD Graphics 4400 (HSW GT2)
OpenGL core profile version string: 4.6 (Core Profile) Mesa 22.2.5-0ubuntu0.1~22.04.3
OpenGL core profile shading language version string: 4.60
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

OpenGL version string: 4.6 (Compatibility Profile) Mesa 22.2.5-0ubuntu0.1~22.04.3
OpenGL shading language version string: 4.60
OpenGL context flags: (none)
OpenGL profile mask: compatibility profile

OpenGL ES profile version string: OpenGL ES 3.2 Mesa 22.2.5-0ubuntu0.1~22.04.3
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20

Linux 6.2.0-26-generic

Blender 3.4.1, 3.6.3, 4.0.0

I tried it and still the same problem

Member

> I don't know if it's useful: when I replace "libcycles_kernel_oneapi_aot.so" with an empty library file, it also makes the bugs disappear. Here is attached the diff of running regular blender build with and without this library.

I'm surprised to hear this is enough to make the bug disappear. I'm unsure what conclusion I can draw from this test, though it's definitely useful if true :|

Thanks for the additional logs.
I'm not seeing any issue at first glance, but there is one interesting difference: aren't you getting "GHOST: Wayland: unable to connect to display!" in the bad case only?


@Bernard-Antonio-Obando-Mena

> I tried it and still the same problem

You may need to clear the Mesa shader cache before you start Blender. On my distribution (Manjaro) it's in the "~/.cache" folder:
rm -rf ~/.cache/mesa_shader_cache/*
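Since the cache is not always in "~/.cache", here is a sketch that lists the places worth checking. It reflects my understanding of how Mesa picks the directory (`MESA_SHADER_CACHE_DIR` first, then `XDG_CACHE_HOME`, then the home directory); treat that order and the `mesa_shader_cache` subfolder name as assumptions that may not hold for every Mesa version.

```python
import os
from pathlib import Path


def shader_cache_candidates(environ=None):
    """Return directories where Mesa may keep its shader cache, most specific first."""
    env = os.environ if environ is None else environ
    candidates = []
    if env.get("MESA_SHADER_CACHE_DIR"):
        candidates.append(Path(env["MESA_SHADER_CACHE_DIR"]) / "mesa_shader_cache")
    if env.get("XDG_CACHE_HOME"):
        candidates.append(Path(env["XDG_CACHE_HOME"]) / "mesa_shader_cache")
    home = Path(env["HOME"]) if "HOME" in env else Path.home()
    candidates.append(home / ".cache" / "mesa_shader_cache")
    return candidates


for path in shader_cache_candidates():
    print(path, "(exists)" if path.is_dir() else "(not found)")
```

The function only lists directories and deletes nothing, so it is safe to run before deciding which folder to empty.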


@xavierh

> I'm not seeing any issue at first glance, but there is one interesting difference: aren't you getting "GHOST: Wayland: unable to connect to display!" in the bad case only?

Well, I don't think so. I have this message only in that 4.0.0-alpha build (though I don't know why the build without oneapi doesn't show it), but I have other versions that show the bug without this message (and if I use a dummy "libcycles_kernel_oneapi_aot.so" with that build, I still have the message but not the bugs).


> @Bernard-Antonio-Obando-Mena
>
> > I tried it and still the same problem
>
> You may need to clear the Mesa shader cache before you start Blender. On my distribution (Manjaro) it's in the "~/.cache" folder:
> rm -rf ~/.cache/mesa_shader_cache/*

It works. I'll resume the work I was doing in 3.3.10 and see if any more issues arise.

Thank you so much.


In case some other people want to try the dummy "libcycles_kernel_oneapi_aot.so" trick:

  1. Download Blender (it's also possible to modify the Blender already installed on your system: locate "libcycles_kernel_oneapi_aot.so" and use sudo where appropriate).
  2. Extract the archive somewhere.
  3. Use the following commands (replace [folder] with the path to the folder where you extracted Blender):
cd [folder]/lib
mv libcycles_kernel_oneapi_aot.so libcycles_kernel_oneapi_aot.so.orig
echo "" | clang++ -shared -fPIC -x c++ - -o libcycles_kernel_oneapi_aot.so.dummy
ln -s libcycles_kernel_oneapi_aot.so.dummy libcycles_kernel_oneapi_aot.so
  4. You may need to clear the Mesa shader cache. On some distributions it's in "~/.cache/mesa_shader_cache", but it might be somewhere else:
    rm -rf ~/.cache/mesa_shader_cache/*
  5. Run Blender from the [folder].
Author
Member

So apparently it is some problem with libcycles_kernel_oneapi_aot.so? I'm glad you found a workaround that works, but I guess the oneAPI part should still be resolved on the Intel side.

Member

Thanks for the confirmation it works reliably with the dummy lib.

Can you check whether this build: https://builder.blender.org/download/patch/blender-4.0.0-alpha+main-PR111606.e3a0c72c94f3-linux.x86_64-release.tar.xz (from this WIP PR #111606) works for you?


@xavierh

> Can you check whether this build: https://builder.blender.org/download/patch/blender-4.0.0-alpha+main-PR111606.e3a0c72c94f3-linux.x86_64-release.tar.xz (from this WIP PR #111606) works for you?

Thanks Xavier. Unfortunately it still doesn't work :(.

I attached the bindings diff between running this build with the normal lib (for some reason it's now the JIT version) and the dummy lib.

Member

Sorry, I initially posted a link to another build (PR110656 instead of PR111606) and edited it right after posting... not quickly enough for you, apparently :) Can you try the PR111606 one?

Patch builds use JIT for oneAPI in order to save some GPU compile time. That's expected and shouldn't change the outcome (and your test with that other build confirmed it!).


Unfortunately the bugs are still in the PR111606 build. I attached the diff (between normal lib and dummy lib). I don't get it, I hope I'm not doing anything wrong when I'm trying, but what could I do wrong?

Somehow there are still some symbols bound to libcycles instead of libOpenColorIO. Using LD_PRELOAD shows that the bindings are removed but not the bugs.

Is it possible that it's not a symbol conflict but something else? Upon loading, can libcycles call something in the GPU/shader pipeline (apparently not through OpenGL, though) that would affect the subsequent compilation of shaders?

Member

It turns out that -fvisibility=hidden isn't enough to hide the STL symbols, so the test build didn't do much. But if using LD_PRELOAD aligns the bindings without fixing the issue, you're right that it should be something other than symbols... but what?

Some device enumeration could call into the GPU pipeline, but all of this happens from here: https://projects.blender.org/blender/blender/src/branch/main/intern/cycles/device/oneapi/device_impl.cpp, which isn't part of the Cycles kernel library and isn't even exposed through sycl/level-zero to HD4000.

The kernel lib source code is here: https://projects.blender.org/blender/blender/src/branch/main/intern/cycles/kernel/device/oneapi/kernel.cpp. It does nothing fancy upon loading.

It's hard to guess what's going wrong now... maybe the loading order of this lib's dependencies causes trouble? Blender already links in the exact same libs, so that would be weird, but I'm not sure what to try next.
Maybe the crash will reproduce again with a slightly less dummy lib?
From lib/linux_x86_64_glibc_228/dpcpp you can compile one that uses the same compiler and also links to SYCL:
echo "" | PATH=./bin:$PATH LD_LIBRARY_PATH=./lib clang++ -shared -fPIC -fsycl -x c++ - -o libcycles_kernel_oneapi_aot.so.dummyfsycl
I'm attaching it if you can give it a try.


Thanks for the detailed answer.

> Maybe the crash will reproduce again with a slightly less dummy lib?

I tried it and unfortunately the bugs still appear.

Let me know if you have further ideas.


A quick observation, I don't know what it is worth:

When comparing the bindings with the dummy lib vs the normal lib, I see that there are some additional bindings for libsycl when the normal libcycles_kernel_oneapi is loaded (around 10 of them):

```
binding file blender-4.0.0-alpha/lib/libsycl.so.6 [0] to /usr/lib/libc.so.6 [0]: normal symbol `dl_iterate_phdr' [GLIBC_2.2.5]
binding file blender-4.0.0-alpha/lib/libsycl.so.6 [0] to /usr/lib/libc.so.6 [0]: normal symbol `getenv' [GLIBC_2.2.5]
binding file blender-4.0.0-alpha/lib/libsycl.so.6 [0] to /usr/lib/libc.so.6 [0]: normal symbol `memcmp' [GLIBC_2.2.5]
[...]
```

I attached the full bindings files, in case you want to compare them yourself.
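For anyone who wants to repeat this comparison, here is a minimal Python sketch of how two binding logs could be diffed. It assumes log lines in the format quoted above (the kind glibc prints with `LD_DEBUG=bindings`); the function names and the `source_suffix` parameter are made up for illustration and are not part of any Blender tooling.

```python
import re

# Matches lines such as:
#   binding file A [0] to B [0]: normal symbol `name' [VERSION]
# re.search is used later so a leading "PID:" prefix does not matter.
BINDING_RE = re.compile(
    r"binding file (?P<src>\S+) \[\d+\] to (?P<dst>\S+) \[\d+\]: "
    r"normal symbol `(?P<sym>[^']+)'"
)


def parse_bindings(log_text):
    """Return the set of (source, target, symbol) triples found in a log."""
    bindings = set()
    for line in log_text.splitlines():
        match = BINDING_RE.search(line)
        if match:
            bindings.add((match["src"], match["dst"], match["sym"]))
    return bindings


def extra_bindings(normal_log, dummy_log, source_suffix="libsycl.so.6"):
    """List bindings seen only in the normal-lib log for one source library."""
    only_normal = parse_bindings(normal_log) - parse_bindings(dummy_log)
    return sorted(b for b in only_normal if b[0].endswith(source_suffix))
```

Calling `extra_bindings(open("normal.log").read(), open("dummy.log").read())` (file names hypothetical) would list the libsycl bindings that only show up with the real kernel library.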
Member

> Thanks for the detailed answer.
>
> > Maybe the crash will reproduce again with a slightly less dummy lib?
>
> I tried it and unfortunately the bugs still appear.

You mean fortunately? That was my hope, and it's "good progress" if the less-dummy lib reproduces the issue. It makes the number of differences between the OK and the KO cases quite thin now.

Can you share the binding logs for the libcycles_kernel_oneapi_aot.so.dummyfsycl?

Edit: :D What's wrong with me? **The bugs don't appear with the fsycl dummy lib**, that's definitely what I wanted to say!

Hi Xavier, following your example of the dummy lib using the same compiler and linking to sycl, I tried the exact command run by cmake and removed the flags one by one until I found the faulty one: without _-ffast-math_, the resulting libcycles doesn't cause the bugs. Does that make sense to you?
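The one-by-one elimination described above can be sketched as follows. This is only an illustration: `reproduces_bug` is a hypothetical callback that would rebuild the library with the given flags and report whether the overlay bugs show up, and the flag list in the usage note is invented, not the actual cmake command.

```python
def find_culprit_flags(flags, reproduces_bug):
    """Return the flags whose individual removal makes the bug disappear.

    flags: list of compiler flags from the failing build.
    reproduces_bug: callable taking a list of flags and returning True
        when a build with exactly those flags still shows the bug.
    """
    if not reproduces_bug(flags):
        raise ValueError("the full flag set does not reproduce the bug")

    culprits = []
    for index, flag in enumerate(flags):
        # Rebuild with every flag except this one.
        remaining = flags[:index] + flags[index + 1:]
        if not reproduces_bug(remaining):
            culprits.append(flag)
    return culprits
```

For example, with a stand-in check such as `lambda fs: "-ffast-math" in fs` and the made-up list `["-O2", "-fPIC", "-ffast-math", "-shared"]`, the function returns `["-ffast-math"]`. In practice each call means a rebuild and a manual look at the viewport, so a linear pass like this is only reasonable for a short flag list.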
Member

Great finding! What is happening isn't 100% clear yet, but it could make sense if some libs check/set FP environment flags or load an implementation that was compiled with unexpected flags.

Can you pinpoint which of the flags implied by -ffast-math breaks everything? If we're lucky, one of them alone will.
They're listed here: https://clang.llvm.org/docs/UsersManual.html#cmdoption-ffast-math
I'd recommend trying these:

  • -fno-math-errno
  • -fno-trapping-math
  • -ffinite-math-only

So first I tried all the flags implied by fast-math together, but the bugs were absent. Then I added "crtfastmath.o" to the compilation (without -ffast-math) and the bugs were back. So I guess this is the culprit. [Here it says](https://clang.llvm.org/docs/UsersManual.html#crtfastmath-o) that it "adds a static constructor that sets the FTZ/DAZ bits in MXCSR, affecting not only the current compilation unit but all static and shared libraries included in the program".
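To make the quoted sentence a bit more concrete, here is a small Python sketch of what those two bits are and what they do to a value. The bit positions (FTZ is bit 15, DAZ is bit 6) and the default MXCSR value 0x1F80 come from the x86 SSE definition. The code only decodes an integer and emulates the flushing on a double; it does not read or change the real register, and the function names are made up for illustration.

```python
import math
import sys

FTZ_BIT = 1 << 15  # flush-to-zero: subnormal results are replaced by zero
DAZ_BIT = 1 << 6   # denormals-are-zero: subnormal inputs are read as zero
DEFAULT_MXCSR = 0x1F80  # default value: exceptions masked, FTZ/DAZ clear


def decode_mxcsr(value):
    """Report whether the FTZ and DAZ bits are set in an MXCSR value."""
    return {"ftz": bool(value & FTZ_BIT), "daz": bool(value & DAZ_BIT)}


def flush_subnormal(x):
    """Emulate flushing on a double: subnormal magnitudes become signed zero."""
    if x != 0.0 and abs(x) < sys.float_info.min:
        return math.copysign(0.0, x)
    return x
```

With the default value both flags decode as off. Once a static constructor has set them, every library in the process computes with subnormals flushed, which is why code that never asked for fast math can still be affected.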
Member

Awesome, we have the root cause now!
And a Blender-side fix is in sight, as there is absolutely no valid reason for setting the FTZ/DAZ bits from this library.
I'll work on a PR to make it use a subset of -ffast-math instead.

I wasn't expecting the compiler to do that... This is fixed in GCC (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55522) but not yet in clang: https://github.com/llvm/llvm-project/issues/57589
Member

Can you confirm this test build works? https://builder.blender.org/download/patch/PR111708/

Yes, I confirm that it works! Thank you very much (especially for your patience and your explanations :) ).
Blender Bot added Status: Resolved and removed Status: Confirmed labels 2023-08-31 10:43:40 +02:00

Hi guys! I think it's time for me to buy a new machine for the upcoming Blender 4.0 xD
I was not able to run this Blender patch build ( https://builder.blender.org/download/patch/PR111708/ ) on my system (Ubuntu 22.04 on an eleven-year-old machine).

Terminal entries:

Wayland:

chicortiz@chicortiz:~/Downloads/blender-4.0.0-alpha+main-PR111708.fa9bf4a65c4d-linux.x86_64-release$ ./blender
EGL Error (0x3009): EGL_BAD_MATCH: Arguments are inconsistent (for example, a valid context requires buffers not supplied by a valid surface).
EGL Error (0x3009): EGL_BAD_MATCH: Arguments are inconsistent (for example, a valid context requires buffers not supplied by a valid surface).
EGL Error (0x3009): EGL_BAD_MATCH: Arguments are inconsistent (for example, a valid context requires buffers not supplied by a valid surface).
EGL Error (0x3009): EGL_BAD_MATCH: Arguments are inconsistent (for example, a valid context requires buffers not supplied by a valid surface).
Warning: No OpenGL vendor detected.
blender: ../external_epoxy/src/dispatch_common.c:872: epoxy_get_proc_address: Assertion `0 && "Couldn't find current GLX or EGL context.\n"' failed.
Aborted (core dumped)
chicortiz@chicortiz:~/Downloads/blender-4.0.0-alpha+main-PR111708.fa9bf4a65c4d-linux.x86_64-release$

Xorg:

chicortiz@chicortiz:~/Downloads/blender-4.0.0-alpha+main-PR111708.fa9bf4a65c4d-linux.x86_64-release$ ./blender
GHOST: Wayland: unable to connect to display!
Writing: /tmp/blender.crash.txt
Segmentation fault (core dumped)

Author
Member

Note: this fix does seem to have fixed #109780 too, tested by @aperitero (https://projects.blender.org/blender/blender/issues/109780#issuecomment-1010181).
Author
Member

@xavierh we probably need some further action on this and #111161, since they don't seem to be resolved in every situation.

@Francisco-Ortiz-de-Carvalho and @alba-delba still report problems.
YimingWu reopened this issue 2023-09-05 06:58:42 +02:00
Blender Bot added Status: Needs Triage and removed Status: Resolved labels 2023-09-05 06:58:45 +02:00
Author
Member

Seems to be partly related to #110631 as well.
Pratik Borhade added Status: Needs Info from Developers and removed Status: Needs Triage labels 2023-09-05 07:06:00 +02:00
Member

@ChengduLittleA the issue and fix for the current issue are in 4.0, and in the list to be backported to 3.6. If there are still similar rendering issues happening, it's best to open a new ticket and track them there, as the root cause, the fix, and the developers working on it will all be different.

> @ChengduLittleA the issue and fix for the current issue are in 4.0, and in the list to be backported to 3.6. If there are still similar rendering issues happening, it's best to open a new ticket and track them there, as the root cause, the fix, and the developers working on it will all be different.

Is anyone working on backporting it to 3.6.x? Blender 4.0 alpha is too unstable to use right now.
Can someone point me in the right direction? I'd love to be notified if/when it gets backported,
or at least be able to check its status from time to time.

@alba-delba
As said before, it's in the list of backports to 3.6. Since 3.6.3 is already in the candidate phase, I guess you will have to wait for 3.6.4. Considering the calendar of previous fix releases (1/month?), it should happen in the second half of October. You can see the list of backports here (look for #111162): https://projects.blender.org/blender/blender/issues/109399

[correction] It has been backported to 3.6.3.

In the meantime, you can always use the dummy lib fix:

- [download Blender 3.6](https://www.blender.org/download/lts/3-6/) and extract it somewhere.
- replace the file _libcycles_kernel_oneapi_aot.so_ in the _lib_ directory with the attached file.
- clear the Mesa shader cache if needed and run the downloaded Blender.
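For the last step, a hedged Python sketch of clearing the cache. It assumes Mesa's documented default location (`$XDG_CACHE_HOME/mesa_shader_cache`, falling back to `~/.cache/mesa_shader_cache`) and that no `MESA_SHADER_CACHE_DIR` override is set; the directory name may differ between Mesa versions, so check the path before deleting anything. The function names are made up for illustration.

```python
import os
import shutil
from pathlib import Path


def mesa_shader_cache_dir(env=None):
    """Return the assumed default Mesa shader cache directory."""
    env = os.environ if env is None else env
    xdg_cache_home = env.get("XDG_CACHE_HOME")
    # Assumption: default layout only, no MESA_SHADER_CACHE_DIR override.
    base = Path(xdg_cache_home) if xdg_cache_home else Path.home() / ".cache"
    return base / "mesa_shader_cache"


def clear_mesa_shader_cache(env=None):
    """Delete the cache directory if it exists; return True if it was removed."""
    cache_dir = mesa_shader_cache_dir(env)
    if cache_dir.is_dir():
        shutil.rmtree(cache_dir)
        return True
    return False
```

Mesa recreates the directory on the next launch, so removing it only costs a one-time shader recompilation.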

@aperitero I tried this dummy lib fix and it works in 3.6.3c, but not in 3.6,
which is okay by me: 3.6.3c is stable enough for production use.
<3 Thank you for the fix.
Author
Member

I believe this issue can be closed.

Note to later visitors: this is supposedly fixed with 09df1f4cafb996802af4c89ab8af0630f750d599, but you need to clear the Mesa shader cache for the fix to take effect. See comments above for detailed instructions.
Blender Bot added Status: Archived and removed Status: Needs Info from Developers labels 2023-09-09 10:44:47 +02:00