Build Bot: MacOS X test fails #81077

Closed
opened 2020-09-23 09:44:46 +02:00 by Jeroen Bakker · 27 comments
Member

Since we released Blender 2.90.0, the build bot tests have been failing on macOS.

        Start  51: script_pyapi_prop_array
 51/157 Test  #51: script_pyapi_prop_array ...................   Passed    0.38 sec
        Start  52: id_management
 52/157 Test  #52: id_management .............................***Exception: SegFault  2.89 sec
......Blender 2.90.1 (hash 3e85bb34d0d7 built 2020-09-23 07:24:50)
found bundled python: /Users/blender/blender-buildbot/macos_290/install/Blender.app/Contents/Resources/2.90/python

----------------------------------------------------------------------
Ran 6 tests in 2.646s

OK
Writing: /var/folders/5s/6pmgq7ns62ng1r77k17kv6fm0000gn/T/blender.crash.txt

        Start  53: blendfile_io
 53/157 Test  #53: blendfile_io ..............................   Passed    0.35 sec
        Start  54: blendfile_liblink
 54/157 Test  #54: blendfile_liblink .........................   Passed    0.35 sec
        Start  55: bmesh_bevel

stdio: https://archive.blender.org/developer/F8911387/stdio

The failure occurs most of the time, only on the macOS build bot and always in the same test, but it is not tied to a specific commit.
2.90.0 was released with passing tests, but the day after that the test started to fail. Strangely enough, the test did pass once two days ago, after we added all the fixes for 2.90.1.

This needs investigation. I set it to Unbreak now as this halts the release for 2.90.1. Is there anything I can do?

Author
Member

Changed status from 'Needs Triage' to: 'Confirmed'

Author
Member

Added subscribers: @Jeroen-Bakker, @mont29, @Sergey

Author
Member

Added subscriber: @sebbas

I am not sure why I'm in the subscribers. This is not specific to the buildbot setup; it will happen on any macOS build. Compiling with ASAN will make it easier to catch the issue.

> Is there anything I can do?

I do not think so. Someone on a Mac should dig into it and see whether something is wrong in the test itself or in the code.

> I set it to Unbreak now as this halts the release for 2.90.1

Not sure why this is being done on the day of release. This is not a newly introduced issue. For the release, it is safer NOT to make code changes at this point.

That doesn't mean we should not fix the issue; it's just that, to me, this is not a stopper for 2.90.1.

Author
Member

Lowering the priority: after testing with the dmg, we decided to continue with the release as is.

I got these tests failing on my macOS machine with today's master (a6b16cfd80):

11 - id_management (SEGFAULT)
29 - export_ply_vertices (Failed)
50 - cycles_volume (Failed)

Needs further investigation ..

Regarding id_management, did someone check that it was not a mere 'out of RAM' issue? Those tests are run in parallel now, iirc this can consume quite a lot of memory… Would also explain why it passes sometimes, and sometimes not?

Not sure how valid this remark is though, don't know the specs of our buildbots.

> Those tests are run in parallel now

Where is this information coming from?

The issue can be easily reproduced on macOS:

  1. Compile Blender with ASAN
  2. Run ctest -R id_management

Please look into actual problems rather than speculating that something is wrong on the buildbot.

In #81077#1022200, @Sergey wrote:

> > Those tests are run in parallel now
>
> Where is this information coming from?
>
> The issue can be easily reproduced on macOS:
>
>   1. Compile Blender with ASAN
>   2. Run ctest -R id_management
>
> Please look into actual problems rather than speculating that something is wrong on the buildbot.

I am not speculating, I am asking a question, after facing the same out-of-memory issue here. And I would like to know how I am supposed to investigate an issue that only shows up on an OS I have absolutely no access to.

@mont29, I'm not sure why you would be the one to look into the issue. As I've mentioned above, someone on macOS should look into it, it is easy to reproduce, and it is not specific to the buildbot.
At this time I don't think you should be looking into this issue. Give the Mac people some time to dig deeper and then, maybe, assist with addressing the root cause once it is identified.

Member

Added subscriber: @ankitm

Member

Removed. I couldn't reproduce the original crash and had assumed that what I fixed was what was happening on the buildbot.

Member

Please ignore the previous comment, it is a separate issue.
The Debug build didn't crash at all, so I built Release with ASan and got a heap-use-after-free caused by the experimental method batch_remove(..): P1659 (https://archive.blender.org/developer/P1659.txt)

Added subscriber: @dfelinto

@ankitm do we have any updates on that?

Member

Added subscriber: @JulianEisel

Member

The day started with P1726#8937 showing that id->us is not 0 and that MECube was being freed while it still had a user. The output was:

> id_delete: deleting MECube (1)

Later, @JulianEisel shared a patch that fixed it for him, but not for me: https://pasteall.org/4OjK/slim

Then, after a lot of debug statements and misguided breakpoints, I found that the body of this for-loop is not being executed at all:

    for (id = last_remapped_id->next; id; id = id->next) {

While trying to debug why, P1726#8932 surprisingly fixed the test and also brought id->us from 1 down to 0.
Ray suggested P1726#8936, and that is also a fix.

The crash/test failure happens only in Release and RelWithDebInfo builds, not in Debug ones. (ASAN doesn't affect that.)

From my uneducated point of view, this sounds like the Clang optimizer being overly aggressive here, to say the least...

Those patches are useful for investigating, but none of them is an acceptable fix, of course. They are all ways to 'hide' the problem with extra processing that somehow forces the compiler to generate correct code again (or by disabling optimization altogether).

I will try with Clang on Linux tomorrow out of curiosity (what is the version on OSX btw?), but did you try a full explicit init of tagged_deleted_ids, with two NULL pointers? That's the only obvious thing I can see from quickly checking the code again.

And obviously, big thanks to everybody for investigating this hairy issue!

Member

buildbot is using "AppleClang 12.0.0.12000032"
Julian is using "Apple clang version 12.0.0 (clang-1200.0.32.21)"
I'm using LLVM "clang version 12.0.0 (https://github.com/llvm/llvm-project.git e139450166a7c23ad42f839eddb1e34553967d78)"
I also tested "AppleClang 10.0.1.10010046".
Same results in all four.

> did you try a full explicit init of tagged_deleted_ids, with two NULL pointers?

P1726#8940, this? It crashes with this patch applied.

In #81077#1039111, @ankitm wrote:

> > did you try a full explicit init of tagged_deleted_ids, with two NULL pointers?
>
> P1726#8940, this? It crashes with this patch applied.

Yes, it was the only potentially fuzzy thing I could spot (though I would not have expected it to be an issue)...

No issues here with clang 9, trying with clang 11 now...

And Clang 11 also passes fine here :(

Member

Comparison of the Clang and GCC output: https://godbolt.org/z/nKzaqE
Code of interest:

      if (last_remapped_id == NULL) {
        dummy_link.next = tagged_deleted_ids.first;
        last_remapped_id = (ID *)(&dummy_link);
      }
Member

Julian found this fix: P1726#8941 (making last_remapped_id volatile).

This issue was referenced by 2ddecfffc3

Member

Changed status from 'Confirmed' to: 'Resolved'

Ankit Meel self-assigned this 2020-10-26 10:33:27 +01:00

This issue was referenced by 30ec0753c7

Reference: blender/blender#81077