Build Bot: MacOS X test fails #81077


Jeroen Bakker · 2020-09-23T09:44:46+02:00

Jeroen Bakker commented

2020-09-23 09:44:46 +02:00

Since we released Blender 2.90.0, the build bot tests have been failing on macOS.

        Start  51: script_pyapi_prop_array
 51/157 Test  #51: script_pyapi_prop_array ...................   Passed    0.38 sec
        Start  52: id_management
 52/157 Test  #52: id_management .............................***Exception: SegFault  2.89 sec
......Blender 2.90.1 (hash 3e85bb34d0d7 built 2020-09-23 07:24:50)
found bundled python: /Users/blender/blender-buildbot/macos_290/install/Blender.app/Contents/Resources/2.90/python

----------------------------------------------------------------------
Ran 6 tests in 2.646s

OK
Writing: /var/folders/5s/6pmgq7ns62ng1r77k17kv6fm0000gn/T/blender.crash.txt

        Start  53: blendfile_io
 53/157 Test  #53: blendfile_io ..............................   Passed    0.35 sec
        Start  54: blendfile_liblink
 54/157 Test  #54: blendfile_liblink .........................   Passed    0.35 sec
        Start  55: bmesh_bevel

Full log: stdio (https://archive.blender.org/developer/F8911387/stdio)

The failure occurs most of the time and only on the macOS build bot. It is always the same test, but it is not tied to a specific commit.
2.90.0 was released with passing tests, but the test started to fail the day after. Strangely enough, the test did pass once two days ago, after we added all the fixes for 2.90.1.

This needs investigation. I set it to Unbreak Now, as this halts the release of 2.90.1. Is there anything I can do?


Jeroen Bakker commented

2020-09-23 09:44:46 +02:00

Changed status from 'Needs Triage' to: 'Confirmed'

Jeroen Bakker commented

2020-09-23 09:44:46 +02:00

Added subscribers: @Jeroen-Bakker, @mont29, @Sergey

Jeroen Bakker commented

2020-09-23 09:56:12 +02:00

Added subscriber: @sebbas

Sergey Sharybin commented

2020-09-23 10:10:58 +02:00

I am not sure why I'm in the subscribers. This is not specific to the buildbot setup; it will happen on any macOS build. Compiling with ASAN will make it easier to catch the issue.

> Is there anything I can do?

I do not think so. Someone on a Mac should dig into it and see whether something is wrong in the test itself or in the code.

> I set it to Unbreak now as this halts the release for 2.90.1

I am not sure why this is being done on the day of the release. This is not a newly introduced issue. For the release it is safer NOT to make code changes at this point.

That doesn't mean we should not fix the issue; it is just that, to me, this is not a stopper for 2.90.1.


Jeroen Bakker commented

2020-09-23 10:32:53 +02:00

Lowering the priority: after testing with the dmg, we decided to continue with the release as is.

Sebastián Barschkis commented

2020-09-23 11:07:32 +02:00

These tests fail on my macOS machine with today's master (a6b16cfd801f):

11 - id_management (SEGFAULT)
29 - export_ply_vertices (Failed)
50 - cycles_volume (Failed)

Needs further investigation.


Bastien Montagne commented

2020-09-24 10:35:54 +02:00

Regarding id_management, did someone check that it was not a mere 'out of RAM' issue? Those tests are run in parallel now, iirc this can consume quite a lot of memory… Would also explain why it passes sometimes, and sometimes not?

Not sure how valid this remark is though, don't know the specs of our buildbots.


Sergey Sharybin commented

2020-09-24 11:00:41 +02:00

> Those tests are run in parallel now

Where is this information coming from?

The issue can be easily reproduced on macOS by:

1. Compile Blender with ASAN
2. `ctest -R id_management`

Please look into actual problems rather than speculating that something is wrong on the buildbot.


Bastien Montagne commented

2020-09-24 14:09:16 +02:00

In #81077#1022200, @Sergey wrote:

> > Those tests are run in parallel now
>
> Where this information is coming from?
>
> The issue can be easily reproduced on macOS by:
>
> 1. Compile Blender with ASAN
> 2. `ctest -R id_management`
>
> Please look into actual problems rather than speculating that something is wrong on the buildbot.

I am not speculating, I am asking a question, after facing the same out-of-memory issue here. And I would like to know how I am supposed to investigate an issue that only shows up on an OS I have absolutely no access to.


Sergey Sharybin commented

2020-09-24 15:51:42 +02:00

@mont29, I'm not sure why you're the one who is supposed to look into the issue. As I've mentioned above, someone on macOS should look into it; it is easy to reproduce and it is not specific to the buildbot.
At this time I don't think you should be looking into this issue. Give the Mac people some time to dig deeper and then, maybe, assist with addressing the root cause once it is identified.


Ankit Meel commented

2020-09-25 13:51:42 +02:00

Added subscriber: @ankitm

Ankit Meel commented

2020-09-25 13:51:42 +02:00

Removed. I couldn't reproduce the original crash and thought that what I had fixed was what was happening on the buildbot.

Ankit Meel commented

2020-09-26 01:48:38 +02:00

Please ignore the previous comment, it is a separate issue.
The Debug build didn't crash at all, so I built Release with ASan and got a heap-use-after-free caused by the experimental method `batch_remove(..)`: P1659 (https://archive.blender.org/developer/P1659.txt)


Dalai Felinto commented

2020-10-21 19:14:29 +02:00

Added subscriber: @dfelinto

Dalai Felinto commented

2020-10-21 19:14:29 +02:00

@ankitm do we have any updates on that?

Ankit Meel commented

2020-10-21 21:49:27 +02:00

Added subscriber: @JulianEisel

Ankit Meel commented

2020-10-21 21:49:27 +02:00

The day started with P1726 (https://archive.blender.org/developer/P1726.txt) #8937 showing that `id->us` is not 0 and that `MECube` was being freed while it still had a user. The output was:

id_delete: deleting MECube (1)
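
The number in parentheses in that output is the user count. As a minimal sketch of the invariant being violated (a datablock should only be freed once its user count has dropped to 0), assuming a reduced, hypothetical `DemoID` struct and helper names that are not Blender's actual API:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, reduced stand-in for a datablock header; only the
 * fields needed for this illustration are present. */
typedef struct DemoID {
  struct DemoID *next, *prev;
  char name[66];
  int us; /* User count: how many references to this datablock remain. */
} DemoID;

/* Format a log line in the same shape as the output quoted above,
 * i.e. "id_delete: deleting <name> (<user count>)".
 * Returns the number of characters snprintf would have written. */
int demo_format_delete_log(const DemoID *id, char *buf, size_t buf_size)
{
  return snprintf(buf, buf_size, "id_delete: deleting %s (%d)", id->name, id->us);
}

/* A datablock is only safe to free once nothing references it anymore. */
bool demo_id_is_safe_to_free(const DemoID *id)
{
  return id != NULL && id->us == 0;
}
```

With a `DemoID` named `MECube` and `us == 1`, `demo_id_is_safe_to_free()` returns false, which corresponds to the situation the debug output exposed.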

Later, @JulianEisel shared a patch that had fixed it for him, but not for me: https://pasteall.org/4OjK/slim

After a lot of debug statements and misguided breakpoints, I found that the body of the for-loop `for (id = last_remapped_id->next; id; id = id->next) {` is not being executed at all. While trying to debug why that is, surprisingly P1726 #8932 fixed the test and also changed `id->us` from 1 to 0.
Ray suggested P1726 #8936, which is also a fix.

The crash/test failure happens only in Release and RelWithDebInfo builds, not in Debug ones (ASAN doesn't affect that).


Bastien Montagne commented

2020-10-21 22:44:37 +02:00

From my uneducated point of view, this sounds like the Clang optimizer being overly aggressive here, to say the least...

Those patches are nice for investigating, but none of them is an acceptable fix of course. They are all ways to 'hide' the problem, either with extra processing that somehow forces the compiler to generate correct code again or by disabling optimization.

I will try with clang on Linux tomorrow out of curiosity (what is the version on OSX btw?), but did you try a full explicit init of `tagged_deleted_ids`, with two NULL pointers? That's the only obvious thing I can see from quickly checking the code again.


Bastien Montagne commented

2020-10-21 22:45:24 +02:00

And obviously, big thanks to everybody for investigating this hairy issue!

Ankit Meel commented

2020-10-22 08:54:18 +02:00

buildbot is using "AppleClang 12.0.0.12000032"
Julian is using "Apple clang version 12.0.0 (clang-1200.0.32.21)"
I'm using LLVM "clang version 12.0.0 (https://github.com/llvm/llvm-project.git e139450166a7c23ad42f839eddb1e34553967d78)"
I also tested "AppleClang 10.0.1.10010046".
Same results in all four.

> did you try a full explicit init of tagged_deleted_ids, with two NULL pointers?

P1726 #8940, this one? It still crashes with this patch applied.


Bastien Montagne commented

2020-10-22 11:12:31 +02:00

In #81077#1039111, @ankitm wrote:

> > did you try a full explicit init of tagged_deleted_ids, with two NULL pointers?
>
> P1726 #8940 this ? It crashes with this patch applied.

Yes, it was the only potentially fuzzy thing I could spot (though I would not have expected it to be an issue)...

No issues here with clang 9, trying with clang 11 now...


Bastien Montagne commented

2020-10-22 11:23:47 +02:00

And Clang 11 also passes fine here :(

Ankit Meel commented

2020-10-22 11:33:07 +02:00

Comparison of clang and gcc: https://godbolt.org/z/nKzaqE
Code of interest:

      if (last_remapped_id == NULL) {
        dummy_link.next = tagged_deleted_ids.first;
        last_remapped_id = (ID *)(&dummy_link);
      }
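
The snippet above makes `last_remapped_id` point at a dummy link so that the loop `for (id = last_remapped_id->next; ...)` can start from the head of the list. As a hedged illustration only, the same "start after the last remapped element, or at the list head if there is none" logic can be written without casting between struct types. The `DemoID` type and function name below are hypothetical, and this is not claimed to be the change that was eventually committed:

```c
#include <stddef.h>

/* Hypothetical, reduced stand-in for a linked datablock. */
typedef struct DemoID {
  struct DemoID *next, *prev;
} DemoID;

/* Count the elements still to be processed: everything after
 * `last_remapped`, or the whole list starting at `first` when nothing
 * has been remapped yet. The starting point is chosen with a plain
 * conditional, so no dummy link and no pointer cast are involved. */
int demo_count_pending(DemoID *first, DemoID *last_remapped)
{
  DemoID *id = (last_remapped != NULL) ? last_remapped->next : first;
  int count = 0;
  for (; id != NULL; id = id->next) {
    count++;
  }
  return count;
}
```

For a three-element list, passing `NULL` as `last_remapped` visits all three elements, while passing the first element visits the remaining two.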


Ankit Meel commented

2020-10-22 11:42:35 +02:00

Julian found this fix P1726 #8941 (making last_remapped_id volatile)
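
For readers unfamiliar with the qualifier, a small sketch of what a volatile pointer variable looks like in C. The contents of the paste are not reproduced in this thread, so the placement of `volatile` here (on the pointer itself rather than on the pointee) is an assumption, and `DemoID` and the function name are hypothetical. As noted earlier in the thread, changes of this kind were regarded as workarounds rather than proper fixes:

```c
#include <stddef.h>

/* Hypothetical, reduced stand-in for a linked datablock. */
typedef struct DemoID {
  struct DemoID *next, *prev;
} DemoID;

/* Walk the list starting from a dummy head. Declaring the pointer
 * variable itself volatile (note the position of the qualifier, after
 * the '*') requires the compiler to really perform every read and
 * write of `last_remapped` instead of reasoning about its value. */
int demo_count_from_dummy(DemoID *first)
{
  DemoID dummy = {first, NULL};
  DemoID *volatile last_remapped = NULL;

  if (last_remapped == NULL) {
    last_remapped = &dummy;
  }

  int count = 0;
  for (DemoID *id = last_remapped->next; id != NULL; id = id->next) {
    count++;
  }
  return count;
}
```

By contrast, `volatile DemoID *p` would qualify the object pointed to, not the pointer variable.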


blender-admin commented

2020-10-26 10:32:20 +01:00

This issue was referenced by 2ddecfffc3d3a3a1db4ae45e8665caa2a85ab43a

Ankit Meel commented

2020-10-26 10:33:27 +01:00

Changed status from 'Confirmed' to: 'Resolved'

Ankit Meel closed this issue

2020-10-26 10:33:27 +01:00

Ankit Meel self-assigned this 2020-10-26 10:33:27 +01:00

blender-admin commented

2020-10-28 16:24:10 +01:00

This issue was referenced by 30ec0753c75ca4c4ca8744727b7ac70b12d074f6
