Creating and removing many objects very quickly causes a crash #84397
System Information
Operating system: Linux-5.10.4-arch2-1-x86_64-with-arch 64 Bits
Graphics card: GeForce GTX 1080 Ti/PCIe/SSE2 NVIDIA Corporation 4.5.0 NVIDIA 455.45.01
Blender Version
Broken: b71eb3a105
Worked: 82645ff739
Short description of error
Creating and removing many objects very quickly causes a crash. I created a .blend file to illustrate the issue. The script in the file implements a simple operator that adds a specified number of objects (an operator property) to the active collection.
Changing the number of objects sometimes causes a crash. I found the crash to be more likely the more objects are added. For the default count of 800 objects it crashes reliably.
Exact steps for others to reproduce the error
Things I tested
I bisected the history and identified b71eb3a105 as the first bad commit.
The crash happens because subdiv_ccg is 0x1.
{F9551145} {F9551152} many_object_update_crash.crash.txt
Added subscriber: @oweissbarth
Added subscriber: @rjg
Changed status from 'Needs Triage' to: 'Confirmed'
I'm not certain if the commit actually introduced the bug or only made an underlying problem apparent in the undo system or dependency graph (e.g. similar to #80203 which could only be reproduced on macOS). I can reproduce the crash on Windows.
@oweissbarth Are you able to reproduce this in a debug build with ASAN on Linux, because I haven't been able to?
I can confirm that it does not crash with ASAN on Linux.
Added subscriber: @LazyDodo
looks like heap corruption to MSVC but the issue goes away as soon as you add any kind of heap validation, so it's a race-y kind of corruption? neat!
my stack is radically different from the one in the opening post though.
Added subscriber: @ankitm
Even the macOS one #80203 goes away if asan is enabled.
Added subscriber: @JacquesLucke
I couldn't figure out the root cause yet. However, I have some more information.
I get the same backtrace as @oweissbarth (subdiv_ccg was 0x1 for me as well).
Also I was able to reproduce the issue reliably in b71eb3a105, but not in the commit before that.
I was able to track down what change in that commit is responsible for breaking the given example file: 16 new bytes have been added to ID.
If I checkout b71eb3a105 (the commit before the one above), the test file works fine initially.
When I now apply this diff, the test file starts to fail.
Instead of adding these bytes to ID I could also add them anywhere in the Mesh struct. Just adding 8 bytes was not enough. I tested both cases many times; it was a very reliable way to reproduce the crash, only in release builds though.
Unfortunately, while interesting, this information is not enough to fix the bug yet.
I also tried creating lights and cameras instead of meshes in the test file, and was able to get similar but slightly different crashes. I didn't have enough time to bisect this issue in older commits yet, might do it tomorrow.
Furthermore, I wondered if there is maybe some offsetof call that is not recompiled when certain parts of DNA change. That was not the case though, this should have been fixed by a clean compilation, but it wasn't.
also small update on my end, the crash i'm seeing on windows appears to be a different one? I can repro it in all hashes mentioned in this ticket including the one listed as working.
Added subscriber: @JulianEisel
I just noticed that it might be related to rBL62457, which also isn't very useful...
I checked out 9db4e44961 + one of the patches below. This compiles with rBL62457 (broken) and rBL62402 (works). Maybe some cmake options have to be disabled if it does not work immediately.
Interestingly, the crash is still caused by subdiv_ccg being invalid, so the crash is very predictable.
Another weird thing is that when P1868 is applied, subdiv_ccg will always be 0x1.
When I apply P1869 instead, subdiv_ccg will always be 0x200200002002.
I don't know how that is possible. But I can reproduce this every time.
It would be interesting to see if @oweissbarth can reproduce my findings on his machine.
My system:
Operating system: Linux-5.9.14-arch1-1-x86_64-with-arch 64 Bits
Graphics card: AMD Radeon RX 5700 (NAVI10, DRM 3.39.0, 5.9.14-arch1-1, LLVM 11.0.0) AMD 4.6 (Core Profile) Mesa 20.3.1
I'm removing @JulianEisel as assignee, because while his commit introduced the error, the issue seems to be somewhere else.
Unfortunately, I don't know how to investigate any further currently. And while I found some interesting stuff, I'm not sure if this will actually be useful to solve the core issue.
The issue found by @LazyDodo looks quite different indeed. I have no idea if they are related, but it could well be, somehow. @LazyDodo, were you able to find the oldest commit that contains the error?
Yes, b852db57ba, which is not terribly useful in tracking down the origin of the corruption, the issues seem "different" yet somehow connected, this is fun :)
I tested it and I can confirm your findings.
9db4e44961
I also noticed that the crash does not always happen in BKE_subdiv_ccg_destroy. I also got crashes in IDP_foreach_property (less often).
Great thanks, that's good to know! Still don't know how to continue from here..
P1870: (An Untitled Masterwork) makes it crash reliably for me rather than jumping all over the place, if you take out the context.collection.objects.link(obj) line the crash goes away so kinda feels there's definitely an issue there, is it the same issue we have been chasing? no idea! could be something unrelated.....or not...
Added subscriber: @Sergey
some good news some bad news
bad : the script above seems to be a different issue than what i had been chasing before
good : I managed to capture the heap corruption
bad : Beyond a rough indication where the corruption is occurring i'm not any closer to understanding it
so here we go :)
I finally managed to capture a crash and did a quick diagnostic of the heap corruption using windbg's timetravel feature (imagine a debugger where you can step backwards and forwards)
The process terminates after RtlpLowFragHeapAllocFromContext detects a corrupted heap and calls RtlpLogHeapFailure to report it
so lets break on RtlpLogHeapFailure and run backwards until we hit the breakpoint
a few instruction steps back gets us to this bit of code in RtlpLowFragHeapAllocFromContext
rbx+0Fh is getting compared for a sane value, it is not deemed sane and we terminate, alright, let's see what's there
well that's awesome, but how did that 0xff get there? well given we can see through time, that is not too hard of a question to answer
9 writes in total, let's look at the stacks
frame 0/1/2/3 : deg allocates and RtlpAllocateHeapInternal seemingly sets some housekeeping vars
frame 4 : DepsgraphNodeBuilder's dtor frees some ram, and the heap updates some of the house keeping flags, seems fair..
frame 5/6/7/8/9: we copy some data to it? which is odd, since this was clearly in an internal house keeping area of the heap, not an area we ought to be writing to, it writes to this address multiple times in various stages of BKE_id_copy_ex. this is the stack from frame 9 but all originate from BKE_id_copy_ex
Alright, now that the "where" is known (BKE_id_copy_ex), drawing the issue out of the shadows is rather straightforward:
000001E46BFF37C8 allocated 1416 clearing 1792
The only real mystery is why asan is not picking up on this, I'm somewhat out of my depth with this ID Management code, so I'll leave fixing the bug for someone else
Added subscriber: @ideasman42
Looked into this bug, the problem is caused by the depsgraph using a Map (DepsgraphNodeBuilder.id_info_hash_) keeping a map of ID to IDInfo data between undo steps.
On redo, new IDs are freed and created, never updating id_info_hash_.
When creating many IDs, an Object ID (in my case) is getting allocated at the address previously used for a mesh, causing the IDNode.id_cow to copy mesh data into an object pointer (hence the buffer overrun in BKE_id_copy_ex).
We could try to fix this by updating the depsgraph's runtime data to keep it valid, however I'm not sure if this is worth doing. It's already being rebuilt when adding/deleting objects for example.
A simpler solution could be to keep the optimization as-is, but do a full rebuild if undo adds/removes ID data-blocks. So stale ID data never gets used.
This is some quick-hack patch that does this - for reference: P1872
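The failure mode described in this comment can be illustrated outside Blender with a toy model. The sketch below is not Blender code: ToyHeap, FakeID and stale_entries are hypothetical names, the allocator is a deliberately simple one that recycles the last freed address, and the sizes reuse the 1416/1792 figures quoted earlier in the thread (which struct each size belongs to is an assumption here).

```python
from dataclasses import dataclass


@dataclass
class FakeID:
    """Hypothetical stand-in for a data-block; `address` mimics its pointer."""
    address: int
    kind: str   # e.g. "Mesh" or "Object"
    size: int   # bytes the block occupies (illustrative values only)


class ToyHeap:
    """Toy allocator that hands out the most recently freed address first."""

    def __init__(self):
        self._next = 0x1000
        self._free = []

    def alloc(self, kind, size):
        if self._free:
            address = self._free.pop()
        else:
            address = self._next
            self._next += 0x1000
        return FakeID(address, kind, size)

    def free(self, block):
        self._free.append(block.address)


def stale_entries(cache, live_blocks):
    """Cache entries whose recorded kind differs from the live block at that address."""
    live = {block.address: block for block in live_blocks}
    return [
        (address, info, live[address])
        for address, info in cache.items()
        if address in live and live[address].kind != info["kind"]
    ]


heap = ToyHeap()
mesh = heap.alloc("Mesh", 1792)

# Cache keyed by the raw address, as a pointer-keyed map would be.
cache = {mesh.address: {"kind": mesh.kind, "size": mesh.size}}

heap.free(mesh)                    # undo/redo frees the mesh; cache is not updated
obj = heap.alloc("Object", 1416)   # the new object lands on the recycled address

info = cache[obj.address]          # the lookup "succeeds" with mesh-sized info
overrun = info["size"] - obj.size  # bytes a mesh-sized copy would write past the object
```

In this model the lookup returns mesh-sized information for an object-sized allocation, which is the kind of mismatch that would produce an overrun like the one seen in BKE_id_copy_ex.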
Nice find! That fixes the issue for me as well.
Great work everybody!
Runtime ID pointers over undo/redo are a recurring issue... We have ID.session_uuid now and it would be trivial to solve such issues if we somehow registered a session_uuid -> ID * map that gets updated on destructive main changes (undo, redo, deletion, file reading, ...). Instead of storing a pointer, you'd store the session_uuid and query the pointer if needed. There's BKE_main_idmap already, but these don't get updated on main changes.
Such a registry could be quite expensive in big files if it contained all IDs. We could lazy create the entries, so e.g. when the depsgraph adds a new ID to id_info_hash_ it could ensure the ID is registered in the global map (would still create many entries though). Ideally you could do O(1) lookups by session_uuid right within Main but that's not a simple change.
Just throwing this idea out there, it would make a number of things easier.
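A rough sketch of the registry idea, written in Python rather than Blender's C/C++ purely for illustration. ToyID, SessionRegistry, ensure_registered and on_destructive_change are hypothetical names, not existing Blender API, and the sketch assumes that a data-block keeps its session_uuid when it is re-created by undo/redo.

```python
import itertools


class ToyID:
    """Hypothetical data-block carrying a session-wide unique identifier."""
    _uuid_counter = itertools.count(1)

    def __init__(self, name, session_uuid=None):
        self.name = name
        # A re-created block (e.g. after undo) is assumed to keep its old uuid.
        self.session_uuid = (
            session_uuid if session_uuid is not None else next(ToyID._uuid_counter)
        )


class SessionRegistry:
    """session_uuid -> ID lookup that is filled lazily."""

    def __init__(self):
        self._by_uuid = {}

    def ensure_registered(self, id_block):
        """Register on first use and return the key callers should store."""
        self._by_uuid[id_block.session_uuid] = id_block
        return id_block.session_uuid

    def lookup(self, session_uuid):
        """Return the current block for this uuid, or None if it no longer exists."""
        return self._by_uuid.get(session_uuid)

    def on_destructive_change(self, surviving_ids):
        """Re-point registered entries after undo/redo/deletion; drop vanished ones."""
        alive = {block.session_uuid: block for block in surviving_ids}
        self._by_uuid = {
            uuid: alive[uuid] for uuid in self._by_uuid if uuid in alive
        }


registry = SessionRegistry()
cube = ToyID("Cube")
lamp = ToyID("Lamp")
key_cube = registry.ensure_registered(cube)
key_lamp = registry.ensure_registered(lamp)

# Simulated undo: Cube is re-created as a new object with the same uuid, Lamp is gone.
cube_after_undo = ToyID("Cube", session_uuid=cube.session_uuid)
registry.on_destructive_change([cube_after_undo])
```

Callers that stored key_cube get the re-created block back from lookup, and callers that stored key_lamp get None instead of a dangling pointer. Only blocks that were explicitly registered occupy an entry, which is the lazy-creation trade-off mentioned above.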
neither the paged heap nor asan re-use memory (impossible to detect after use issues otherwise) so that would explain why the issue doesn't show with those tools, thanks for the explanation! that part was bugging me much more than i'd like to admit :)
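Why delayed reuse hides this class of bug can be shown with a toy allocator. This is a simplified model, not how ASAN or the Windows page heap are actually implemented: ToyAllocator and its quarantine_size parameter are hypothetical, and the only behavior modeled is that freed addresses sit in a quarantine before they may be handed out again.

```python
from collections import deque


class ToyAllocator:
    """Toy allocator; with a quarantine, freed addresses are not reused right away."""

    def __init__(self, quarantine_size=0):
        self._next = 0x1000
        self._reusable = []
        self._quarantine = deque()
        self._quarantine_size = quarantine_size

    def alloc(self):
        if self._reusable:
            return self._reusable.pop()
        address = self._next
        self._next += 0x100
        return address

    def free(self, address):
        if self._quarantine_size == 0:
            # Release-build-like behavior: the address is reusable immediately.
            self._reusable.append(address)
            return
        # Checked-heap-like behavior: hold the address back for a while.
        self._quarantine.append(address)
        if len(self._quarantine) > self._quarantine_size:
            self._reusable.append(self._quarantine.popleft())


def address_reused(allocator):
    """Allocate, free, allocate again; report whether the same address came back."""
    first = allocator.alloc()
    allocator.free(first)
    second = allocator.alloc()
    return first == second


release_like = address_reused(ToyAllocator(quarantine_size=0))
checked_like = address_reused(ToyAllocator(quarantine_size=64))
```

Without a quarantine the second allocation lands on the freed address, so a stale pointer-keyed entry silently matches a new block. With a quarantine the new block gets a fresh address and the stale entry is simply never hit.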
Added subscriber: @brecht
@JulianEisel, I expect the depsgraph can use session UUIDs as key values for id_info_hash_ and any similar maps directly, without the need to maintain an additional map.
Added subscriber: @mont29
I will first try to make the depsgraph use those session UUIDs; this looks like the most obvious solution indeed.
What happens on ID deletion, do we rebuild the depsgraph or remove the nodes to be deleted? If it's the latter I guess the ID should be removed from id_info_hash_?
I assumed the IDInfo.id_cow could be an issue over undos, but didn't look into it much and trust your judgement there.
Anyway, if using session_uuid solves the issue: yay!
The dependency graph is fully rebuilt on changes. Evaluated datablocks are either reused, or discarded if they end up unused after the rebuild.
Can people who are able to reproduce the issue confirm whether D10077 (Fix #84397, #80203: use session_uuid instead of ID pointers in depsgraph storage) fixes it for them? Thanks.
It does not fix it for me unfortunately. I still get the same error.
also not fixed here, however the crash moved to a different location, I attached a stack trace in D10077
Even if D10077 does not solve this crash, I think it's good to wrap it up a bit, and commit anyway. It is the proper thing to do. See my comment there.
To eliminate the possibility of "stale" pointers used in the depsgraph you can replace IDInfo *id_info = id_info_hash_.lookup_default(id, nullptr); with IDInfo *id_info = nullptr.
From Campbell's comment it sounds like there is some confusion about id_info_hash_. This hash is only used during the depsgraph relations update, to "transfer" the evaluated state of IDs from the old depsgraph to the new one. It is not possible to "update" id_info_hash_, as this is temporary storage used during the relations update.
I'm not sure this is the root of the problem though, because neither proper use of session_uuid for the id_info_hash_ nor completely ignoring the evaluated state transfer fixes the crash for me. But the crash is different for me: BKE_scene_object_base_flag_sync_from_base has a base whose object is nullptr.
That's the same crash as i get from P1870 which appeared (to me) to be a different problem than the one in the opening post,
I can confirm IDInfo *id_info = nullptr fixes the repro in the opening post but not P1870, I wasn't convinced P1870 wasn't my fault by writing bad python, so I had not pushed the issue very hard
@LazyDodo, ah ok, good to know. Can you test whether P1884 fixes the original issue?
That hits the same crash in D10077, in void DepsgraphNodeBuilder::begin_build()
Given this even hits with the page heap, i'm pretty hopeful asan will catch it on linux and may shed some light on why the pointer is bogus
@LazyDodo, ok, managed to crash. Is trivial, actually: do not de-reference id_orig, store the uuid in the IDNode. Mind checking P1886 ?
@Sergey I tried it with P1886 on master and it works! My crash is gone. Thank you a lot!
can confirm P1886 fixes the opening post, but not the crash inside BKE_scene_object_base_flag_sync_from_base (P1870) given how "muddy" this ticket is already that should perhaps move to its own ticket?
This issue was referenced by 96336007e9
This issue was referenced by f6c7da5759
This issue was referenced by abbc43e4e4
Changed status from 'Confirmed' to: 'Resolved'
@LazyDodo yes please report that BKE_scene_object_base_flag_sync_from_base issue in a new task. :)
Added subscriber: @rayiik-1
Just want to pass this on, as it still appears to be an issue in 2.92, so I wanted to give you guys some crash/debug files in the hope that they might help. In this case it's 100% reproducible every time, regardless of how long I've been in Blender, even on a fresh start. This time I create an object using bpy.ops.mesh.primitive_uv_sphere_add (4 or 5); however, it does not seem to happen with non-primitive object creation and deletion.
Hope this helps, guys, and thanks for all the great hard work. If you need any more tests run, shoot me a message.
3blender.crash.txt
blender_debug_output1.txt
blender.crash.txt
blender_system_info1.txt
blender_debug_output1.txt
blender_system_info.txt
blender3.crash.txt
blender4.crash.txt
blender5.crash.txt
@rayiik-1 Please create a new bug report through Help > Report a Bug in Blender and add the precise steps that lead to the crash.