Regression: Crash when user opens directory instead of file on USD import #105160
Operating system: macOS 13.3
Graphics card: M1 Max
1e6ed77896, main, 2023-02-13, debug build
Worked: 3.5, 3.4.x release builds
Short description of error
I am consistently hitting a crash in Blender on macOS if I (originally accidentally, and now intentionally) misuse the USD import dialog by attempting to import a directory instead of a .usd file. However, it seems I am only able to reproduce the issue with debug builds of Blender.
Exact steps for others to reproduce the error
- start Blender
- File->Import, Universal Scene Description
- browse to any directory
- click the Import USD button (without a file selected)
Analysis of bug
There is a specific sequence of events that appears to inevitably and logically lead to the crash:
At conclusion of the failed USD import, the main thread is calling
wm_job_end(), which is being executed in the context of iterating through a linked list of timers in
LISTBASE_FOREACH_MUTABLE(wmTimer *, wt, &wm->timers). As part of this linked list iteration, it stores the next value to evaluate in
WM_report()to report a helpful error message to the user.
As part of
WM_report_banner_show(), the report timer is reset and removed.
BLI_remlink()to remove the report timer from
The timer that is removed is in fact the same timer that was already assigned to
wt_iter_nextin step 1.
The memory for that timer is deallocated via
Very soon, still in the same main thread, we arrive back at the
LISTBASE_FOREACH_MUTABLEloop, where the now deallocated
wt_iter_nexthas an invalid value and is assigned as the next
wtto evaluate. This macro attempts to dereference
wt->next, which causes an access violation, and Blender crashes.
It is not yet clear to me how the release builds seem to get away with this bug without crashing, but my expectation is that this is likely due to the debug-build deallocator overwriting memory that the release-build deallocator does not touch. In that case the release build is just getting lucky.
See attached screenshot of crash. See also attached screenshot showing the problematic bit of code just prior to the crash.
Thanks for the report and analysis, I can confirm the regression.
Something that might also help is detecting which commit introduced this problem.
I'll try to bisect.
This seems to be one of the cases that I can't replicate in a local build. So I can't bisect.
Moving to Core module, since this seems to be an issue with the WM Jobs handling itself.
This issue is already fairly bad, but it also raises another one: the whole
WM_report API is called from the non-main worker thread, when it's absolutely not threadsafe at all.
In general, unless explicitly documented as such, the WM APIs should not be considered threadsafe, and therefore only ever called from the main thread.
Proper fix for both issues I think is to have own
ReportList as part of the job's data, and directly using BKE_report API to populate it (with proper locking if needed).
Then calls to
WM_report & co should only ever happen from the main thread, and outside of any wmTimer handling codepath, i.e. once the job itself is finished.
If we want to generate user-visible reports while jobs are running, then I think the wmJobs system needs some update (something like each job having their own
ReportList, which can then be processed by the WM management code in a safe way, i.e. by accumulating all reports somewhere, and showing them in the UI outside of the Timers handling loop).
(I will update the documentation regarding usage of
Looking at existing code, a lot of other jobs are using the same callings to
WM_report in their code... Wonder how such an issue has remained unknown for so long :|
I would consider this issue as a blocking one for 3.5 release.
Will create a design task asap regarding how to fix this set of entangled issues (it involves at least
NOTE: Even though the underlying issue leading to the crash is going to be fixed soon, think this kind of error (invalid path) should also be detected by the operator itself, instead of starting a job and immediately failing...
This issue is fixed by
I have created another report for the USD project, to not start import job with 'obviously' invalid input data (like invalid file path and such, #105404).
No due date set.
No dependencies set.
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?