Make WM_report/WM_timer thread safe (or forbid their usage from non-main thread, and add dedicated API for such usages). #112537

Closed
opened 2023-09-18 18:39:12 +02:00 by Bastien Montagne · 6 comments

Essentially, addressing #105369 was only covering the tip of the iceberg - the whole iceberg being that the whole handling of WM_report (in therefore WM_timer) is totally not thread-safe.

This leads to invalid memory accesses (use after free and the like) when flooding reports from worker threads. Both issues described below have been encountered while trying to export a massive production file (040_0050.anim.blend from r3055) to USD, which generates a lot of reports.

use-after-free of wmTimer:

Lost the backtrace...
But what was happening was a use-after-free error in WM_event_timer_remove(), when calling WM_event_timer_free_data. The issue here was likely that that timer had been freed by the main thread after being tagged, but before the call to WM_event_timer_free_data?

Use-after-free when removing timers

9  __asan::__asan_report_store8                      asan_rtl.cpp                127  0x7fd8e6ce2d1f 
10 BLI_remlink                                       listbase.cc                 142  0x13527f0b     
11 wm_window_timers_delete_removed                   wm_window.cc                2016 0x13cf9229     
12 wm_window_timers_process                          wm_window.cc                1681 0x13cf6f27     
13 wm_window_events_process                          wm_window.cc                1700 0x13cf70f1     
14 WM_main                                           wm.cc                       606  0x13ba1c86     
15 main                                              creator.cc                  590  0xf3c332f      

This is likely caused by another thread adding a report (and therefore a timer) while 'to-be-freed' timers are being deleted from main thread.

TL; DR:

We cannot keep allowing usage of WM_report API from non-main threads as-is. The code is simply not thread-safe.

Proposal

  1. Poison usage of WM_report and WM_timer from non-main thread (add asserts in main entry points).
  2. 'low-level' threaded code is responsible to use their own ReportList data, and ensure it's properly synced with the WM one from main thread only.
  3. Extend the wmJob API to provide easy-to-use handling of reports (each job would get their private reports list, and job update code would be responsible to ensure proper update with the WM one).

Note that the other 'obvious' solution would be to make the whole WM_report/WM_timer code thread-safe. While possible (probably even relatively easy) to do, think it's better to follow the good pattern of each thread working on its own data as much as possible. It makes code having to be aware of synchronization issues smaller (the sync code between worker and main thread, instead of a whole set of API functions). And although this is not a main concern here, it's also usually better for performances (much less locking needed).

Essentially, addressing #105369 was only covering the tip of the iceberg - the whole iceberg being that the whole handling of WM_report (in therefore WM_timer) is totally not thread-safe. This leads to invalid memory accesses (use after free and the like) when flooding reports from worker threads. Both issues described below have been encountered while trying to export a massive production file (`040_0050.anim.blend` from `r3055`) to USD, which generates _a lot_ of reports. ### use-after-free of wmTimer: *Lost the backtrace...* But what was happening was a use-after-free error in `WM_event_timer_remove()`, when calling `WM_event_timer_free_data`. The issue here was likely that that timer had been freed by the main thread after being tagged, but before the call to `WM_event_timer_free_data`? ### Use-after-free when removing timers ``` 9 __asan::__asan_report_store8 asan_rtl.cpp 127 0x7fd8e6ce2d1f 10 BLI_remlink listbase.cc 142 0x13527f0b 11 wm_window_timers_delete_removed wm_window.cc 2016 0x13cf9229 12 wm_window_timers_process wm_window.cc 1681 0x13cf6f27 13 wm_window_events_process wm_window.cc 1700 0x13cf70f1 14 WM_main wm.cc 606 0x13ba1c86 15 main creator.cc 590 0xf3c332f ``` This is likely caused by another thread adding a report (and therefore a timer) while 'to-be-freed' timers are being deleted from main thread. # TL; DR: We cannot keep allowing usage of `WM_report` API from non-main threads as-is. The code is simply _not_ thread-safe. # Proposal 1. Poison usage of `WM_report` and `WM_timer` from non-main thread (add asserts in main entry points). 2. 'low-level' threaded code is responsible to use their own ReportList data, and ensure it's properly synced with the WM one from main thread only. 3. Extend the wmJob API to provide easy-to-use handling of reports (each job would get their private reports list, and job update code would be responsible to ensure proper update with the WM one). Note that the other 'obvious' solution would be to make the whole WM_report/WM_timer code thread-safe. While possible (probably even relatively easy) to do, think it's better to follow the good pattern of each thread working on its own data as much as possible. It makes code having to be aware of synchronization issues smaller (the sync code between worker and main thread, instead of a whole set of API functions). And although this is not a main concern here, it's also usually better for performances (much less locking needed).
Bastien Montagne added the
Type
Design
Module
Core
Interest
User Interface
labels 2023-09-18 18:39:12 +02:00
Author
Owner

@ideasman42 @brecht @Sergey you guys may be interested/have other ideas?

@ideasman42 @brecht @Sergey you guys may be interested/have other ideas?

I didn't really dig into the roots of the code yet, but feeling is: having two ways to report warnings from operations sounds error-prone. If some operation is refactored and moved to a thread it adds an extra point of failure. In reality I do not really see that all cases of reporting will be tested during development/review.

For being a good API you need to have a single point of entry for it, and require passing something that is intrinsicly local to the thread (to the main thread included).

What I am not sure about is that it feels that the WM_report should be less coupled to the timer reset. That'd need to be done anyway for the codepath which allows to report from threads anyway though.

I didn't really dig into the roots of the code yet, but feeling is: having two ways to report warnings from operations sounds error-prone. If some operation is refactored and moved to a thread it adds an extra point of failure. In reality I do not really see that all cases of reporting will be tested during development/review. For being a good API you need to have a single point of entry for it, and require passing something that is intrinsicly local to the thread (to the main thread included). What I am not sure about is that it feels that the `WM_report` should be less coupled to the timer reset. That'd need to be done anyway for the codepath which allows to report from threads anyway though.
Author
Owner

I do not really see the issue with multiple API entry points here? We already have dedicated report API for generic, low-level BKE, WM, operators, modifiers... Don't really see why Jobs could not have their own report system as well.

And wmTimer is not the only issue here, nothing is threadsafe in WM_report or BKE_report either. I don't think having UI/WM-related API threadsafe should even be a goal at all tbh.

I do not really see the issue with multiple API entry points here? We already have dedicated report API for generic, low-level BKE, WM, operators, modifiers... Don't really see why Jobs could not have their own report system as well. And `wmTimer` is not the only issue here, nothing is threadsafe in `WM_report` or `BKE_report` either. I don't think having UI/WM-related API threadsafe should even be a goal at all tbh.

The operators don't really have own way of reporting things: they use BKE_report(), it is just the reports which comes from the op. If the job has its own job->report and BKE_report(job->reports, ...) is used then things are rather clear.

The operators don't really have own way of reporting things: they use `BKE_report()`, it is just the `reports` which comes from the `op`. If the job has its own `job->report` and `BKE_report(job->reports, ...)` is used then things are rather clear.
Author
Owner

@Sergey that's exactly what I want to do - although it will require some 'smartness' to be thread-safe (enough) while not requiring any locking...

@Sergey that's exactly what I want to do - although it will require some 'smartness' to be thread-safe (enough) while not requiring any locking...
Author
Owner

Committed core parts of the change as 9727373821 and c5a408f5a1. USD part is being reviewed in !113883.

Committed core parts of the change as 9727373821 and c5a408f5a1. USD part is being reviewed in !113883.
Blender Bot added the
Status
Archived
label 2023-10-18 13:12:51 +02:00
Sign in to join this conversation.
No Label
Interest
Alembic
Interest
Animation & Rigging
Interest
Asset Browser
Interest
Asset Browser Project Overview
Interest
Audio
Interest
Automated Testing
Interest
Blender Asset Bundle
Interest
BlendFile
Interest
Collada
Interest
Compatibility
Interest
Compositing
Interest
Core
Interest
Cycles
Interest
Dependency Graph
Interest
Development Management
Interest
EEVEE
Interest
EEVEE & Viewport
Interest
Freestyle
Interest
Geometry Nodes
Interest
Grease Pencil
Interest
ID Management
Interest
Images & Movies
Interest
Import Export
Interest
Line Art
Interest
Masking
Interest
Metal
Interest
Modeling
Interest
Modifiers
Interest
Motion Tracking
Interest
Nodes & Physics
Interest
OpenGL
Interest
Overlay
Interest
Overrides
Interest
Performance
Interest
Physics
Interest
Pipeline, Assets & IO
Interest
Platforms, Builds & Tests
Interest
Python API
Interest
Render & Cycles
Interest
Render Pipeline
Interest
Sculpt, Paint & Texture
Interest
Text Editor
Interest
Translations
Interest
Triaging
Interest
Undo
Interest
USD
Interest
User Interface
Interest
UV Editing
Interest
VFX & Video
Interest
Video Sequencer
Interest
Virtual Reality
Interest
Vulkan
Interest
Wayland
Interest
Workbench
Interest: X11
Legacy
Blender 2.8 Project
Legacy
Milestone 1: Basic, Local Asset Browser
Legacy
OpenGL Error
Meta
Good First Issue
Meta
Papercut
Meta
Retrospective
Meta
Security
Module
Animation & Rigging
Module
Core
Module
Development Management
Module
EEVEE & Viewport
Module
Grease Pencil
Module
Modeling
Module
Nodes & Physics
Module
Pipeline, Assets & IO
Module
Platforms, Builds & Tests
Module
Python API
Module
Render & Cycles
Module
Sculpt, Paint & Texture
Module
Triaging
Module
User Interface
Module
VFX & Video
Platform
FreeBSD
Platform
Linux
Platform
macOS
Platform
Windows
Priority
High
Priority
Low
Priority
Normal
Priority
Unbreak Now!
Status
Archived
Status
Confirmed
Status
Duplicate
Status
Needs Info from Developers
Status
Needs Information from User
Status
Needs Triage
Status
Resolved
Type
Bug
Type
Design
Type
Known Issue
Type
Patch
Type
Report
Type
To Do
No Milestone
No project
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: blender/blender#112537
No description provided.