Fix #129661: Wait for GPU to complete to avoid use-after-free issues. #129686

Merged
Member

In some cases the MTLContext was being destroyed before all GPU work was completed causing the (outstanding) command buffer completion event handler to update a command buffer that had already been freed. This behaviour was introduced by this change which updated the event handler to track the number of outstanding command buffers per context as well as system-wide.

Reproduced the issue with ASAN enabled and confirmed that waiting for the GPU to complete fixes the issue.

Also contains a minor fix for unitiiliased values in MTLAttachments identified by ASAN.

Authored by Apple: James McCarthy"

In some cases the MTLContext was being destroyed before all GPU work was completed causing the (outstanding) command buffer completion event handler to update a command buffer that had already been freed. This behaviour was introduced by [this](https://projects.blender.org/blender/blender/commit/6da42e9c951b) change which updated the event handler to track the number of outstanding command buffers per context as well as system-wide. Reproduced the issue with ASAN enabled and confirmed that waiting for the GPU to complete fixes the issue. Also contains a minor fix for unitiiliased values in MTLAttachments identified by ASAN. Authored by Apple: James McCarthy"
Jason Fielder added 1 commit 2024-11-01 16:02:03 +01:00
Jason Fielder requested review from Clément Foucault 2024-11-01 16:11:28 +01:00
Jason Fielder requested review from Weizhen Huang 2024-11-01 16:12:11 +01:00
Jason Fielder requested review from Sergey Sharybin 2024-11-01 16:12:36 +01:00
Sergey Sharybin reviewed 2024-11-01 16:26:37 +01:00
Sergey Sharybin left a comment
Owner

The PR fixes heap-use-after-free.
However, there is a memory leak somewhere: just repeatedly go into different shading modes in viewport. Not sure if it's a separate issue, or whether it indicates that something is not complete with this PR.

The PR fixes `heap-use-after-free`. However, there is a memory leak somewhere: just repeatedly go into different shading modes in viewport. Not sure if it's a separate issue, or whether it indicates that something is not complete with this PR.
@ -675,0 +675,4 @@
void wait_until_active_command_buffers_complete()
{
while (get_active_command_buffer_count()) {
std::this_thread::sleep_for(std::chrono::milliseconds(1));

I don't personally find this a good or correct paradigm to spread into more and more of Blender code. This is something that you'd typically be implementing using conditional variables, letting the OS to do all the scheduling for you. Having busy loop with such a small wait time results in a high CPU usage.

I don't personally find this a good or correct paradigm to spread into more and more of Blender code. This is something that you'd typically be implementing using conditional variables, letting the OS to do all the scheduling for you. Having busy loop with such a small wait time results in a high CPU usage.
Member

spec doesn't specify what should happen there, could be a busy loop, it could just yield, implementation is free to do whatever it likes, std::this_thread::yield would be guaranteed to yield though. Having some other synchronization here would indeed be better, but if that's not an option yield would be preferable to sleep_for imho

spec doesn't specify what should happen there, could be a busy loop, it could just yield, implementation is free to do whatever it likes, `std::this_thread::yield` would be guaranteed to yield though. Having some other synchronization here would indeed be better, but if that's not an option `yield` would be preferable to `sleep_for` imho

If that's not an option I'd really like to hear explanation why is it so :)

If that's not an option I'd really like to hear explanation why is it so :)

@Sergey maybe this PR is based on main before #129117 was merged.

@Sergey maybe this PR is based on main before #129117 was merged.

@fclem Perhaps. I've tested the patch again and it works fine!

The confusing part in the PR you've mentioned is that it explicitly mentions leaks identified by the instruments, so I thought those are about something much more hidden.

It would still be good to avoid the loops and use conditional variables. Or, at least, do explicit yield(). But the former is what we use everywhere in similar situations. So it needs to be something really compelling to use the loop+yield.

@fclem Perhaps. I've tested the patch again and it works fine! The confusing part in the PR you've mentioned is that it explicitly mentions leaks identified by the instruments, so I thought those are about something much more hidden. It would still be good to avoid the loops and use conditional variables. Or, at least, do explicit `yield()`. But the former is what we use everywhere in similar situations. So it needs to be something really compelling to use the loop+yield.

@Sergey:
"The confusing part in the PR you've mentioned is that it explicitly mentions leaks identified by the instruments, so I thought those are about something much more hidden."
You are correct, Blender's internal memory leak tracking does not identify Obj-C retain/release leaks hence the use of the Xcode tools.

"It would still be good to avoid the loops and use conditional variables. Or, at least, do explicit yield(). But the former is what we use everywhere in similar situations. So it needs to be something really compelling to use the loop+yield."
This particular case should not cause any performance issue since MTLContext teardown is an infrequent operation. However I understand your point and I can change this to use a preferred approach, I was simply copying the existing GPU block/wait code I found in the existing source code :)
If you have a preferred template for block waiting on HW elsewhere in the Blender source code you can point me at I'm happy to use that approach instead.

@Sergey: _"The confusing part in the PR you've mentioned is that it explicitly mentions leaks identified by the instruments, so I thought those are about something much more hidden."_ You are correct, Blender's internal memory leak tracking does not identify Obj-C retain/release leaks hence the use of the Xcode tools. _"It would still be good to avoid the loops and use conditional variables. Or, at least, do explicit yield(). But the former is what we use everywhere in similar situations. So it needs to be something really compelling to use the loop+yield."_ This particular case should not cause any performance issue since MTLContext teardown is an infrequent operation. However I understand your point and I can change this to use a preferred approach, I was simply copying the existing GPU block/wait code I found in the existing source code :) If you have a preferred template for block waiting on HW elsewhere in the Blender source code you can point me at I'm happy to use that approach instead.

You are correct, Blender's internal memory leak tracking does not identify Obj-C retain/release leaks hence the use of the Xcode tools.

The leaks I've seen were reported by the Blender's memory tracking (when you run Blender from the terminal it was reporting multiple unfreed memory blocks). So probably was some other leak. In any case, I did not see any detected unfreed data-blocks when was testing the code today.

I was simply copying the existing GPU block/wait code I found in the existing source code :)

We need to have some healthy amount of skepticism about existing code. The fact that it is there does not mean the code is the best ;)

For the simplicity of moving forward with this PR and fixing issue that affects other developers adding std::this_thread::yield() would be acceptable to me.

For the existing patterns we use condition variable. You can see examples in Cycles (and manyre), draw manager and possibly more.

> You are correct, Blender's internal memory leak tracking does not identify Obj-C retain/release leaks hence the use of the Xcode tools. The leaks I've seen were reported by the Blender's memory tracking (when you run Blender from the terminal it was reporting multiple unfreed memory blocks). So probably was some other leak. In any case, I did not see any detected unfreed data-blocks when was testing the code today. > I was simply copying the existing GPU block/wait code I found in the existing source code :) We need to have some healthy amount of skepticism about existing code. The fact that it is there does not mean the code is the best ;) For the simplicity of moving forward with this PR and fixing issue that affects other developers adding `std::this_thread::yield()` would be acceptable to me. For the existing patterns we use [condition variable](https://en.cppreference.com/w/cpp/thread/condition_variable). You can see examples in [Cycles](https://projects.blender.org/blender/blender/src/branch/main/intern/cycles/integrator/path_trace.cpp#L840) (and manyre), [draw manager](https://projects.blender.org/blender/blender/src/commit/b4f59c5348fe4a25cf5fdf24fbc43e458cd74535/source/blender/draw/intern/draw_manager_shader.cc#L162) and possibly more.
Jason Fielder added 1 commit 2024-11-12 15:57:01 +01:00
Sergey Sharybin approved these changes 2024-11-12 16:27:18 +01:00
Sergey Sharybin left a comment
Owner

Explicit yield() is definitely better than sleep for 1 millisecond. Solving the immediate crash is good, but we should revisit those waits in a foreseeable future.

@Jason-Fielder Please separate merge commits from the functional changes commits. It makes it easier to follow incremental changes to the PR.

Explicit `yield()` is definitely better than sleep for 1 millisecond. Solving the immediate crash is good, but we should revisit those waits in a foreseeable future. @Jason-Fielder Please separate merge commits from the functional changes commits. It makes it easier to follow incremental changes to the PR.

@Jason-Fielder Please separate merge commits from the functional changes commits. It makes it easier to follow incremental changes to the PR.

Ah, apologies. Will do.

> @Jason-Fielder Please separate merge commits from the functional changes commits. It makes it easier to follow incremental changes to the PR. Ah, apologies. Will do.
Clément Foucault approved these changes 2024-11-12 16:48:24 +01:00
Clément Foucault merged commit 4fc2e1c842 into main 2024-11-12 16:49:05 +01:00
Sign in to join this conversation.
No Label
Interest
Alembic
Interest
Animation & Rigging
Interest
Asset System
Interest
Audio
Interest
Automated Testing
Interest
Blender Asset Bundle
Interest
BlendFile
Interest
Code Documentation
Interest
Collada
Interest
Compatibility
Interest
Compositing
Interest
Core
Interest
Cycles
Interest
Dependency Graph
Interest
Development Management
Interest
EEVEE
Interest
Freestyle
Interest
Geometry Nodes
Interest
Grease Pencil
Interest
ID Management
Interest
Images & Movies
Interest
Import Export
Interest
Line Art
Interest
Masking
Interest
Metal
Interest
Modeling
Interest
Modifiers
Interest
Motion Tracking
Interest
Nodes & Physics
Interest
OpenGL
Interest
Overlay
Interest
Overrides
Interest
Performance
Interest
Physics
Interest
Pipeline & IO
Interest
Platforms, Builds & Tests
Interest
Python API
Interest
Render & Cycles
Interest
Render Pipeline
Interest
Sculpt, Paint & Texture
Interest
Text Editor
Interest
Translations
Interest
Triaging
Interest
Undo
Interest
USD
Interest
User Interface
Interest
UV Editing
Interest
VFX & Video
Interest
Video Sequencer
Interest
Viewport & EEVEE
Interest
Virtual Reality
Interest
Vulkan
Interest
Wayland
Interest
Workbench
Interest: X11
Legacy
Asset Browser Project
Legacy
Blender 2.8 Project
Legacy
Milestone 1: Basic, Local Asset Browser
Legacy
OpenGL Error
Meta
Good First Issue
Meta
Papercut
Meta
Retrospective
Meta
Security
Module
Animation & Rigging
Module
Asset System
Module
Core
Module
Development Management
Module
Grease Pencil
Module
Modeling
Module
Nodes & Physics
Module
Pipeline & IO
Module
Platforms, Builds & Tests
Module
Python API
Module
Render & Cycles
Module
Sculpt, Paint & Texture
Module
Triaging
Module
User Interface
Module
VFX & Video
Module
Viewport & EEVEE
Platform
FreeBSD
Platform
Linux
Platform
macOS
Platform
Windows
Severity
High
Severity
Low
Severity
Normal
Severity
Unbreak Now!
Status
Archived
Status
Confirmed
Status
Duplicate
Status
Needs Info from Developers
Status
Needs Information from User
Status
Needs Triage
Status
Resolved
Type
Bug
Type
Design
Type
Known Issue
Type
Patch
Type
Report
Type
To Do
No Milestone
No project
No Assignees
5 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: blender/blender#129686
No description provided.