Metal: Resolve race condition in memory manager #105254

Merged
Jeroen Bakker merged 8 commits from Jason-Fielder/blender:MetalSafeFreeList_Fix_Rel3_5 into blender-v3.5-release 2023-03-16 08:25:26 +01:00
Member

Fix a race condition that occurs when several competing threads insert Metal
buffers into the MTLSafeFreeList simultaneously while a new list
chunk is being created.

Also raise the size limit of an MTLSafeFreeListChunk to optimize
for interactivity when releasing large amounts of memory at once.

Authored by Apple: Michael Parkin-White

Related to #96261

Jason Fielder added 1 commit 2023-02-27 16:14:38 +01:00
d67aeb5e1f Metal: Resolve race condition in memory manager.
Fix race condition if several competing threads are inserting Metal
buffers into the MTLSafeFreeList simultaneously while a new list
chunk is being created.

Also raise the limit for an MTLSafeFreeListChunk size to optimise
for interactivity when releasing lots of memory simultaneously.

Authored by Apple: Michael Parkin-White

Related to #96261
Jason Fielder requested review from Clément Foucault 2023-02-27 16:14:48 +01:00
Sergey Sharybin requested changes 2023-02-27 17:09:07 +01:00
Sergey Sharybin left a comment
Owner

The usage of the condition variable seems a bit strange. If this is the intended way of using it, a comment is needed explaining why it is valid.
From just reading the code it is not obvious how the condition variable ever gets woken up.

@@ -419,0 +430,4 @@
* so if the resource is not ready, we can hold until it is available. */
std::condition_variable_any wait_cond;
std::unique_lock<std::recursive_mutex> cond_lock(lock_);
wait_cond.wait(cond_lock, [&] { return (next_list = next_.load()) != nullptr; });

Not really sure how this can ever finish: there is no notification sent to the condition variable.

First-time contributor

Thanks for highlighting this. This was likely working due to an oversight, as it would appear wait(..) can return early in unintended situations, but as the condition is satisfied by the time this happens, execution carries on.

Though yes, I'll look at this in more detail and refactor with an explicit notification once the condition is satisfied. However, I am certainly open to a more appropriate, different approach if needed, as this setup does feel like it may add more complexity than is needed for this particular case.

While not at all neat, the original conditional lock, used to prevent the contending threads from spinning:

while (!condition) {
  lock.lock();
  lock.unlock();
}

worked correctly without incurring additional overhead, so if there is a better synchronization utility that covers this simplified use case, that could be a better solution.

The canonical way of doing such synchronization would involve the following simple steps from the current state of the code:

  • Move the std::condition_variable_any wait_cond; definition next to the lock_, so that the same condition variable is available from all threads.
  • After the resource has been created in the if (has_created_next_chunk == 0) { branch, and after the lock_.unlock();, call wait_cond_.notify_all().

You can see an example at the link below (condition_variable_any is essentially the same as condition_variable, it just supports more types of mutexes):
https://en.cppreference.com/w/cpp/thread/condition_variable
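
A minimal, self-contained sketch of this pattern, for illustration only (not the actual MTLSafeFreeList code; the Resource type, the global names, and the helper functions are made up for the example). The condition variable lives next to the mutex it pairs with, the creating thread notifies after publishing the resource, and waiters use a predicate so a thread that arrives late never blocks:

#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

struct Resource {
  int value;
};

/* The condition variable sits next to the mutex it pairs with, so every
 * thread waits on and notifies the same object. */
static std::recursive_mutex mutex_;
static std::condition_variable_any wait_cond_;
static std::atomic<Resource *> next_{nullptr};

/* Creator thread: publish the resource under the lock, then wake all waiters
 * after releasing it (mirroring the "after lock_.unlock()" step above). */
static void create_resource()
{
  {
    std::unique_lock<std::recursive_mutex> lock(mutex_);
    next_.store(new Resource{42});
  }
  wait_cond_.notify_all();
}

/* Waiter thread: the predicate is checked before blocking, so a thread that
 * arrives after the resource already exists returns immediately. */
static Resource *wait_for_resource()
{
  std::unique_lock<std::recursive_mutex> lock(mutex_);
  Resource *result = nullptr;
  wait_cond_.wait(lock, [&] { return (result = next_.load()) != nullptr; });
  return result;
}

int main()
{
  std::thread waiter([] { (void)wait_for_resource(); });
  std::thread creator(create_resource);
  waiter.join();
  creator.join();
  delete next_.load();
}

Notifying after the unlock is an optional optimization: notifying while still holding the mutex is also correct, the woken thread would simply block on the mutex for a moment before proceeding.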

First-time contributor

Thanks for the info on this, that's helpful!

It would appear that a caveat with using condition_variable, or synchronization primitives following this style, is that notify_all only applies to threads which are actively stalled within wait at the time it is called, whereas our desired intent is to stop any thread from waiting if the condition is already set.

The notification could result in a situation where the notify happens before all threads have reached the wait. If a thread hits the wait afterwards, it will still stall and be subject to the same issues.

As in this particular case the object being waited on is only created once, I'll give a solution using std::future a try, as that feels more appropriate here.
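
A standalone sketch of how std::shared_future could cover this case (illustrative only; the Chunk type and the variable names are made up, and this is not the actual patch): the creating thread fulfils a promise exactly once, and because the value is latched in the shared state, a thread that asks for it after creation returns immediately rather than depending on a notification it might have missed.

#include <future>
#include <thread>
#include <vector>

struct Chunk {
  /* Stands in for the next MTLSafeFreeList chunk. */
};

int main()
{
  /* Fulfilled exactly once by whichever thread creates the chunk. */
  std::promise<Chunk *> chunk_promise;
  /* A shared_future can be copied; each copy can be waited on from its own thread. */
  std::shared_future<Chunk *> chunk_future = chunk_promise.get_future().share();

  /* Waiting threads: get() blocks until the value is set, and a thread arriving
   * after creation returns immediately, so no notification can be "missed". */
  std::vector<std::thread> waiters;
  for (int i = 0; i < 4; i++) {
    waiters.emplace_back([fut = chunk_future] { (void)fut.get(); });
  }

  /* The creating thread publishes the chunk exactly once. */
  std::thread creator([&chunk_promise] { chunk_promise.set_value(new Chunk()); });

  for (std::thread &t : waiters) {
    t.join();
  }
  creator.join();
  delete chunk_future.get();
}

(The follow-up commit below takes this direction, replacing the condition variable primitive with a shared_future implementation.)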


OK, I am slowly getting to understand the actual algorithm and structure used here (before, I was just looking at it from the synchronization-primitive usage point of view).

I am not really sure why we need anything more than just a double-checked lock here. The amount of waiting seems to be about the same as with any change proposed here, but the code is much simpler on many levels:

void MTLSafeFreeList::insert_buffer(gpu::MTLBuffer *buffer)
{
  BLI_assert(in_free_queue_ == false);

  /* Lockless list insert. */
  uint insert_index = current_list_index_++;

  /* If the current MTLSafeFreeList size is exceeded, we ripple down the linked-list chain and
   * insert the buffer into the next available chunk. */
  if (insert_index >= MTLSafeFreeList::MAX_NUM_BUFFERS_) {
    MTLSafeFreeList *next_list = next_.load();

    if (!next_list) {
      std::unique_lock lock(lock_);

      /* Double-checked locking: re-test under the mutex, since another thread
       * may have created the next chunk while we were waiting for the lock. */
      next_list = next_.load();
      if (!next_list) {
        next_list = new MTLSafeFreeList();
        next_.store(next_list);
      }
    }

    BLI_assert(next_list);
    next_list->insert_buffer(buffer);

    /* Clamp index to chunk limit if overflowing. */
    current_list_index_ = MTLSafeFreeList::MAX_NUM_BUFFERS_;
    return;
  }

  safe_free_pool_[insert_index] = buffer;
}

No need to introduce new primitives, and you can get rid of has_next_pool_.

P.S. The lock_ should actually be called mutex_.

First-time contributor

Thanks for the feedback! Yeah, this looks far better; I had definitely started overcomplicating things once I went down the rabbit hole.

It seems this version more or less results in the same locking pattern as the first version, as it still locks if next_list isn't available and waits for the thread that has already entered the block.

But I agree that this is a cleaner approach. I will submit with this proposed change, remove the has_next_pool_ counter, and refactor the other parts of the code which used it.

I will also change the clamp on current_list_index_ to clamp against INTMAX. We only need to clamp in the very rare case that this overflows, as the value just needs to stay >= MAX_NUM_BUFFERS_. However, having the actual index counter in the root list would also provide a useful bit of data: the total number of buffers in the entire chunked list.

Jason Fielder added 1 commit 2023-03-01 18:12:05 +01:00
aa37c22016 Addressed feedback. Replace condition_variable
synchronization primitive with shared_future implementation.
Jason Fielder added 3 commits 2023-03-13 13:34:00 +01:00
Sergey Sharybin reviewed 2023-03-13 17:43:41 +01:00
Sergey Sharybin left a comment
Owner

Thanks for the update. Looks much cleaner and easier to understand.

I think there are a couple of include statements which are not needed anymore.

Also, did you consider updating the comment for insert_buffer in the header? It currently states "Performs a lockless list insert." I am not sure that is the important point of the API here. Maybe something like "Can be used from multiple threads. Performs insertion with the least amount of threading synchronization"?

@@ -15,6 +15,8 @@
#include <Metal/Metal.h>
#include <QuartzCore/QuartzCore.h>
#include <future>

Is this include still needed for anything?

@@ -8,6 +8,8 @@
#include "mtl_debug.hh"
#include "mtl_memory.hh"
#include <condition_variable>

Think this is also not needed anymore.

Hans Goudey changed title from Metal: Resolve race condition in memory manager. to Metal: Resolve race condition in memory manager 2023-03-13 19:42:43 +01:00
Jason Fielder added 1 commit 2023-03-14 15:21:32 +01:00
Jason Fielder added 1 commit 2023-03-14 15:32:20 +01:00
fac099cd82 Remove unnecessary header includes and adjust insert_buffer
documentation to better represent current functionality.
Michael Parkin-White added 1 commit 2023-03-14 20:47:47 +01:00
Sergey Sharybin approved these changes 2023-03-15 10:02:02 +01:00
Jeroen Bakker requested review from Jeroen Bakker 2023-03-15 10:33:17 +01:00
Jeroen Bakker added the
Interest
Metal
label 2023-03-16 08:20:40 +01:00
Jeroen Bakker added this to the 3.5 milestone 2023-03-16 08:20:45 +01:00
Jeroen Bakker added this to the EEVEE & Viewport project 2023-03-16 08:20:49 +01:00
Jeroen Bakker approved these changes 2023-03-16 08:24:18 +01:00
Jeroen Bakker merged commit 7bdd82eca0 into blender-v3.5-release 2023-03-16 08:25:26 +01:00
Reference: blender/blender#105254