Both the guarded and the lockfree allocator keep track of current and peak memory usage. Until now, even the lockfree allocator used a global atomic variable for the memory usage. When multiple threads use the allocator at the same time, this variable is highly contended, which can result in significant slowdowns, as shown in D16862.

While specific cases could always be optimized by reducing the number of allocations, having this synchronization point in functions used by almost every part of Blender is not great.

The solution is to use thread-local memory counters which are only added together when the memory usage is actually requested. For more details, see the in-code comments and D16862.

Differential Revision: https://developer.blender.org/D16862
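
The thread-local counting idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from D16862: all names (`LocalCounter`, `GlobalState`, `count_alloc`, `count_free`, `memory_in_use`) are hypothetical, and peak-memory tracking is left out to keep the sketch short.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

/* Hypothetical per-thread counter. Only the owning thread modifies it, so the
 * cache line is not contended. It is atomic so other threads can read it. */
struct LocalCounter {
  std::atomic<int64_t> mem_in_use{0};
  LocalCounter();
  ~LocalCounter();
};

/* Hypothetical global registry of all live per-thread counters. */
struct GlobalState {
  std::mutex mutex;
  std::vector<LocalCounter *> locals;
  /* Sum of the counters of threads that have already exited. */
  int64_t mem_in_use_outside_locals = 0;
};

static GlobalState &global_state()
{
  static GlobalState state;
  return state;
}

LocalCounter::LocalCounter()
{
  GlobalState &g = global_state();
  std::lock_guard<std::mutex> lock(g.mutex);
  g.locals.push_back(this);
}

LocalCounter::~LocalCounter()
{
  /* Fold this thread's count into the global remainder when the thread exits. */
  GlobalState &g = global_state();
  std::lock_guard<std::mutex> lock(g.mutex);
  g.mem_in_use_outside_locals += mem_in_use.load(std::memory_order_relaxed);
  auto it = std::find(g.locals.begin(), g.locals.end(), this);
  assert(it != g.locals.end());
  g.locals.erase(it);
}

static LocalCounter &local_counter()
{
  thread_local LocalCounter counter;
  return counter;
}

/* Hot path: touches only the calling thread's counter, no shared variable. */
void count_alloc(size_t size)
{
  local_counter().mem_in_use.fetch_add(int64_t(size), std::memory_order_relaxed);
}

/* A block may be freed on a different thread than it was allocated on, so an
 * individual counter can become negative. Only the sum is meaningful. */
void count_free(size_t size)
{
  local_counter().mem_in_use.fetch_sub(int64_t(size), std::memory_order_relaxed);
}

/* Slow path: the counters are only added together when the usage is requested. */
int64_t memory_in_use()
{
  GlobalState &g = global_state();
  std::lock_guard<std::mutex> lock(g.mutex);
  int64_t total = g.mem_in_use_outside_locals;
  for (const LocalCounter *local : g.locals) {
    total += local->mem_in_use.load(std::memory_order_relaxed);
  }
  return total;
}
```

In this sketch the mutex is taken only when a thread starts or exits and when the total is queried, so allocation and deallocation never synchronize with other threads.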