BLI: speedup memory bandwidth bound tasks by reducing threading #118939

Merged
Jacques Lucke merged 14 commits from JacquesLucke/blender:limit-threading-for-bandwidth-bound-tasks into main 2024-03-19 18:24:07 +01:00
Member

This improves performance by reducing the number of threads used for tasks that require high memory bandwidth.

This works because the underlying hardware has a certain maximum memory bandwidth. If that is already used up by a few threads, any additional threads that want to access a lot of memory just cause more contention, which actually slows things down. By reducing the number of threads that can perform such tasks, the remaining threads are also not tied up with work that they can't do efficiently. It's best if there is enough scheduled work so that those threads can run more compute-intensive tasks instead.

To use this new functionality, one has to put the parallel code in question into a `threading::memory_bandwidth_bound_task(...)` block. Additionally, one has to provide a (very) rough approximation of how many bytes are accessed. If the number is low, the number of threads shouldn't be reduced, because it's likely that all touched memory fits into the L3 cache, which generally has a much higher bandwidth than main memory.
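As a minimal sketch of the idea (not the actual Blender implementation; the class name, the placement of the 8-thread constant, and the L3-size threshold are all assumptions for illustration), limiting concurrent bandwidth-bound work while letting small working sets pass through could look like this:

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>

/* Hypothetical sketch: allow at most a fixed number of callers to run
 * bandwidth-bound sections concurrently, but skip the limit entirely when the
 * touched memory likely fits into the L3 cache. */
class BandwidthLimiter {
  static constexpr int max_bandwidth_threads = 8;
  static constexpr uint64_t l3_cache_approx_bytes = 32 * 1024 * 1024;

  std::mutex mutex_;
  std::condition_variable cv_;
  int active_ = 0;

 public:
  template<typename Fn> void run(const uint64_t approximate_bytes, Fn &&fn)
  {
    if (approximate_bytes < l3_cache_approx_bytes) {
      /* Small working set: likely cache-resident, no need to limit threads. */
      fn();
      return;
    }
    {
      /* Block until fewer than `max_bandwidth_threads` callers are active. */
      std::unique_lock lock{mutex_};
      cv_.wait(lock, [&] { return active_ < max_bandwidth_threads; });
      active_++;
    }
    fn();
    {
      std::lock_guard lock{mutex_};
      active_--;
    }
    cv_.notify_one();
  }
};
```

The real API wraps parallel code rather than individual callers and integrates with the task scheduler, but the two-path structure (cheap pass-through for small byte counts, limited concurrency for large ones) matches the behavior described above.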

The exact number of threads that can perform bandwidth-bound tasks at the same time without degrading performance is highly context- and hardware-dependent. It's also not really possible to measure reliably, because it depends on so many static and dynamic factors. The thread count is now hardcoded to 8; it seems that this many threads can easily max out the bandwidth capacity.

With this technique I can measure surprisingly good performance improvements:

  • Generating a 3000x3000 grid: 133ms -> 103ms.
  • Generating a mesh line with 100'000'000 vertices: 212ms -> 189ms.
  • Realize mesh instances resulting in ~27'000'000 vertices: 460ms -> 305ms.

In all of these cases, only 8 instead of 24 threads are used. The remaining threads are idle in these cases, but they could do other work if available.


Running the following code gives a rough idea of how many threads are necessary to max out the memory bandwidth, and when performance starts to degrade again.

```cpp
const int64_t size = (uint64_t(1) << 30); /* 1 GiB per buffer. */
Vector<void *> buffers;
for (const int i : IndexRange(20)) {
  char *buffer = (char *)malloc(size);
  buffers.append(buffer);
  /* Restrict the arena to `i` worker threads and time a parallel memset. */
  tbb::task_arena arena{i};
  SCOPED_TIMER(std::to_string(i));
  arena.execute([&]() {
    threading::parallel_for(IndexRange(size), 10000, [&](const IndexRange range) {
      memset(buffer + range.start(), 10, range.size());
    });
  });
}
for (void *buffer : buffers) {
  free(buffer);
}
```

These are my results. Note how it first gets faster the more threads there are, but then gets slower again.

```
Timer '0' took 161.45 ms (same as all threads)
Timer '1' took 119.34 ms
Timer '2' took 88.81 ms
Timer '3' took 86.61 ms
Timer '4' took 85.07 ms
Timer '5' took 86.64 ms
Timer '6' took 87.13 ms
Timer '7' took 88.20 ms
Timer '8' took 91.52 ms
Timer '9' took 123.77 ms
Timer '10' took 123.21 ms
Timer '11' took 114.44 ms
Timer '12' took 114.31 ms
Timer '13' took 139.37 ms
Timer '14' took 189.55 ms
Timer '15' took 194.06 ms
Timer '16' took 152.77 ms
Timer '17' took 191.29 ms
Timer '18' took 196.41 ms
Timer '19' took 186.28 ms
```

Using a profiler also shows that there is an issue. As long as there are only a few threads, the CPU cores are utilized well, but as the number of threads increases, gaps appear in the profile. It looks like the CPU is doing some kind of time-sharing of the memory bus.

![image](/attachments/da9e04e6-5462-4074-834b-5573d5cf7d69)

Jacques Lucke added 3 commits 2024-03-01 00:04:49 +01:00
Jacques Lucke added 6 commits 2024-03-01 13:36:13 +01:00
60a6d6e4d5
speedup realize instances
Author
Member

@blender-bot build

Jacques Lucke added 1 commit 2024-03-01 14:27:49 +01:00
Jacques Lucke changed title from WIP: BLI: support reduced multi-threading for memory bandwidth bound tasks to BLI: speedup memory bandwidth bound tasks by reducing threading 2024-03-01 15:01:29 +01:00
Jacques Lucke requested review from Hans Goudey 2024-03-01 15:03:07 +01:00
Jacques Lucke added 1 commit 2024-03-17 19:15:22 +01:00
007cd3b342
Merge branch 'main' into limit-threading-for-bandwidth-bound-tasks
Author
Member

@blender-bot build

Jacques Lucke requested review from Sergey Sharybin 2024-03-18 14:40:24 +01:00

It does make sense to limit threading for such operations. The overall code looks fine, but there are some non-code-related notes/questions.

The number of active threads sounds quite arbitrary, as you've mentioned. However, it is above the minimum required 4 cores (I don't think we require HT). I'm not sure if TBB will limit the number of workers for the arena in this case. It would be good if we don't introduce extra overhead on lower-end hardware.

And another, possibly related thing: does `blender -t 1` work as expected here, or does it force some parts of the algorithm to be threaded?

Jacques Lucke added 3 commits 2024-03-18 17:17:37 +01:00
Author
Member

> And another, possibly related thing: does `blender -t 1` work as expected here, or does it force some parts of the algorithm to be threaded?

It didn't, but with the code I just added it's more obvious.


@JacquesLucke I think the change you did covers both cases of `-t 1` and possible CPUs with fewer than 8 cores?

Author
Member

Yes, right. There is no extra overhead when Blender uses just 8 or fewer threads now (except for the function-call overhead, which is negligible here).

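The behavior discussed here can be sketched in one line (a hypothetical helper, not the actual code from the patch): the effective limit should never exceed the number of threads Blender is allowed to use, so runs with `-t 1` or on CPUs with fewer than 8 cores see no change.

```cpp
#include <algorithm>

/* Hypothetical sketch: clamp the hardcoded bandwidth-bound thread limit to the
 * number of threads Blender is configured to use overall. */
static int effective_bandwidth_thread_limit(const int blender_thread_count)
{
  constexpr int max_bandwidth_threads = 8;
  return std::min(max_bandwidth_threads, blender_thread_count);
}
```

With this clamp, `blender -t 1` yields a limit of 1 (fully serial), a 4-core machine keeps all 4 threads, and only machines with more than 8 available threads are actually restricted.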
Sergey Sharybin approved these changes 2024-03-19 14:30:26 +01:00
Sergey Sharybin left a comment
Owner

Lovely! I think we can go ahead with this change.

It is kind of interesting what the best constants would be for multi-socket configurations, but that is not something that prevents us from landing the current state of the PR.

Author
Member

> It is kind of interesting what the best constants would be for multi-socket configurations

True, though that's not something I can test, unfortunately.

Jacques Lucke merged commit b99c1abc3a into main 2024-03-19 18:24:07 +01:00
Jacques Lucke deleted branch limit-threading-for-bandwidth-bound-tasks 2024-03-19 18:24:10 +01:00
Reference: blender/blender#118939