BLI: speedup memory bandwidth bound tasks by reducing threading #118939

Merged
Jacques Lucke merged 14 commits from JacquesLucke/blender:limit-threading-for-bandwidth-bound-tasks into main 2024-03-19 18:24:07 +01:00
Member

This improves performance by reducing the number of threads used for tasks that require high memory bandwidth.

This works because the underlying hardware has a certain maximum memory bandwidth. If that is already used up by a few threads, any additional threads that want to access a lot of memory just cause more contention, which actually slows things down. By reducing the number of threads that can perform such tasks, the remaining threads are also not tied up with work that they can't do efficiently. It's best if there is enough scheduled work so that those threads can run more compute-intensive tasks instead.

To use this new functionality, one has to put the parallel code in question into a `threading::memory_bandwidth_bound_task(...)` block. Additionally, one has to provide a (very) rough approximation of how many bytes are accessed. If the number is low, the number of threads shouldn't be reduced, because it's likely that all touched memory fits into the L3 cache, which generally has a much higher bandwidth than main memory.
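As a minimal sketch of the idea (not the actual Blender implementation; the class name, the placement of the 8-thread constant, and the L3-size threshold are all assumptions for illustration), limiting concurrent bandwidth-bound work while letting small working sets pass through could look like this:

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>

/* Hypothetical sketch: allow at most a fixed number of callers to run
 * bandwidth-bound sections concurrently, but skip the limit entirely when the
 * touched memory likely fits into the L3 cache. */
class BandwidthLimiter {
  static constexpr int max_bandwidth_threads = 8;
  static constexpr uint64_t l3_cache_approx_bytes = 32 * 1024 * 1024;

  std::mutex mutex_;
  std::condition_variable cv_;
  int active_ = 0;

 public:
  template<typename Fn> void run(const uint64_t approximate_bytes, Fn &&fn)
  {
    if (approximate_bytes < l3_cache_approx_bytes) {
      /* Small working set: likely cache-resident, no need to limit threads. */
      fn();
      return;
    }
    {
      /* Block until fewer than `max_bandwidth_threads` callers are active. */
      std::unique_lock lock{mutex_};
      cv_.wait(lock, [&] { return active_ < max_bandwidth_threads; });
      active_++;
    }
    fn();
    {
      std::lock_guard lock{mutex_};
      active_--;
    }
    cv_.notify_one();
  }
};
```

The real API wraps parallel code rather than individual callers and integrates with the task scheduler, but the two-path structure (cheap pass-through for small byte counts, limited concurrency for large ones) matches the behavior described above.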

The exact number of threads that can perform bandwidth-bound tasks at the same time without degrading performance is highly context- and hardware-dependent. It's also not really possible to measure reliably, because it depends on so many static and dynamic factors. The thread count is now hardcoded to 8; it seems that this many threads can easily max out the bandwidth capacity.

With this technique I can measure surprisingly good performance improvements:

  • Generating a 3000x3000 grid: 133ms -> 103ms.
  • Generating a mesh line with 100'000'000 vertices: 212ms -> 189ms.
  • Realize mesh instances resulting in ~27'000'000 vertices: 460ms -> 305ms.

In all of these cases, only 8 instead of 24 threads are used. The remaining threads are idle in these cases, but they could do other work if available.


Running the following code gives a rough idea of how many threads are necessary to max out the memory bandwidth, and when performance starts to degrade again.

```cpp
const int64_t size = (uint64_t(1) << 30); /* 1 GiB per buffer. */
Vector<void *> buffers;
for (const int i : IndexRange(20)) {
  char *buffer = (char *)malloc(size);
  buffers.append(buffer);
  /* Restrict the arena to `i` worker threads and time a parallel memset. */
  tbb::task_arena arena{i};
  SCOPED_TIMER(std::to_string(i));
  arena.execute([&]() {
    threading::parallel_for(IndexRange(size), 10000, [&](const IndexRange range) {
      memset(buffer + range.start(), 10, range.size());
    });
  });
}
for (void *buffer : buffers) {
  free(buffer);
}
```

These are my results. Note how it first gets faster the more threads there are, but then gets slower again.

```
Timer '0' took 161.45 ms (same as all threads)
Timer '1' took 119.34 ms
Timer '2' took 88.81 ms
Timer '3' took 86.61 ms
Timer '4' took 85.07 ms
Timer '5' took 86.64 ms
Timer '6' took 87.13 ms
Timer '7' took 88.20 ms
Timer '8' took 91.52 ms
Timer '9' took 123.77 ms
Timer '10' took 123.21 ms
Timer '11' took 114.44 ms
Timer '12' took 114.31 ms
Timer '13' took 139.37 ms
Timer '14' took 189.55 ms
Timer '15' took 194.06 ms
Timer '16' took 152.77 ms
Timer '17' took 191.29 ms
Timer '18' took 196.41 ms
Timer '19' took 186.28 ms
```

Using a profiler also shows that there is an issue. As long as there are only a few threads, the CPU cores are utilized well, but as the number of threads increases, gaps appear in the profile. It looks like the CPU is doing some kind of time-sharing of the memory bus.

![image](/attachments/da9e04e6-5462-4074-834b-5573d5cf7d69)

Jacques Lucke added 3 commits 2024-03-01 00:04:49 +01:00
Jacques Lucke added 6 commits 2024-03-01 13:36:13 +01:00
60a6d6e4d5
speedup realize instances
Author
Member

@blender-bot build

Jacques Lucke added 1 commit 2024-03-01 14:27:49 +01:00
Jacques Lucke changed title from WIP: BLI: support reduced multi-threading for memory bandwidth bound tasks to BLI: speedup memory bandwidth bound tasks by reducing threading 2024-03-01 15:01:29 +01:00
Jacques Lucke requested review from Hans Goudey 2024-03-01 15:03:07 +01:00
Jacques Lucke added 1 commit 2024-03-17 19:15:22 +01:00
007cd3b342
Merge branch 'main' into limit-threading-for-bandwidth-bound-tasks
Author
Member

@blender-bot build

Jacques Lucke requested review from Sergey Sharybin 2024-03-18 14:40:24 +01:00

It does make sense to limit threading for such operations. The overall code looks fine, but there are some non-code-related notes/questions.

The number of active threads sounds quite arbitrary, as you've mentioned. However, it is above the minimum required 4 cores (I don't think we require HT). I'm not sure if TBB will limit the number of workers for the arena in this case. It would be good if we don't introduce extra overhead on lower-end hardware.

And another, possibly related thing: does `blender -t 1` work as expected here, or does it force some parts of the algorithm to be threaded?

Jacques Lucke added 3 commits 2024-03-18 17:17:37 +01:00
Author
Member

> And another, possibly related thing: does `blender -t 1` work as expected here, or does it force some parts of the algorithm to be threaded?

It didn't, but with the code I just added it's more obvious.


@JacquesLucke I think the change you did covers both cases of `-t 1` and possible CPUs with fewer than 8 cores?

Author
Member

Yes, right. There is no extra overhead when Blender uses just 8 or fewer threads now (except for the function-call overhead, which is negligible here).

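The behavior discussed here can be sketched in one line (a hypothetical helper, not the actual code from the patch): the effective limit should never exceed the number of threads Blender is allowed to use, so runs with `-t 1` or on CPUs with fewer than 8 cores see no change.

```cpp
#include <algorithm>

/* Hypothetical sketch: clamp the hardcoded bandwidth-bound thread limit to the
 * number of threads Blender is configured to use overall. */
static int effective_bandwidth_thread_limit(const int blender_thread_count)
{
  constexpr int max_bandwidth_threads = 8;
  return std::min(max_bandwidth_threads, blender_thread_count);
}
```

With this clamp, `blender -t 1` yields a limit of 1 (fully serial), a 4-core machine keeps all 4 threads, and only machines with more than 8 available threads are actually restricted.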
Sergey Sharybin approved these changes 2024-03-19 14:30:26 +01:00
Sergey Sharybin left a comment
Owner

Lovely! I think we can go ahead with this change.

It is kind of interesting what the best constants would be for multi-socket configurations, but that is not something that prevents us from landing the current state of the PR.

Author
Member

> It is kind of interesting what the best constants would be for multi-socket configurations

True, though that's not something I can test, unfortunately.

Jacques Lucke merged commit b99c1abc3a into main 2024-03-19 18:24:07 +01:00
Jacques Lucke deleted branch limit-threading-for-bandwidth-bound-tasks 2024-03-19 18:24:10 +01:00
Reference: blender/blender#118939