The goal is to address performance regression when going from
few threads to 10s of threads. On a systems with more than 32
CPU threads the benefit of threaded loop was actually harmful.
There are following tweaks now:
- The chunk size is adaptive for the number of threads, which
minimizes scheduling overhead.
- The number of tasks is adaptive to the list size and chunk
size.
Here comes performance comparison on the production shot:
Number of threads DEG time before DEG time after
44 0.09 0.02
32 0.055 0.025
16 0.025 0.025
8 0.035 0.033
40 KiB
40 KiB