Commit Graph

112 Commits

Author SHA1 Message Date
fcbec6e97e BLI_task: Add pooled threaded index range iterator, Take II.
This code allows pushing a set of different operations, all based on
iterations over a range of indices, and then processing them all at once
over multiple threads.

This commit also adds unit tests for both the old un-pooled and the new
pooled task_parallel_range family of functions, as well as some basic
performance tests.

This is mainly interesting for a relatively low number of individual
tasks, as expected.

E.g. performance tests on a 32-thread machine, for a set of 10
different tasks, show the following improvements when using the pooled
version instead of ten sequential calls to BLI_task_parallel_range():

| Num Items | Sequential | Pooled  | Speed-up |
| --------- | ---------- | ------- | -------- |
|       10K |     365 us |  138 us |   2.5  x |
|      100K |     877 us |  530 us |   1.66 x |
|     1000K |    5521 us | 4625 us |   1.25 x |

Differential Revision: https://developer.blender.org/D6189

Note: Compared to the previous commit from yesterday, this reworks atomic
handling in the parallel iterator code, and fixes a silly double-free bug.

Now we only use the two critical values for synchronization as returned by
the atomic calls themselves, which is the proper way to do things.

Reading a value after an atomic operation does not guarantee that you
will get the latest value in all cases (especially on Windows release
builds, it seems).
2019-11-26 14:30:41 +01:00
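
To illustrate the atomic-handling note in fcbec6e97e above, here is a minimal sketch using C11 atomics. It is not the BLI_task code: `RangeState` and `range_claim` are hypothetical names, and the point is only that the chunk bounds come from the value returned by the atomic add, not from re-reading the shared counter afterwards.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared state: the next index to hand out from a range. */
typedef struct RangeState {
  atomic_int next; /* first index not yet claimed by any thread */
  int stop;        /* one past the last valid index */
} RangeState;

/* Claim up to `chunk` indices. The start of the chunk is the value returned by
 * the atomic add itself. Re-reading `state->next` afterwards could observe an
 * update made by another thread in the meantime. */
static bool range_claim(RangeState *state, int chunk, int *r_start, int *r_end)
{
  const int start = atomic_fetch_add(&state->next, chunk);
  if (start >= state->stop) {
    return false; /* nothing left to process */
  }
  const int end = start + chunk;
  *r_start = start;
  *r_end = (end < state->stop) ? end : state->stop;
  return true;
}
```

Each worker would call `range_claim()` in a loop until it returns false; two workers can never receive overlapping chunks because each one derives its bounds from its own atomic result.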
3f87ac3684 Revert "BLI_task: Add pooled threaded index range iterator."
This reverts commit f9028a3be1.

This is causing a weird heisenbug crash on Windows release builds only...
Reverting until we understand the issue.
2019-11-25 19:54:40 +01:00
52f0d685ba Revert "Cleanup: Unused variable in release build mode"
This reverts commit e0cada9519.
2019-11-25 19:54:40 +01:00
e0cada9519 Cleanup: Unused variable in release build mode
Thanks Bastien for code review!
2019-11-25 15:22:21 +01:00
f9028a3be1 BLI_task: Add pooled threaded index range iterator.
This code allows pushing a set of different operations, all based on
iterations over a range of indices, and then processing them all at once
over multiple threads.

This commit also adds unit tests for both the old un-pooled and the new
pooled `task_parallel_range` family of functions, as well as some basic
performance tests.

This is mainly interesting for a relatively low number of individual
tasks, as expected.

E.g. performance tests on a 32-thread machine, for a set of 10
different tasks, show the following improvements when using the pooled
version instead of ten sequential calls to `BLI_task_parallel_range()`:

    | Num Items | Sequential | Pooled  | Speed-up |
    | --------- | ---------- | ------- | -------- |
    |       10K |     365 us |  138 us |   2.5  x |
    |      100K |     877 us |  530 us |   1.66 x |
    |     1000K |    5521 us | 4625 us |   1.25 x |

Differential Revision: https://developer.blender.org/D6189
2019-11-25 11:58:09 +01:00
2defd81b5e Cleanup: comments 2019-11-20 18:12:50 +11:00
Bastien Montagne
29433da4c6 BLI_task: Add new generic BLI_task_parallel_iterator().
This new function is part of the 'parallel for loops' functions. It
takes an iterator callback to generate items to be processed, in
addition to the usual 'process' func callback.

This allows using common code from BLI_task for a wide range of custom
iterators, without having to re-invent the wheel of the whole tasks &
data chunks handling.

This supports all settings features from `BLI_task_parallel_range()`,
including dynamic and static (if the total number of items is known)
scheduling, TLS data and its finalize callback, etc.

One question here is whether we should provide user code with a spinlock
by default, or require it to always handle its own sync mechanism.
I kept it, since imho it will be needed very often, and generating one
is pretty cheap even if unused...

----------

Additionally, this commit converts (currently unused)
`BLI_task_parallel_listbase()` to use that generic code. This was done
mostly as proof of concept, but performance-wise it shows some
interesting data, roughly:
 - Very light processing (that should not be threaded anyway) is several
   times slower, which is expected due to more overhead in loop management
   code.
 - Heavier processing can be up to 10% quicker (probably thanks to the
   switch from dynamic to static scheduling, which greatly reduces the
   locking needed to fill in the per-task chunks of data). A similar
   speed-up in the non-threaded case comes as a surprise though, not sure
   what can explain that.

While this conversion is not really needed, imho we should keep it
(instead of the existing code for that function), since it's easier to
have complex handling logic in as few places as possible, both for
maintaining and for improving it.

Note: That work was initially done to make D5372 possible... Unfortunately, that one proved to be no better than the original code performance-wise.

Reviewed By: sergey

Differential Revision: https://developer.blender.org/D5371
2019-10-30 12:23:45 +01:00
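
The callback contract described in 29433da4c6 (one callback that generates items, one that processes them) can be sketched as follows. This is a single-threaded illustration only, and every name in it (`IterNextFn`, `IterProcessFn`, `iterate_and_process`, `ListSum`) is hypothetical rather than taken from BLI_task.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback types, modeled on the description above. */
typedef void (*IterNextFn)(void *userdata, void **r_item, int *r_index, bool *r_stop);
typedef void (*IterProcessFn)(void *userdata, void *item, int index);

/* Serial reference loop: ask the iterator for the next item, then process it.
 * A threaded version would hand batches of generated items to worker tasks. */
static void iterate_and_process(void *userdata, IterNextFn next, IterProcessFn process)
{
  for (;;) {
    void *item = NULL;
    int index = 0;
    bool stop = false;
    next(userdata, &item, &index, &stop);
    if (stop) {
      break;
    }
    process(userdata, item, index);
  }
}

/* Example custom iterator: walk a singly linked list and sum its values. */
typedef struct Node {
  struct Node *next;
  int value;
} Node;

typedef struct ListSum {
  Node *current; /* iterator position */
  int count;     /* index of the next generated item */
  long sum;      /* result accumulated by the process callback */
} ListSum;

static void list_next(void *userdata, void **r_item, int *r_index, bool *r_stop)
{
  ListSum *data = userdata;
  if (data->current == NULL) {
    *r_stop = true;
    return;
  }
  *r_item = data->current;
  *r_index = data->count++;
  data->current = data->current->next;
}

static void list_process(void *userdata, void *item, int index)
{
  ListSum *data = userdata;
  (void)index; /* unused in this example */
  data->sum += ((Node *)item)->value;
}

static long sum_list(Node *head)
{
  ListSum data = {head, 0, 0};
  iterate_and_process(&data, list_next, list_process);
  return data.sum;
}
```

In a threaded version the iterator callback is the part that needs synchronization, which is presumably why the commit message discusses providing a spinlock to user code.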
2409a9f0af BLI_tasks: simplify/generalize heuristic computing default chunk size.
That code is simpler and more general (not limited to some specific
values of thread numbers). It still gives a default chunk size similar
to what we had before, but handles smoother increase steps and higher
numbers of threads by continuing to increase the chunk size.

No functional change expected from that commit.
2019-09-18 17:38:08 +02:00
0b2d1badec Cleanup: use post increment/decrement
When the result isn't used, prefer post increment/decrement
(already used nearly everywhere in Blender).
2019-09-08 00:23:25 +10:00
05721cd00a Mesh Batch Cache: Fix threading issue
I believe the crash I experienced happened because:
1. The `extract_pos_nor_init` function is called.
2. Tasks are added to the task pool for `extract_pos_nor`.
3. The tasks begin to be executed while more tasks are added.
4. In some rare cases, all existing tasks are finished, but not all have been added yet.
5. This lets the task counter go down to zero.
6. This triggers a call to `extract_pos_nor_finish`.
7. Then more tasks are added, and in the end `extract_pos_nor_finish` is called again.

A solution is to use a task pool that is suspended when created.
Unfortunately, there was an outdated comment that was probably the root cause of the issue.

Reviewers: fclem, sergey

Differential Revision: https://developer.blender.org/D5680
2019-09-05 09:57:30 +02:00
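
The race listed in 05721cd00a can be modeled deterministically with a toy pool that tracks only the pending-task counter. This is a sketch under strong assumptions: `ToyPool` and its functions are hypothetical, there are no real threads, and the non-suspended case simulates the worst case where the worker finishes every task before the next one is pushed.

```c
#include <stdbool.h>

/* Hypothetical toy pool; it models only the pending-task counter. */
typedef struct ToyPool {
  bool suspended;   /* when true, pushed tasks wait until work_and_wait */
  int queued;       /* tasks pushed but not yet run */
  int finish_calls; /* how many times the "finish" step fired */
} ToyPool;

/* Run everything queued; the counter reaching zero fires the finish step. */
static void toy_pool_drain(ToyPool *pool)
{
  if (pool->queued == 0) {
    return;
  }
  pool->queued = 0;
  pool->finish_calls++;
}

static void toy_pool_push(ToyPool *pool)
{
  pool->queued++;
  if (!pool->suspended) {
    /* Simulated worst case: the worker completes this task before the
     * producer gets to push the next one. */
    toy_pool_drain(pool);
  }
}

static void toy_pool_work_and_wait(ToyPool *pool)
{
  pool->suspended = false;
  toy_pool_drain(pool);
}

/* Push `num_tasks` tasks and report how often the finish step ran. */
static int toy_finish_calls(bool start_suspended, int num_tasks)
{
  ToyPool pool = {start_suspended, 0, 0};
  for (int i = 0; i < num_tasks; i++) {
    toy_pool_push(&pool);
  }
  toy_pool_work_and_wait(&pool);
  return pool.finish_calls;
}
```

With three tasks, `toy_finish_calls(false, 3)` reports three finish calls, while `toy_finish_calls(true, 3)` reports exactly one, which is the behavior the suspended pool is meant to guarantee.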
0d719fcacb Cleanup: spelling 2019-08-12 01:10:43 +10:00
5f405728bb BLI_task: Cleanup: rename some structs to make them more generic.
TLS and Settings can be used by other types of parallel 'for loops', so
removing 'Range' from their names.

No functional changes expected here.
2019-07-30 14:56:47 +02:00
b9c257019f BLI_task: tweak default chunk size for BLI_task_parallel_range().
Previously we were setting it to 1 (aka no 'chunking'), to follow
previous behavior. However, this is far from optimal, especially with
CPUs that can have tens of threads nowadays.

Now taking a heuristic approach (inspired by the one already existing
for `BLI_task_parallel_listbase()`), which tries to guesstimate the best
chunk sizes based on several factors (number of threads/parallel tasks,
total number of items, ...).

I think this is a reasonable baseline; more optimization here would of
course be possible.

Note that code that was already explicitly setting some value here
won't be affected at all by that change.
2019-07-30 14:36:59 +02:00
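
As a rough illustration of what such a heuristic can look like (b9c257019f above), here is a sketch that grows the chunk size with the thread count and then clamps it so that small ranges are still spread over the threads. The constants and the formula are assumptions made for this example; they are not the values used in BLI_task.

```c
/* Hypothetical heuristic, NOT the formula used by BLI_task. */
static int guess_chunk_size(int num_items, int num_threads)
{
  if (num_items <= 0 || num_threads <= 0) {
    return 1;
  }
  /* Assumed base of 32 items, scaled up for every 8 threads so that many
   * threads do not contend on very small chunks. */
  const int factor = (num_threads / 8 > 1) ? num_threads / 8 : 1;
  int chunk = 32 * factor;
  /* Never hand out chunks so large that some threads would get nothing. */
  const int per_thread = num_items / num_threads;
  if (chunk > per_thread) {
    chunk = per_thread;
  }
  return (chunk < 1) ? 1 : chunk;
}
```

Under these assumed constants, 100000 items on 32 threads gives chunks of 128 items, while 100 items on 32 threads falls back to chunks of 3.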
f18373a9ab Fix: BLI_task_test deadlock on windows.
This patch makes BLI_task_scheduler_create wait for all worker threads to have started before
returning to the caller. For very short workloads (BLI_task_test) there is a chance that the
worker threads have not fully started yet, and that the main thread calls pthread_join at
the same time as pthread_setspecific is being called on the worker threads, which causes a
deadlock on pthreads4w.

Differential Revision: https://developer.blender.org/D4936

Reviewed By: mont29, sergey, brecht
2019-05-25 17:18:17 -06:00
cda4cd0705 Cleanup: comments (long lines) in blenlib 2019-04-22 06:30:08 +10:00
e12c08e8d1 ClangFormat: apply to source, most of intern
Apply clang format as proposed in T53211.

For details on usage and instructions for migrating branches
without conflicts, see:

https://wiki.blender.org/wiki/Tools/ClangFormat
2019-04-17 06:21:24 +02:00
9ba948a485 Cleanup: style, use braces for blenlib 2019-03-27 13:17:30 +11:00
eb8e656b2b Cleanup: spelling 2019-03-08 17:48:49 +11:00
de13d0a80c doxygen: add newline after \file
While \file doesn't need an argument, it can't have another doxy
command after it.
2019-02-18 08:22:12 +11:00
eef4077f18 Cleanup: remove redundant doxygen \file argument
Move \ingroup onto same line to be more compact and
make it clear the file is in the group.
2019-02-06 15:45:22 +11:00
65ec7ec524 Cleanup: remove redundant, invalid info from headers
BF-admins agree to remove header information that isn't useful,
to reduce noise.

- BEGIN/END license blocks

  Developers should add non license comments as separate comment blocks.
  No need for separator text.

- Contributors

  This is often invalid, outdated or misleading
  especially when splitting files.

  It's more useful to git-blame to find out who has developed the code.

See P901 for script to perform these edits.
2019-02-02 01:36:28 +11:00
4226ee0b71 Cleanup: comment line length (blenlib)
Prevents clang-format wrapping text before comments.
2019-01-15 23:30:31 +11:00
49490e5cfb Merge branch 'master' into blender2.8 2018-12-12 13:02:09 +11:00
e757c4a3be Cleanup: use colon separator after parameter
Helps separate variable names from descriptive text.
Was already used in some parts of the code,
double space and dashes were used elsewhere.
2018-12-12 12:50:58 +11:00
01581d4a1e BLI_task: fix queue in work_and_wait, and support resetting.
To make the pool more usable for running multiple stages of tasks,
fix local queue handling in BLI_task_pool_work_and_wait.

Specifically, after the wait loop the local queue should be empty,
or the wait part of the function contract isn't fulfilled. Instead,
check and run any tasks in queue before the wait loop.

Also, add a new function that resets the suspended state of the pool.
2018-12-04 14:08:50 +03:00
df2635099b Merge branch 'master' into blender2.8 2018-12-04 11:45:22 +01:00
3f31ec8398 Cleanup: Spelling 2018-12-04 11:43:53 +01:00
5c632ced53 Merge branch 'master' into blender2.8 2018-11-20 15:02:13 +01:00
01e8e7dc6d Task scheduler: Optimize parallel loop over lists
The goal is to address a performance regression when going from
a few threads to 10s of threads. On systems with more than 32
CPU threads the threaded loop was actually harmful.

The following tweaks are now in place:

- The chunk size is adaptive for the number of threads, which
  minimizes scheduling overhead.

- The number of tasks is adaptive to the list size and chunk
  size.

Here is a performance comparison on the production shot:

| Number of threads | DEG time before | DEG time after |
| ----------------- | --------------- | -------------- |
|                44 |           0.09  |          0.02  |
|                32 |           0.055 |          0.025 |
|                16 |           0.025 |          0.025 |
|                 8 |           0.035 |          0.033 |
2018-11-20 14:58:17 +01:00
3cf724209f Cleanup, spelling 2018-11-08 15:00:19 +01:00
0ddf3e110e Cleanup: comment blocks 2018-09-02 18:51:31 +10:00
ae57383648 Cleanup: comment blocks 2018-09-02 18:28:27 +10:00
bf8f5f5142 Cleanup: doxygen comments 2018-03-14 02:08:07 +11:00
2aef87bfae Cleanup: rename BLI_thread.h API
- Use BLI_threadpool_ prefix for (deprecated)
  thread/listbase API.
- Use BLI_thread as prefix for other functions.

See P614 to apply instead of manually resolving conflicts.
2018-02-16 01:13:46 +11:00
ccdacf1c9b Cleanup: use '_len' instead of '_size' w/ BLI API
- When returning the number of items in a collection use BLI_*_len()
- Keep _size() for size in bytes.
- Keep _count() for data structures that don't store length
  (hint this isn't a simple getter).

See P611 to apply instead of manually resolving conflicts.
2018-02-15 23:39:08 +11:00
c253fe5e87 Cleanup typo in comment. 2018-01-11 17:55:58 +01:00
518c65460e Task scheduler: Use more const qualifiers 2018-01-10 12:27:43 +01:00
5fe87a0a8c Task scheduler: Use single thread branch when range fits into single chunk 2018-01-09 18:10:47 +01:00
4a3b303bb0 Task scheduler: Fix wrong tasks calculation when chunk size is too big 2018-01-09 18:07:34 +01:00
932d448ae0 Task scheduler: Use const qualifiers in parallel range 2018-01-09 16:09:33 +01:00
8cffb0a141 Task scheduler: Avoid over-allocation of tasks for parallel ranges
This seems to only cause extra threading overhead on systems with 10s of
threads, without actually solving anything.
2018-01-09 16:09:33 +01:00
c4e42d70a4 Task scheduler: Add minimum number of iterations per thread in parallel range
The idea is to support the following: allow doing a parallel for on a small
range, each iteration of which takes lots of compute power, but limit such a
range to a subset of threads.

For example, on a machine with 44 threads we can occupy 4 threads to handle
a range of 64 elements, 16 elements per thread, where each block of 16
elements is very complex to compute.

The idea is to use this setting instead of the global use_threading flag,
which is only based on the size of the array. Proper use of the new setting
will improve threadability.

This commit only contains internal task scheduler changes, this setting is not
used yet by any areas.
2018-01-09 16:09:33 +01:00
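
The arithmetic behind the 44-thread example in c4e42d70a4 can be written out as below. The function name and signature are hypothetical; only the numbers (64 elements, 16 per thread, 44 threads, 4 threads used) come from the commit message.

```c
/* Hypothetical helper: how many threads to occupy for a range, given a minimum
 * number of iterations that each thread should receive. */
static int threads_for_range(int num_items, int min_iter_per_thread, int max_threads)
{
  if (num_items <= 0 || max_threads <= 0) {
    return 0;
  }
  if (min_iter_per_thread < 1) {
    min_iter_per_thread = 1; /* no lower bound requested */
  }
  int num_threads = num_items / min_iter_per_thread;
  if (num_threads < 1) {
    num_threads = 1; /* range smaller than the minimum: one thread does it all */
  }
  return (num_threads < max_threads) ? num_threads : max_threads;
}
```

For the case in the message, `threads_for_range(64, 16, 44)` yields 4, whereas without a minimum (`threads_for_range(64, 1, 44)`) all 44 threads would be occupied.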
3144f0573a Task scheduler: Simplify parallel range function
Basically, split it up and avoid extra abstraction level.
2018-01-09 16:09:33 +01:00
4c4a7e84c6 Task scheduler: Use single parallel range function with more flexible function
Now all the fine-tuning happens through the parallel range settings structure,
which avoids passing long lists of arguments, allows extending the fine-tuning
further, and avoids having lots of different functions which basically do the same thing.
2018-01-09 16:09:33 +01:00
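
The settings-structure pattern described in 4c4a7e84c6 can be sketched as follows. The struct, its fields and the function names are hypothetical stand-ins, and the loop runs serially; the sketch only shows how a defaults function plus a settings pointer replaces a family of functions with long argument lists.

```c
#include <stdbool.h>

/* Hypothetical settings struct; the field names are illustrative only. */
typedef struct LoopSettings {
  bool use_threading;      /* allow the loop to run on several threads */
  int min_iter_per_thread; /* lower bound on the work given to each thread */
} LoopSettings;

/* Fill in defaults so that callers only override what they care about. */
static void loop_settings_defaults(LoopSettings *settings)
{
  settings->use_threading = true;
  settings->min_iter_per_thread = 0;
}

typedef void (*LoopFn)(void *userdata, int index);

/* Single entry point. This sketch runs serially and ignores the settings; a
 * real implementation would consult them to decide how to split the range. */
static void loop_range(
    int start, int stop, void *userdata, LoopFn func, const LoopSettings *settings)
{
  (void)settings;
  for (int i = start; i < stop; i++) {
    func(userdata, i);
  }
}

/* Example callback: accumulate the indices into a long. */
static void add_index(void *userdata, int index)
{
  *(long *)userdata += index;
}

static long sum_range(int start, int stop)
{
  LoopSettings settings;
  loop_settings_defaults(&settings);
  settings.min_iter_per_thread = 16; /* override a single knob */
  long total = 0;
  loop_range(start, stop, &total, add_index, &settings);
  return total;
}
```

Adding a new tuning option then means adding one field and one default, without touching the signature of `loop_range()` or any existing caller.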
d2708b0f73 Task scheduler: Get rid of extended version of parallel range callback
Wrap all arguments into TLS type of argument. Avoids some branching and also
makes it easier to extend things in the future.
2018-01-09 16:09:33 +01:00
6efd58dd3e Task scheduler: Clarify why do we need an atomic add of 0 2017-12-22 16:37:25 +01:00
50f1c9a8af Task scheduler: Start with suspended pool to avoid threading overhead on push
The idea is to avoid any threading overhead when we start pushing tasks in a
loop, similarly to how we do it in the new dependency graph. This gives a couple
of percent of speedup here, but also improves scalability.
2017-12-22 12:25:11 +01:00
efb86b712d Add a new parallel looper for MemPool items to BLI_task.
It merely uses the new thread-safe iterator system of mempool, quite
straightforward.

Note that to avoid possible confusion with two void pointers as
parameters of the callback, a dummy opaque struct pointer is used
instead for the second parameter (pointer generated by iteration over
mempool); callback functions must explicitly convert it to the expected
real type.

Also added a basic gtest for this new feature.
2017-11-23 21:14:43 +01:00
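
The opaque-pointer trick mentioned in efb86b712d can be sketched like this. `PoolItem`, `PoolItemFn` and the `Vec2` payload are hypothetical names chosen for the example; the idea is that giving the second parameter its own never-defined struct type lets the compiler diagnose swapped arguments, which two `void *` parameters would silently accept.

```c
/* Opaque type: declared but never defined, only used through pointers. */
typedef struct PoolItem PoolItem;

/* Callback signature: user data stays `void *`, the iterated item does not. */
typedef void (*PoolItemFn)(void *userdata, PoolItem *item);

/* The real type stored in the (hypothetical) pool. */
typedef struct Vec2 {
  float x, y;
} Vec2;

static void accumulate_x(void *userdata, PoolItem *item)
{
  float *total = userdata;
  /* The callback must explicitly convert the opaque pointer to the real type. */
  const Vec2 *v = (const Vec2 *)item;
  *total += v->x;
}

/* Serial stand-in for the parallel looper: call the callback on each item. */
static float sum_x(Vec2 *items, int count)
{
  float total = 0.0f;
  PoolItemFn func = accumulate_x;
  for (int i = 0; i < count; i++) {
    func(&total, (PoolItem *)&items[i]);
  }
  return total;
}
```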
497e2b3dfa Cleanup: use signed atomic ops when needed. 2017-11-23 16:24:34 +01:00
00c4f49a6d Cleanup: indentation, long lines 2017-06-12 13:38:21 +10:00