I believed the crash I experienced happened because:
1. The `extract_pos_nor_init` function is called.
2. Tasks are added to the task pool for `extract_pos_nor`.
3. The tasks begin to be executed while more tasks are added.
4. In some rare cases, all existing tasks are finished, but not all have been added yet.
5. This let the task-counter go down to zero.
6. This triggered a call to `extract_pos_nor_finish`.
7. Then more tasks are added and in the end `extract_pos_nor_finish` is called again.
A solution is to use a task pool that is suspended when created.
Unfortunately, there was an outdated comment, that was probably the root cause of the issue.
Reviewers: fclem, sergey
Differential Revision: https://developer.blender.org/D5680
Previously we were setting it to 1 (aka no 'chunking'), to follow
previous behavior. However, this is far from optimal, especially with
CPUs that can have tens of threads nowadays.
Now taking an heuristic approach (inspired from the one already existing
for `BLI_task_parallel_listbase()`, which tries to guesstimate best
chunk sizes based on several factors (amount of threads/parallel tasks,
total number of items, ...).
Think this is a reasonable base ground, more optimization here would of
course be possible.
Note that code that was already explicitely settings some value here
won't be affected at all by that change.
This patch makes BLI_task_scheduler_create wait for all worker threads to have started before
returning to caller. For very short workloads (BLI_taks_test) there is the chance that the
worker threads have not fully started yet, and the main thread is calling pthread_join at
the same time as pthread_setspecific is being called on the worker threads which causes a
deadlock on pthreads4w.
Differential Revision: https://developer.blender.org/D4936
Reviewed By: mont29, sergey, brecht
BF-admins agree to remove header information that isn't useful,
to reduce noise.
- BEGIN/END license blocks
Developers should add non license comments as separate comment blocks.
No need for separator text.
- Contributors
This is often invalid, outdated or misleading
especially when splitting files.
It's more useful to git-blame to find out who has developed the code.
See P901 for script to perform these edits.
To make the pool more usable for running multiple stages of tasks,
fix local queue handling in BLI_task_pool_work_and_wait.
Specifically, after the wait loop the local queue should be empty,
or the wait part of the function contract isn't fulfilled. Instead,
check and run any tasks in queue before the wait loop.
Also, add a new function that resets the suspended state of the pool.
The goal is to address performance regression when going from
few threads to 10s of threads. On a systems with more than 32
CPU threads the benefit of threaded loop was actually harmful.
There are following tweaks now:
- The chunk size is adaptive for the number of threads, which
minimizes scheduling overhead.
- The number of tasks is adaptive to the list size and chunk
size.
Here comes performance comparison on the production shot:
Number of threads DEG time before DEG time after
44 0.09 0.02
32 0.055 0.025
16 0.025 0.025
8 0.035 0.033
- Use BLI_threadpool_ prefix for (deprecated)
thread/listbase API.
- Use BLI_thread as prefix for other functions.
See P614 to apply instead of manually resolving conflicts.
- When returning the number of items in a collection use BLI_*_len()
- Keep _size() for size in bytes.
- Keep _count() for data structures that don't store length
(hint this isn't a simple getter).
See P611 to apply instead of manually resolving conflicts.
The idea is to support following: allow doing parallel for on a small range,
each iteration of which takes lots of compute power, but limit such range to
a subset of threads.
For example, on a machine with 44 threads we can occupy 4 threads to handle
range of 64 elements, 16 elements per thread, where each block of 16 elements
is very complex to compute.
The idea should be to use this setting instead of global use_threading flag,
which is only based on size of array. Proper use of the new flag will improve
threadability.
This commit only contains internal task scheduler changes, this setting is not
used yet by any areas.
Now all the fine-tuning is happening using parallel range settings structure,
which avoid passing long lists of arguments, allows extend fine-tuning further,
avoid having lots of various functions which basically does the same thing.
The idea is to avoid any threading overhead when we start pushing tasks in a
loop. Similarly to how we do it from the new dependency graph. Gives couple of
percent of speedup here, but also improves scalability.
It merely uses the new thread-safe iterators system of mempool, quite
straight forward.
Note that to avoid possible confusion with two void pointers as
parameters of the callback, a dummy opaque struct pointer is used
instead for the second parameter (pointer generated by iteration over
mempool), callback functions must explicitely convert it to expected
real type.
Also added a basic gtest for this new feature.
The idea is to accumulate all new tasks in a thread local queue
first without doing any thread synchronization (aka, locks and
conditional variables) and move those tasks to a scheduler queue
once they are all ready. This way we avoid per-task-pool lock
and only have one lock per bunch of tasks.
This is particularly handy when scheduling new dependency graph
node children. Brings FPS of cached simulation from the linked
below file from ~30 to ~50.
See documentation for BLI_task_pool_delayed_push_{begin, end}
and for TaskThreadLocalStorage::do_delayed_push.
Fixes T50027: Rigidbody playback and simulation performance regression with new depsgraph
Thanks Bastien for the review!
We can not re-use anything for such pools, because we will know nothing about whether
the main thread is sleeping or not. So we identify such threads as 0, but we don't
use main thread's TLS.
This fixes dead-locks and crashes reported by Luca when doing playblasts.
Suspended pools allows to push huge amount of initial tasks
without any threading synchronization and hence overhead.
This gives ~50% speedup of cached rigid body with file from
T50027 and seems to have no negative affect in other scenes
here.