Together with the extended loop callback and userdata_chunk, this makes it possible to perform
cumulative tasks (like aggregation) in a lock-free way, using the local userdata_chunk to store temporary data
and, once all workers have finished, merging those userdata_chunks in the finalize callback
(which runs from the calling thread, so no locking is needed there either).
Note that this changes how userdata_chunk is handled: it is now managed fully from the 'main' thread,
which means a given worker thread always gets the same userdata_chunk, no longer
re-initialized to the init value at the start of each iteration chunk.
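A minimal sketch of the accumulation pattern this enables (the callback shapes below are illustrative only, not the exact BLI_task_parallel_range API):
```c
/* Sketch only: per-worker accumulation without locks, merged once at the end.
 * The callback shapes below are illustrative, not the exact Blender API. */
typedef struct SumData {
  const float *values; /* shared, read-only input */
  double total;        /* final result, written only by the finalize step */
} SumData;

typedef struct SumChunk {
  double partial; /* each worker thread owns one of these */
} SumChunk;

/* Runs in worker threads: touches only the thread-local chunk, so no lock. */
static void sum_range_func(void *userdata, void *userdata_chunk, int index)
{
  SumData *data = userdata;
  SumChunk *chunk = userdata_chunk;
  chunk->partial += data->values[index];
}

/* Runs once per chunk, from the calling thread, after all workers finished:
 * the merge is serial, so again no lock is needed. */
static void sum_finalize_func(void *userdata, void *userdata_chunk)
{
  SumData *data = userdata;
  SumChunk *chunk = userdata_chunk;
  data->total += chunk->partial;
}
```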
The new code is actually much, much better than the first version; using a 'fetch_and_add' atomic op
here lets us get rid of the loop, etc.
The broken CAS issue remains on Windows, to be investigated...
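For illustration, the difference between the two approaches, sketched with C11 atomics rather than Blender's atomic_ops.h wrappers:
```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative only: claiming the next chunk of iterations for a worker thread. */

/* Old approach: retry loop around a compare-and-swap. */
static bool claim_chunk_cas(atomic_int *iter, int chunk, int stop, int *r_start)
{
  int cur = atomic_load(iter);
  while (cur < stop) {
    if (atomic_compare_exchange_weak(iter, &cur, cur + chunk)) {
      *r_start = cur;
      return true;
    }
    /* On failure 'cur' was reloaded with the current value; try again. */
  }
  return false;
}

/* New approach: a single fetch_and_add claims the chunk, no retry loop. */
static bool claim_chunk_add(atomic_int *iter, int chunk, int stop, int *r_start)
{
  *r_start = atomic_fetch_add(iter, chunk);
  return *r_start < stop;
}
```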
There are some serious issues under Windows, somehow causing deadlocks (not reproducible under Linux so far).
Until it is investigated further why this happens, it is better to revert to the previous
spin-locked behavior.
This reverts commits a83bc4f597 and 98123ae916.
Reading the shared state->iter value after storing it in the 'reference' var could in theory
lead to a race condition that sets state->iter above state->stop, which would be 'deadly'.
This **may** be the cause of T48422, though I have not been able to reproduce that issue so far.
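The shape of the hazard, sketched with C11 atomics (not the actual Blender code): any decision has to be based on the value the atomic op itself returned, not on a re-read of the shared counter.
```c
#include <stdatomic.h>
#include <stdbool.h>

/* Not the actual Blender code, just the shape of the hazard. */
static bool next_chunk_racy(atomic_int *iter, int chunk, int stop, int *r_start)
{
  *r_start = atomic_fetch_add(iter, chunk);
  /* BAD: another thread may have bumped 'iter' again already, so this re-read
   * races and the overshoot past 'stop' can be missed. */
  return atomic_load(iter) <= stop;
}

static bool next_chunk_safe(atomic_int *iter, int chunk, int stop, int *r_start)
{
  /* GOOD: decide only from the value this very operation returned. */
  *r_start = atomic_fetch_add(iter, chunk);
  return *r_start < stop;
}
```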
Pass a distance argument so it's possible to limit the range we get all hits from.
Other changes:
- Use a boundbox test before calling the callback, avoiding redundant calls.
- Remove meaningless return value.
- Add a doc string explaining the purpose of this function.
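A usage sketch of the distance-limited 'all hits' query; the function and callback names here are invented for illustration and do not match the real kdopbvh API:
```c
/* Hypothetical sketch: names below are invented, not the actual BLI API. */
typedef void (*AllHitsCallback)(void *userdata, int index, float dist_sq);

/* Hypothetical query: reports every hit within 'max_dist' of the ray start,
 * calling 'cb' once per hit (after the boundbox pre-test mentioned above). */
void bvh_ray_all_hits(const void *tree,
                      const float ray_start[3], const float ray_dir[3],
                      float max_dist, AllHitsCallback cb, void *userdata);

static void count_hits_cb(void *userdata, int index, float dist_sq)
{
  (void)index;
  (void)dist_sq;
  (*(int *)userdata)++;
}

static int count_hits_within(const void *tree,
                             const float start[3], const float dir[3])
{
  int hits = 0;
  /* Limit collection to hits closer than 10 units instead of gathering all. */
  bvh_ray_all_hits(tree, start, dir, 10.0f, count_hits_cb, &hits);
  return hits;
}
```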
This commit makes use of the new taskpool feature (instead of allocating its own tasks),
and removes the spinlock used to generate chunks (using atomic ops instead).
In the best cases (a dynamically scheduled loop with a light processing func callback), we
get a few percent of speedup; in most cases there is no noticeable improvement.
It appears the mutex was guaranteeing that the number of tasks is not modified at moments
when that is not expected. Removing those mutexes resulted in some hard-to-catch
deadlocks where worker threads were waiting for work while all the tasks were already
done.
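The failure mode, sketched with plain pthreads (illustrative, not the actual task scheduler code): the task-count check and the wait must happen under the same lock the signaling side holds, otherwise the wakeup can be lost.
```c
#include <pthread.h>

/* Illustrative only, not the actual task scheduler code. */
static pthread_mutex_t work_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t work_cond = PTHREAD_COND_INITIALIZER;
static int num_tasks = 0;

static void wait_for_work(void)
{
  pthread_mutex_lock(&work_mutex);
  while (num_tasks == 0) {
    /* Atomically releases the mutex and sleeps. If 'num_tasks' were checked
     * without the mutex, a signal sent between the check and the sleep would
     * be lost and the worker would wait forever. */
    pthread_cond_wait(&work_cond, &work_mutex);
  }
  num_tasks--;
  pthread_mutex_unlock(&work_mutex);
}

static void push_task(void)
{
  pthread_mutex_lock(&work_mutex);
  num_tasks++;
  pthread_cond_signal(&work_cond);
  pthread_mutex_unlock(&work_mutex);
}
```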
This reverts commit a1d8fe052c.
Brain melt here: the intention was to reduce the number of tasks in case we do not have many chunks of data to loop over,
not to increase it!
Note that this only affected dynamic scheduling.
This commit implements a new function, BLI_task_pool_push_from_thread(),
whose main goal is to put less parasitic load on the CPU by avoiding
memory allocations as much as possible, making task pushing cheaper.
This function expects a thread ID, which must be 0 for the thread the
pool is created from (and from which wait_work() is called); for other
threads it must be the ID which was passed to the thread working
function.
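A usage sketch of the thread-ID convention described above (parameter lists are assumed from this description and may not match the real declarations):
```c
#include "BLI_task.h"

/* Sketch only: parameter lists are assumed and may differ from BLI_task.h. */

void *make_child_data(void *parent_data); /* hypothetical helper */

static void task_run_func(TaskPool *pool, void *taskdata, int threadid)
{
  /* ...do some work... Follow-up tasks are pushed with the thread ID that was
   * handed to this worker function, keeping the push cheap. */
  void *child_data = make_child_data(taskdata);
  BLI_task_pool_push_from_thread(pool, task_run_func, child_data, true,
                                 TASK_PRIORITY_HIGH, threadid);
}

static void schedule_root_tasks(TaskPool *pool, void *root_data)
{
  /* From the thread the pool was created from (the one that will call
   * BLI_task_pool_work_and_wait()), the thread ID must be 0. */
  BLI_task_pool_push_from_thread(pool, task_run_func, root_data, false,
                                 TASK_PRIORITY_HIGH, 0);
  BLI_task_pool_work_and_wait(pool);
}
```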
This reduces allocations quite a bit in the new dependency graph,
hopefully gaining some visible speedup on fewzillion-core machines
(on my own machine I can only see the benefit in the profiler, which shows
a significant reduction of time wasted in memory allocation).
Separate the creation of trees from EditMesh from the creation of trees from DerivedMesh.
The shared API was meant to simplify things, but didn't work out so well:
`bvhtree_from_mesh_*` actually works as `bvhtree_from_derivedmesh_*`.
This is inconsistent with the trees created from EditMesh, since creating those does not use the DerivedMesh at all.
In such cases the dm is used only to cache the tree in the DerivedMesh struct, a cache that is immediately lost once
the bvhtree is used in functions that change (tag) the DM, clearing the cache.
- Use a filter function so users of SnapObjectContext can define how edit-mesh elements are handled.
- Remove em_evil.
- The bvhtree of the EditMesh is now properly cached in the snap functions.
- Code becomes better organized and easier to maintain.
This is an important patch for future improvements in snapping functions.
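A rough sketch of the separated creation paths (the editmesh-side names and all signatures are assumed here for illustration):
```c
#include "BKE_bvhutils.h"

/* Sketch with assumed names/signatures; the point is only that the two inputs
 * now go through separate entry points and the EditMesh path never touches a
 * DerivedMesh or its cache. */
static void build_trees_example(DerivedMesh *dm, BMEditMesh *em)
{
  /* DerivedMesh path: the tree may still be cached inside the DerivedMesh. */
  BVHTreeFromMesh treedata_dm = {NULL};
  bvhtree_from_mesh_looptri(&treedata_dm, dm, 0.0f, 2, 6);

  /* EditMesh path: built directly from the BMEditMesh and cached by the
   * caller (e.g. the snap context), not by a throwaway DerivedMesh. */
  BVHTreeFromEditMesh treedata_em = {NULL};
  bvhtree_from_editmesh_looptri(&treedata_em, em, 0.0f, 2, 6);

  /* ...query the trees, then free them with the matching free functions... */
  free_bvhtree_from_mesh(&treedata_dm);
  free_bvhtree_from_editmesh(&treedata_em);
}
```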
Use SSE2 intrinsics, when available, for this kind of conversion.
It's not totally accurate, but accurate enough for the cases where
we're doing direct colorspace conversion, bypassing OCIO.
Partially based on code from Cycles, partially based on other online
articles:
https://stackoverflow.com/questions/6475373/optimizations-for-pow-with-const-non-integer-exponent
Makes projection painting on hi-res float textures smoother.
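A rough sketch of the underlying 'fast pow with a constant exponent' trick from the linked article, written with plain SSE2 intrinsics; this is not the actual Blender implementation and is only a coarse approximation for positive inputs:
```c
#include <emmintrin.h>

/* Coarse approximation of pow(x, e) for four positive floats at once,
 * via pow(x, e) = exp2(e * log2(x)) with log2/exp2 taken from the float
 * bit pattern. Illustrative only. */
static __m128 approx_pow_ps(__m128 x, float exponent)
{
  const __m128 c127 = _mm_set1_ps(127.0f);
  const __m128 c2p23 = _mm_set1_ps(8388608.0f);        /* 2^23 */
  const __m128 c2m23 = _mm_set1_ps(1.0f / 8388608.0f); /* 2^-23 */

  /* log2(x) ~ float(bits(x)) * 2^-23 - 127 (piecewise linear). */
  __m128 log2_x = _mm_sub_ps(
      _mm_mul_ps(_mm_cvtepi32_ps(_mm_castps_si128(x)), c2m23), c127);

  /* exp2(y) ~ reinterpret((y + 127) * 2^23) as float. */
  __m128 y = _mm_mul_ps(log2_x, _mm_set1_ps(exponent));
  __m128i bits = _mm_cvtps_epi32(_mm_mul_ps(_mm_add_ps(y, c127), c2p23));
  return _mm_castsi128_ps(bits);
}
```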
This commit also enables global SSE2 in Blender. It shouldn't
bring any regressions on supported hardware (we have required SSE2 since
2.64), but we should keep an eye on it because compilers might have
some bugs with that (unlikely, but possible).
This is an alternative to passing a copy callback, which is sometimes inconvenient.
Instead the caller can write to the key - needed when the key is duplicated memory.
This allows for a minor optimization in ghash/gset use.
Also add BLI_gset_ensure_p_ex.
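An approximate usage sketch (the exact declaration may differ from what is assumed here):
```c
/* Approximate usage sketch: the declaration of BLI_ghash_ensure_p_ex is
 * written from memory here and may differ from the real one. */
static void *ensure_named_entry(GHash *gh, const char *name)
{
  void **key_p, **val_p;
  /* Returns true if the key already existed. */
  if (!BLI_ghash_ensure_p_ex(gh, (void *)name, &key_p, &val_p)) {
    /* New entry: the hash currently stores our temporary 'name' pointer, so
     * replace it with an owned copy by writing through the key pointer. */
    *key_p = BLI_strdup(name);
    *val_p = NULL; /* value to be filled in by the caller */
  }
  return *val_p;
}
```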
The behavior is similar to Python's set.pop(): it removes and returns a 'random' entry from the hash.
Notes:
* Popping returns items in the same order as the ghash/gset iterators (i.e. increasing
  order in the internal bucket-based storage), unless the ghash/gset is modified in between.
* We keep track of the latest bucket we popped from (through a 'state' parameter);
  this gives performance similar to the iterators when iteratively popping a whole hash
  (without it we are roughly O(n^2), with it we are roughly O(n)...).
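An approximate draining loop (type and signature written from memory, so they may differ):
```c
extern void handle_entry(void *key, void *val); /* hypothetical consumer */

/* Drain the hash by popping entries, carrying 'state' across calls so each
 * pop resumes at the last visited bucket instead of rescanning from zero. */
static void drain_ghash(GHash *gh)
{
  GHashIterState state = {0};
  void *key, *val;
  while (BLI_ghash_pop(gh, &state, &key, &val)) {
    handle_entry(key, val);
  }
}
```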
Reviewers: campbellbarton
Differential Revision: https://developer.blender.org/D1808