There is no point having operations that iterate over the whole
bit array as macros, so convert BLI_BITMAP_SET_ALL to a function.
Also, add more utilities for copying and manipulating masks.
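A minimal sketch of what a whole-array operation like set-all can look like over a word-based bitmap; the type and names here are illustrative assumptions, not the actual BLI_bitmap API:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef unsigned int BitmapWord;  /* Hypothetical storage word. */
#define BITS_PER_WORD (sizeof(BitmapWord) * 8)

/* Set or clear every bit with a single memset over the backing words. */
static void bitmap_set_all(BitmapWord *bitmap, bool set, size_t num_bits)
{
  const size_t num_words = (num_bits + BITS_PER_WORD - 1) / BITS_PER_WORD;
  memset(bitmap, set ? 0xFF : 0x00, num_words * sizeof(BitmapWord));
}
```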
Reviewers: brecht, campbellbarton
Differential Revision: https://developer.blender.org/D4101
The new data structure uses open addressing instead of chaining to resolve collisions in the hash table.
This new structure was never slower than the old implementation in my tests. Code that first inserts all edges and then iterates through all edges (e.g. to remove duplicates) benefits the most, because the `EdgeHashIterator` becomes a simple for loop over a continuous array.
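For illustration, a minimal sketch of open addressing with linear probing for edge keys; the names and layout below are assumptions of mine, not the actual `EdgeHash` code:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct Edge {
  uint32_t v_low, v_high;  /* Vertex indices, stored so that v_low < v_high. */
} Edge;

typedef struct EdgeMap {
  Edge *slots;         /* Flat array; empty slots use v_low == UINT32_MAX. */
  uint32_t slot_mask;  /* Capacity is a power of two, so probing wraps with a mask. */
} EdgeMap;

static uint32_t edge_hash(uint32_t v0, uint32_t v1)
{
  return (v0 * 65599u) ^ v1;  /* Any decent mixing of the two indices works. */
}

/* Insert the edge unless it is already present; returns true when newly added.
 * Assumes the table is kept below its load factor, so a free slot exists. */
static bool edge_map_add(EdgeMap *map, uint32_t v0, uint32_t v1)
{
  if (v0 > v1) {
    uint32_t t = v0; v0 = v1; v1 = t;
  }
  uint32_t slot = edge_hash(v0, v1) & map->slot_mask;
  while (map->slots[slot].v_low != UINT32_MAX) {
    if (map->slots[slot].v_low == v0 && map->slots[slot].v_high == v1) {
      return false;  /* Duplicate edge, nothing to insert. */
    }
    slot = (slot + 1) & map->slot_mask;  /* Collision: probe the next slot. */
  }
  map->slots[slot] = (Edge){v0, v1};
  return true;
}
```

Iterating all edges then reduces to a loop over `slots` that skips empty entries, which is what makes the iterator a simple for loop over a continuous array.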
Reviewer: campbellbarton
Differential Revision: https://developer.blender.org/D4050
This makes the `#include <Windows.h>` usage more localized and moves it
out of the `image.c` file.
Reviewers: LazyDodo
Reviewed by: LazyDodo
Differential Revision: https://developer.blender.org/D4049
To make the pool more usable for running multiple stages of tasks,
fix local queue handling in BLI_task_pool_work_and_wait.
Specifically, after the wait loop the local queue should be empty,
or the wait part of the function contract isn't fulfilled. Instead,
check and run any tasks in the local queue before entering the wait loop.
Also, add a new function that resets the suspended state of the pool.
The idea is to have the main thread and job threads scheduled on the
CPU dies which have direct access to memory (those are NUMA
nodes 0 and 2).
We also do this for the new EPYC CPUs, since their NUMA nodes 1 and 3
do have memory access, but only to the higher-range DDR slots. By
preferring nodes 0 and 2 on EPYC, users with partially filled
DDR slots still get fast memory access.
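As a rough sketch of the scheduling side only, using libnuma on Linux; the helper and the alternating policy are assumptions of mine, not Blender's actual NUMA code:

```c
#include <numa.h>
#include <stdio.h>

/* Bind the calling worker thread to one of the NUMA nodes that has
 * direct memory access on the CPUs described above (nodes 0 and 2). */
static void bind_worker_to_preferred_node(int worker_index)
{
  if (numa_available() < 0) {
    return;  /* Not a NUMA system, nothing to do. */
  }
  const int node = (worker_index % 2) ? 2 : 0;
  if (node <= numa_max_node()) {
    numa_run_on_node(node);  /* Schedule this thread on CPUs of that node. */
  }
}

int main(void)
{
  bind_worker_to_preferred_node(0);
  printf("main thread bound to a node with direct memory access\n");
  return 0;
}
```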
One thing which is not really solved yet is localization of
memory allocation: we do not guarantee that memory is allocated
on the DDR slots closest to the NUMA node, and hope that the
operating system's memory manager acts in our favor.
There is a new `bpy.app.timers` API.
For more details, look in the Python API documentation.
Reviewers: campbellbarton
Differential Revision: https://developer.blender.org/D3994
The goal is to address a performance regression when going from
a few threads to tens of threads. On systems with more than 32
CPU threads the threaded loop actually hurt performance.
The following tweaks are made:
- The chunk size is adaptive to the number of threads, which
minimizes scheduling overhead (a rough sketch of the heuristic
follows this list).
- The number of tasks is adaptive to the list size and chunk
size.
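A rough sketch of such a heuristic, with made-up constants and names; the actual numbers in the scheduler differ:

```c
#include <stddef.h>

/* Pick a chunk size that grows with the thread count, so the number of
 * scheduled tasks, and with it the scheduling overhead, stays bounded. */
static size_t chunk_size_for(size_t num_items, int num_threads)
{
  size_t chunk = 32;
  if (num_threads > 16) {
    chunk = 128;
  }
  else if (num_threads > 8) {
    chunk = 64;
  }
  if (chunk > num_items) {
    chunk = (num_items > 0) ? num_items : 1;  /* Never exceed the list size. */
  }
  return chunk;
}

/* The number of tasks then follows from the list size and chunk size. */
static size_t num_tasks_for(size_t num_items, size_t chunk)
{
  return (num_items + chunk - 1) / chunk;  /* Round up. */
}
```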
Here is a performance comparison on a production shot:
Threads   DEG time before   DEG time after
44        0.09              0.02
32        0.055             0.025
16        0.025             0.025
8         0.035             0.033
The Nearest Surface Point shrink method, while fast, is neither
smooth nor continuous: as the source point moves, the projected
point can both stop and jump. This causes distortions in the
deformation of the shrinkwrap modifier, and in the motion of an
animated object with a shrinkwrap constraint.
This patch implements a new mode which, instead of using the simple
nearest point search, iteratively solves an equation for each triangle
to find a point whose interpolated normal points to or from the
original vertex. Non-manifold boundary edges are treated as infinitely
thin cylinders that cast normals in all perpendicular directions.
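As I read the description, the per-triangle condition can be sketched as follows (the notation is mine, not from the patch): find barycentric coordinates (u, v) such that the interpolated normal at the surface point is parallel to the direction towards the original vertex x.

```latex
p(u, v) = (1 - u - v)\,p_0 + u\,p_1 + v\,p_2
n(u, v) = \operatorname{normalize}\!\big((1 - u - v)\,n_0 + u\,n_1 + v\,n_2\big)
\big(x - p(u, v)\big) \times n(u, v) = 0
```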
Since this is useful for the constraint, and having multiple
objects with constraints targeting the same guide mesh is quite a
reasonable use case, the mesh boundary edge data is precomputed and
cached in the mesh rather than recalculated over and over again.
Reviewers: mont29
Differential Revision: https://developer.blender.org/D3836
Simple find_nearest relies on a heuristic for efficient culling of
the BVH tree, which involves a fast callback that always updates the
result, and the caller reusing the result of the previous find_nearest
to prime the process for the next vertex.
If the callback is slow and/or applies significant restrictions on
what kind of nodes can qualify for the result, the heuristic can't
work. Thus, for such tasks it is necessary to order and prune nodes
at the BVH tree level, using a priority queue, before invoking the callback.
Since, according to code history, for simple find_nearest the
heuristic approach is faster, this mode has to be an option.
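A minimal sketch of what such an ordered traversal looks like, with a toy fixed-size priority queue; all types and names here are illustrative, not the actual BLI_kdopbvh API:

```c
#include <float.h>

typedef struct Node {
  struct Node *children[2];  /* Both NULL for a leaf. */
  int prim_index;            /* Primitive index, valid for leaves. */
  float bb_min[3], bb_max[3];
} Node;

/* The callback may reject primitives and/or update the best squared distance. */
typedef void (*NearestFn)(void *userdata, int prim_index, const float co[3],
                          float *r_best_dist_sq);

/* Squared distance from a point to a node's bounding box. */
static float node_dist_sq(const Node *node, const float co[3])
{
  float d = 0.0f;
  for (int i = 0; i < 3; i++) {
    if (co[i] < node->bb_min[i]) {
      const float t = node->bb_min[i] - co[i];
      d += t * t;
    }
    else if (co[i] > node->bb_max[i]) {
      const float t = co[i] - node->bb_max[i];
      d += t * t;
    }
  }
  return d;
}

typedef struct QueueItem {
  float dist_sq;
  const Node *node;
} QueueItem;

static void find_nearest_ordered(const Node *root, const float co[3],
                                 NearestFn callback, void *userdata)
{
  QueueItem queue[256];  /* Toy fixed-size binary min-heap keyed by dist_sq. */
  int len = 0;
  float best_dist_sq = FLT_MAX;

  queue[len++] = (QueueItem){node_dist_sq(root, co), root};

  while (len > 0) {
    /* Pop the node whose bounding box is closest to the query point. */
    const QueueItem item = queue[0];
    queue[0] = queue[--len];
    for (int i = 0;;) {  /* Sift down. */
      int child = 2 * i + 1;
      if (child >= len) break;
      if (child + 1 < len && queue[child + 1].dist_sq < queue[child].dist_sq) child++;
      if (queue[i].dist_sq <= queue[child].dist_sq) break;
      const QueueItem t = queue[i]; queue[i] = queue[child]; queue[child] = t;
      i = child;
    }

    if (item.dist_sq >= best_dist_sq) {
      break;  /* Everything still queued is at least as far: prune and stop. */
    }
    if (item.node->children[0] == NULL) {
      callback(userdata, item.node->prim_index, co, &best_dist_sq);
    }
    else {
      for (int c = 0; c < 2; c++) {
        const Node *child_node = item.node->children[c];
        const float d = node_dist_sq(child_node, co);
        if (d < best_dist_sq && len < 256) {
          int i = len++;  /* Push: append, then sift up. */
          queue[i] = (QueueItem){d, child_node};
          while (i > 0 && queue[(i - 1) / 2].dist_sq > queue[i].dist_sq) {
            const QueueItem t = queue[i];
            queue[i] = queue[(i - 1) / 2];
            queue[(i - 1) / 2] = t;
            i = (i - 1) / 2;
          }
        }
      }
    }
  }
}
```

The pruning step is what makes the ordering pay off: once the closest remaining bounding box is already farther than the current best result, the whole rest of the queue can be discarded.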
The main use one can imagine for this is adding tweak controls to
parts of a model that are already deformed by multiple other major
bones. It is natural to expect such locations to deform as if the
tweaks weren't there by default; however, currently there is no easy
way to make a bone follow multiple other bones.
This adds a new constraint that implements the math behind the Armature
modifier, with support for explicit weights, bone envelopes, and dual
quaternion blending. It can also access bones from multiple armatures
at the same time (mainly because it's easier to code it that way).
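For reference, a sketch of the blending this implies, written in standard linear blend skinning notation (my own, not from the patch): with per-bone weights w_i, the matrix path averages the bones' deform transforms, while the dual quaternion path normalizes a weighted sum of the bones' dual quaternions.

```latex
T = \frac{\sum_i w_i\, M_i^{\text{pose}} \big(M_i^{\text{rest}}\big)^{-1}}{\sum_i w_i}
\qquad
\hat{q} = \frac{\sum_i w_i\, \hat{q}_i}{\big\lVert \sum_i w_i\, \hat{q}_i \big\rVert}
```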
This also fixes dquat_to_mat4, which wasn't used anywhere before.
Differential Revision: https://developer.blender.org/D3664
If the user only needs insertion and removal from the top, there is
no need to allocate and manage separate HeapNode objects: the
data can be stored directly in the main tree array.
This measured a 24% FPS increase on a ~50% heap-heavy workload.
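A minimal sketch of the idea; the names are hypothetical, not the actual implementation:

```c
#include <stdlib.h>

typedef struct SimpleHeapNode {
  float value;
  void *ptr;
} SimpleHeapNode;

typedef struct SimpleHeap {
  SimpleHeapNode *tree;  /* The value/pointer pairs live directly in this array. */
  unsigned int size, bufsize;
} SimpleHeap;

static void simple_heap_insert(SimpleHeap *heap, float value, void *ptr)
{
  if (heap->size == heap->bufsize) {
    heap->bufsize *= 2;
    heap->tree = realloc(heap->tree, heap->bufsize * sizeof(SimpleHeapNode));
  }
  /* Sift the hole up; no per-node allocation, the data itself moves. */
  unsigned int i = heap->size++;
  while (i > 0 && heap->tree[(i - 1) / 2].value > value) {
    heap->tree[i] = heap->tree[(i - 1) / 2];
    i = (i - 1) / 2;
  }
  heap->tree[i].value = value;
  heap->tree[i].ptr = ptr;
}
```

Popping the minimum works the same way with a sift-down over the flat array.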
Reviewers: brecht
Differential Revision: https://developer.blender.org/D3898
The index field of a node is supposed to be its actual index, so
there is no need to read it in swap. On a 64-bit processor i and
j are already in registers, so this removes two memory reads.
In addition, cache the tree pointer, use branch hints, and
put the most frequently accessed 'value' field at 0 offset.
Produced a 20% FPS improvement for a 50% heap-heavy workload.
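A minimal sketch of the swap described above, with hypothetical type names: because each node's index always equals its position in the tree array, the known values i and j can be written directly instead of being read and swapped.

```c
typedef struct HeapNode {
  float value;         /* Most frequently compared, so kept at offset 0. */
  unsigned int index;  /* Always equal to this node's position in the tree. */
  void *ptr;
} HeapNode;

typedef struct Heap {
  HeapNode **tree;
  unsigned int size;
} Heap;

static void heap_swap(Heap *heap, const unsigned int i, const unsigned int j)
{
  HeapNode **tree = heap->tree;  /* Cache the tree pointer once. */
  HeapNode *node = tree[i];
  tree[i] = tree[j];
  tree[j] = node;
  tree[i]->index = i;  /* i and j are already in registers: write, don't read. */
  tree[j]->index = j;
}
```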