BLI: refactor IndexMask for better performance and memory usage #104629
Merged
Jacques Lucke
merged 254 commits from 2023-05-24 18:11:47 +02:00
JacquesLucke/blender:index-mask-refactor
into main
Reviewers
Request review
No reviewers
Labels
Clear labels
This issue affects/is about backward or forward compatibility
Issues relating to security: https://wiki.blender.org/wiki/Process/Vulnerability_Reports
Apply labels
Interest
Alembic
Interest
Animation & Rigging
Interest
Asset Browser
Interest
Asset Browser Project Overview
Interest
Audio
Interest
Automated Testing
Interest
Blender Asset Bundle
Interest
BlendFile
Interest
Collada
Interest
Compatibility
This issue affects/is about backward or forward compatibility
Interest
Compositing
Interest
Core
Interest
Cycles
Interest
Dependency Graph
Interest
Development Management
Interest
EEVEE & Viewport
Interest
Freestyle
Interest
Geometry Nodes
Interest
Grease Pencil
Interest
ID Management
Interest
Images & Movies
Interest
Import Export
Interest
Line Art
Interest
Masking
Interest
Metal
Interest
Modeling
Interest
Modifiers
Interest
Motion Tracking
Interest
Nodes & Physics
Interest
OpenGL
Interest
Overrides
Interest
Performance
Interest
Physics
Interest
Pipeline, Assets & IO
Interest
Platforms, Builds & Tests
Interest
Python API
Interest
Render & Cycles
Interest
Render Pipeline
Interest
Sculpt, Paint & Texture
Interest
Text Editor
Interest
Translations
Interest
Triaging
Interest
Undo
Interest
USD
Interest
User Interface
Interest
UV Editing
Interest
VFX & Video
Interest
Video Sequencer
Interest
Virtual Reality
Interest
Vulkan
Interest: Wayland
Legacy
Blender 2.8 Project
Legacy
Milestone 1: Basic, Local Asset Browser
Legacy
OpenGL Error
Meta
Good First Issue
Meta
Papercut
Meta
Retrospective
Meta
Security
Issues relating to security: https://wiki.blender.org/wiki/Process/Vulnerability_Reports
Module
Animation & Rigging
Module
Core
Module
Development Management
Module
EEVEE & Viewport
Module
Grease Pencil
Module
Modeling
Module
Nodes & Physics
Module
Pipeline, Assets & IO
Module
Platforms, Builds & Tests
Module
Python API
Module
Render & Cycles
Module
Sculpt, Paint & Texture
Module
Triaging
Module
User Interface
Module
VFX & Video
Platform
FreeBSD
Platform
Linux
Platform
macOS
Platform
Windows
Priority
High
Priority
Low
Priority
Normal
Priority
Unbreak Now!
Status
Archived
Status
Confirmed
Status
Duplicate
Status
Needs Info from Developers
Status
Needs Information from User
Status
Needs Triage
Status
Resolved
Type
Bug
Type
Design
Type
Known Issue
Type
Patch
Type
Report
Type
To Do
No Label
Interest
Alembic
Interest
Animation & Rigging
Interest
Asset Browser
Interest
Asset Browser Project Overview
Interest
Audio
Interest
Automated Testing
Interest
Blender Asset Bundle
Interest
BlendFile
Interest
Collada
Interest
Compatibility
Interest
Compositing
Interest
Core
Interest
Cycles
Interest
Dependency Graph
Interest
Development Management
Interest
EEVEE & Viewport
Interest
Freestyle
Interest
Geometry Nodes
Interest
Grease Pencil
Interest
ID Management
Interest
Images & Movies
Interest
Import Export
Interest
Line Art
Interest
Masking
Interest
Metal
Interest
Modeling
Interest
Modifiers
Interest
Motion Tracking
Interest
Nodes & Physics
Interest
OpenGL
Interest
Overrides
Interest
Performance
Interest
Physics
Interest
Pipeline, Assets & IO
Interest
Platforms, Builds & Tests
Interest
Python API
Interest
Render & Cycles
Interest
Render Pipeline
Interest
Sculpt, Paint & Texture
Interest
Text Editor
Interest
Translations
Interest
Triaging
Interest
Undo
Interest
USD
Interest
User Interface
Interest
UV Editing
Interest
VFX & Video
Interest
Video Sequencer
Interest
Virtual Reality
Interest
Vulkan
Interest: Wayland
Legacy
Blender 2.8 Project
Legacy
Milestone 1: Basic, Local Asset Browser
Legacy
OpenGL Error
Meta
Good First Issue
Meta
Papercut
Meta
Retrospective
Meta
Security
Module
Animation & Rigging
Module
Core
Module
Development Management
Module
EEVEE & Viewport
Module
Grease Pencil
Module
Modeling
Module
Nodes & Physics
Module
Pipeline, Assets & IO
Module
Platforms, Builds & Tests
Module
Python API
Module
Render & Cycles
Module
Sculpt, Paint & Texture
Module
Triaging
Module
User Interface
Module
VFX & Video
Platform
FreeBSD
Platform
Linux
Platform
macOS
Platform
Windows
Priority
High
Priority
Low
Priority
Normal
Priority
Unbreak Now!
Status
Archived
Status
Confirmed
Status
Duplicate
Status
Needs Info from Developers
Status
Needs Information from User
Status
Needs Triage
Status
Resolved
Type
Bug
Type
Design
Type
Known Issue
Type
Patch
Type
Report
Type
To Do
Milestone
Set milestone
Clear milestone
No items
No Milestone
Projects
Set Project
Clear projects
No project
Assignees
Assign users
Clear assignees
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.
No due date set.
Dependencies
No dependencies set.
Reference: blender/blender#104629
Reference in New Issue
There is no content yet.
Delete Branch "JacquesLucke/blender:index-mask-refactor"
Deleting a branch is permanent. Although the deleted branch may exist for a short time before cleaning up, in most cases it CANNOT be undone. Continue?
Goals of this refactor:
IndexMask
. The oldIndexMask
uses anint64_t
for each index which is more than necessary in pretty much all practical cases currently. I still wouldn't want to simply reduce the size toint32_t
because that could become limiting in the future in case we use this to index e.g. byte buffers larger than a few gigabytes. I also don't want to templateIndexMask
, because that would cause a split in the "ecosystem", or everything would have to be implemented twice or templated.IndexMask
contains a single array. This is generally good but has the problem that it is hard to fill from multiple-threads when the final size is not known from the beginning. This is commonly the case when e.g. converting an array of bool to an index mask. Currently, this kind of code only runs on a single thread.IndexMask
very efficiently. The most important part of that is to avoid all memory access when iterating over continuous ranges. For some core nodes (e.g. math nodes), we generate optimized code for the cases of irregular index masks and simple index ranges.To achieve these goals, a few compromises had to made:
O(log #indices)
now, but with a low constant factor. It should be possible to split a mask into n approximately equally sized parts inO(n)
though, making the time per splitO(1)
.IndexMask
. Therefor,foreach_*
functions with callbacks have to be used. To avoid extra code complexity at the call site, theforeach_*
methods support multi-threading out of the box.The new data structure splits an
IndexMask
into an arbitrary number of orderedIndexMaskSegment
. Each segment can contain at most2^14 = 16384
indices. The indices within a segment are stored asint16_t
. Each segment has an additionalint64_t
offset which allows storing arbitraryint64_t
indices. This approach has the main benefits that segments can be processed/constructed individually on multiple threads without a serial bottleneck. Also it reduces the memory requirements significantly.For more details see comments in
BLI_index_mask.hh
.I did a few tests to verify that the data structure generally improves performance and does not cause regressions:
int64_t
toint16_t
for indices. In the worst case, the memory requirements can be larger when there are many indices that are very far away. However, when they are far away from each other, that indicates that there aren't many indices in total. In common cases, memory usage can be way lower than 1/4 of before, because sub-ranges use static memory.IndexMask::from_bools
inindex_mask_from_selection
on 10.000.000 elements at various probabilities fortrue
at every index:@blender-bot build
@blender-bot build
WIP: BLI: refactor IndexMask for better performance and memory usageto BLI: refactor IndexMask for better performance and memory usage@blender-bot build
@blender-bot build
Looks quite good! It's satisfying to see how this has come together and been simplified. Committing it now/soon will make it easier to track future improvements like improving
from_bits
orcomplement
.I think the API documentation is missing a bit of description of how to choose between
foreach_segment
andforeach_index
(and the corresponding optimized variants). The choice seems a bit arbitrary right now in the various users.@ -31,0 +26,4 @@
* - The most-significant-bit is not used so that signed integers can be used which avoids common
* issues when mixing signed and unsigned ints.
* - The second most-significant bit is not used for indices so that #max_segment_size itself can
* be stored in the #int16_t.
Might be helpful to mention why it's helpful that
max_segment_size
fits inint16_t
@ -73,0 +77,4 @@
/**
* Encodes the size of each segment. The size of a specific segment can be computed by
* subtracting consecutive elements (also see #OffsetIndices). The size of this array is one
* larger than #segments_num_. Note that the first elements is _not_ necessarily zero.
Hmm, why would the first element not be 0 always?
The first element is often not 0 when the
IndexMask
is a slice of another mask.@ -257,3 +294,1 @@
*
* All the indices in the sub-mask are shifted by 3 towards zero,
* so that the first index in the output is zero.
* Same as above but may generate more code at compile time because it optimizes for the case
"optimizes for" might be more helpful if it said "generates a separate case for "
@ -291,0 +353,4 @@
* The class has to be constructed once. Afterwards, `update` has to be called to fill the mask
* with the provided segment.
*/
class IndexMaskFromSegment : NonCopyable, NonMovable {
IndexMaskFromSegment
could probably get a more "private" API, with only publicmask()
andupdate()
methods and everything else private. That might make it more obvious how it's supposed to be used.@ -0,0 +7,4 @@
namespace blender {
/**
* An #OffsetSpan where a constant offset is added to every value when accessed. This allows e.g.
An #OffsetSpan where a constant
->An #OffsetSpan is a #Span with a constant
@ -0,0 +66,4 @@
* [3, 4, 5, 6, 8, 9, 10]
* ^ Range ends here because 6 and 8 are not consecutive.
*/
template<typename T> inline int64_t find_size_of_next_range(const Span<T> indices)
It's pretty clear from the implementations, but it might be nice to put the average and worst case big-O runtime for
find_size_until_next_range
andfind_size_of_next_range
@ -5,0 +28,4 @@
const IndexMask &get_static_index_mask_for_min_size(const int64_t min_size)
{
static constexpr int64_t size_shift = 30;
static constexpr int64_t max_size = (1 << size_shift);
Might as well add
/* 1'073'741'824 */
in a comment after this. Maybe 2 billion or so is safer?It's a bit confusing that
min_size
is passed to this function in general, but I guess it's the only good way to have that assert.@ -44,3 +149,1 @@
}
if (indices_.is_empty()) {
return full_range;
/* TODO: Implement more efficient solution. */
I guess this might come if we work on the operations you wrote earlier using bit masks? Worth doing soon I guess, since it's not great to have this with a very different algorithmic complexity than the proper solution.
It's probably faster to implement this without bit masks. One could just generate a new segment for each old non-range segment and add extra segments for the gaps between old segments.
@ -54,0 +168,4 @@
int64_t group_start_segment_i = 0;
int64_t group_first = segments[0][0];
int64_t group_last = segments[0].last();
bool group_as_range = unique_sorted_indices::non_empty_is_range(segments[0].base_span());
const bool
?@ -90,0 +257,4 @@
LinearAllocator<> &allocator,
Vector<IndexMaskSegment> &r_segments)
{
Vector<std::variant<IndexRange, Span<T>>> segments;
With the
1/2^14
constant factor, a larger inline buffer here could probably eliminate most allocations. Same below withVector<IndexMaskSegment> segments
@ -90,0 +277,4 @@
segment_indices.size());
while (!segment_indices.is_empty()) {
const int64_t offset = segment_indices[0];
const int64_t next_segment_size = binary_search::find_predicate_begin(
Maybe not worth it (also just want to test my understanding)-- couldn't this limit the maximum size of the span argument to
find_predicate_begin
with something likefind_predicate_begin(segment_indices.take_front(max_segment_size + offset),...
?Not sure why the
+ offset
, but taking at mostmax_segment_size
makes sense.In practice it likely doesn't make a difference right now, because the span passed to
segments_from_indices
is already sliced (for multi-threading).@ -135,2 +130,2 @@
return curves_to_delete[curve_i];
});
IndexMaskMemory mask_memory;
const IndexMask &mask_to_delete = IndexMask::from_bools(curves_to_delete, mask_memory);
Probably shouldn't be a reference here
@ -165,2 +164,4 @@
}
}
IndexMaskMemory memory;
const IndexMask changed_curves_mask = IndexMask::from_indices<int64_t>(changed_curves_indices,
It looks like this whole
changed_curves_indices
thing could be replaced withfrom_predicate
?Might not be entirely trivial right now, but generally I agree. Will leave that for later.
@ -323,3 +321,3 @@
MutableSpan<T> dst = attributes.dst[i_attribute].typed<T>();
for (const int i_curve : sliced_selection) {
for (const int i_curve : selection_segment) {
This changes
int
toint64_t
in some places but not others, probably best to be consistent.@ -527,1 +514,3 @@
for (const int i : selection.slice(range)) {
selection.foreach_segment(GrainSize(512), [&](const IndexMaskSegment segment) {
for (const int i : segment) {
nurbs_order[i] = 4;
Any particular reason not to just use
masked_fill
here, rather than writing a for loop? Seems like it would be simpler that way.@ -1068,3 +1051,3 @@
/* Only trimmed curves are no longer cyclic. */
if (bke::SpanAttributeWriter cyclic = dst_attributes.lookup_for_write_span<bool>("cyclic")) {
cyclic.span.fill_indices(selection.indices(), false);
selection.foreach_index(GrainSize(4096), [&](const int64_t i) { cyclic.span[i] = false; });
Might as well used
masked_fill
hereDon't mind much either way, but performance wise it's likely better the way it is now, even though it also doesn't matter too much. Might be good to have a version of
masked_fill
that takesIndexMaskSegment
.I think I'd rather keep the code a bit simpler here, and leave changing from
fill_indices
to something besidesmasked_fill
for a separate commit, where it can be more easily evaluated separately from this larger change.@ -415,2 +408,4 @@
});
});
IndexMaskMemory memory;
This
memory
will fill up while processing in the for loop, sincefrom_indices
doesn't clear the existing memory. Maybe better to declare it inside the loop? Or maybe not, hrmm...Actually, maybe I'll just try to replace this with the same
from_groups
thing from elsewhere.Yeah, difficult, will also leave that for later for now. It's probably good to refactor this a bit more like you mentioned.
Oops. Meant to request changes.
Also, it would be nice to see at least a few performance measurements (or at least some memory usage example case).
The new
foreach
documentation is great, thanks for that!I just left one comment about the code replacing
masked_fill
in two places. Other than that this looks ready to me.@blender-bot build
@blender-bot build
Reviewers