BLI: optimize constructing IndexMask from bits and bools #126888
No reviewers
Labels
No Label
Interest
Alembic
Interest
Animation & Rigging
Interest
Asset System
Interest
Audio
Interest
Automated Testing
Interest
Blender Asset Bundle
Interest
BlendFile
Interest
Code Documentation
Interest
Collada
Interest
Compatibility
Interest
Compositing
Interest
Core
Interest
Cycles
Interest
Dependency Graph
Interest
Development Management
Interest
EEVEE
Interest
Freestyle
Interest
Geometry Nodes
Interest
Grease Pencil
Interest
ID Management
Interest
Images & Movies
Interest
Import Export
Interest
Line Art
Interest
Masking
Interest
Metal
Interest
Modeling
Interest
Modifiers
Interest
Motion Tracking
Interest
Nodes & Physics
Interest
OpenGL
Interest
Overlay
Interest
Overrides
Interest
Performance
Interest
Physics
Interest
Pipeline, Assets & IO
Interest
Platforms, Builds & Tests
Interest
Python API
Interest
Render & Cycles
Interest
Render Pipeline
Interest
Sculpt, Paint & Texture
Interest
Text Editor
Interest
Translations
Interest
Triaging
Interest
Undo
Interest
USD
Interest
User Interface
Interest
UV Editing
Interest
VFX & Video
Interest
Video Sequencer
Interest
Viewport & EEVEE
Interest
Virtual Reality
Interest
Vulkan
Interest
Wayland
Interest
Workbench
Interest: X11
Legacy
Asset Browser Project
Legacy
Blender 2.8 Project
Legacy
Milestone 1: Basic, Local Asset Browser
Legacy
OpenGL Error
Meta
Good First Issue
Meta
Papercut
Meta
Retrospective
Meta
Security
Module
Animation & Rigging
Module
Core
Module
Development Management
Module
Grease Pencil
Module
Modeling
Module
Nodes & Physics
Module
Pipeline, Assets & IO
Module
Platforms, Builds & Tests
Module
Python API
Module
Render & Cycles
Module
Sculpt, Paint & Texture
Module
Triaging
Module
User Interface
Module
VFX & Video
Module
Viewport & EEVEE
Platform
FreeBSD
Platform
Linux
Platform
macOS
Platform
Windows
Severity
High
Severity
Low
Severity
Normal
Severity
Unbreak Now!
Status
Archived
Status
Confirmed
Status
Duplicate
Status
Needs Info from Developers
Status
Needs Information from User
Status
Needs Triage
Status
Resolved
Type
Bug
Type
Design
Type
Known Issue
Type
Patch
Type
Report
Type
To Do
No Milestone
No project
No Assignees
3 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: blender/blender#126888
Loading…
Reference in New Issue
Block a user
No description provided.
Delete Branch "JacquesLucke/blender:index-mask-from-bits"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
This patch optimizes
IndexMask::from_bits
by making use of the fact that many bits can be processed at once and one does not have to look at every bit individual in many cases. Bits are stored as array ofBitInt
(akauint64_t
). So we can process at least 64 bits at a time. On some platforms we can also make use of SIMD and process up to 128 bits at once. This can significantly improve performance if all bits are set/unset.As a byproduct, this patch also optimizes
IndexMask::from_bools
which is now implemented in terms ofIndexMask::from_bits
. The conversion from bools to bits has been optimized significantly too by using SIMD intrinsics.These graphs show how the old vs. new implementation of
IndexMask::from_bits
perform. Some observations:These are the same graphs as above, but are zoomed into the beginning/end.
@blender-bot build
@blender-bot build
@blender-bot build
I had some trouble while implementing a Potrace curves converter since i stated from the generation of the bits image from the bool field (even had to deep into swar & multiplication method to shift 8 bools to char). This makes things much more easier! Even IndexMask is not really necessary, at least before this will be a field kind of\
Other thing is that now we have interface to implement non-trivial predicates which do work with ranges and are able to skip a lot of work while dealing with IndexMask/
Having a periodic offset span also would simplify some things a lot (last my use case was also Potrace converter).
For future, might make sense to check SWAR log2 iteration over seted bits instead of the for lop with shift of mask\
@ -0,0 +105,4 @@
#if BLI_HAVE_SSE2
for (; int_i + 1 < ints_to_check; int_i += 2) {
/* Loads the next 128 bit. */
const __m128i group = _mm_loadu_si128(reinterpret_cast<const __m128i *>(start + int_i));
Wonder if popcount check for all-range-or-all-not will be faster in benchmarks\
Not sure how
popcount
could be faster than just comparing with zero.@ -0,0 +22,4 @@
* \note This data structure has a pre-defined capacity and can not automatically grow once that
* capacity is reached. Use #IndexRangesBuilderBuffer to control the capacity.
*/
template<typename T> class IndexRangesBuilder : NonCopyable, NonMovable {
Wonder if this can have more integration with offset indices as just a chessboard for them\
Not sure what you mean specifically. This seems somewhat unrelated to offset indices.
@ -0,0 +34,4 @@
/* Iterate over chunks of booleans. */
for (; bool_i + 16 < bools.size(); bool_i += 16) {
/* Load 16 bools at once. */
const __m128i group = _mm_loadu_si128(reinterpret_cast<const __m128i *>(bools_ + bool_i));
Think here do have to be additional prefix handling to match arbitrary alignment of bools.
The
u
in_mm_loadu_si128
stands for unaligned. So that should be fine.Ah, i did not noticed difference with
_mm_load_si128
\@ -452,0 +491,4 @@
IndexRangesBuilder<int16_t> &builder)> batch_predicate,
Vector<IndexMaskSegment, 16> &r_segments)
{
IndexRangesBuilderBuffer<int16_t, max_segment_size> builder_buffer;
In the most worst case where number of the gaps is equal to the number of the set bits and distribution is perfectly uniform (1, 0, 1, 0, 1, 0, ...) number of the ranges equal to the number of the set bits and maximum as possible (half of the possible set bits). So this seems like x2 oversized buffer.
Probably max_ranges_in_segment can be another one constant in the header of the IndexMask.hh\
It doesn't really matter if this buffer is oversized. Everything that's not used of it is never touched.
That said, I did notice that a buffer larger than
max_segment_size/2
is probably never needed, because there can be at most that many ranges. Not sure if I'm off by one here.@ -452,0 +503,4 @@
* masks with fewer segments can be build more efficiently, but when iterating over a mask it may
* be benefitial to have more ranges if that means that there are more ranges which can be
* processed more efficiently. This could be exposed to the caller in the future. */
const int64_t threshold = 64;
Can be constexpr/
@ -124,1 +124,4 @@
}
MINLINE int count_bits_uint64(const uint64_t a)
{
return count_bits_i((uint32_t)a) + count_bits_i((uint32_t)(a >> 32));
Note: the SWAR popcount can be scaled even for 64, 128 and 256 and 512 bits, we only need register, mask, shift and add operations. SWAR popcount is log2 asymptotic algorithm\
Yeah, thought so, will leave this to someone else. The code here should not run in practice ideally.
@ -171,3 +171,1 @@
inline void mix_into_first_expr(ExprFn &&expr,
const FirstBitSpanT &first_arg,
const BitSpanT &...args)
inline void mix_into_first_expr(ExprFn &&expr, FirstBitSpanT &&first_arg, const BitSpanT &...args)
Might be nice to commit this separate with an explanation?
@ -0,0 +53,4 @@
}
const int64_t bit_i_to_output_offset = start - start_bit;
/* Iterate over ranges of 1s. For example, the the bits are 0b000111110001111000, the loop
For example, the
->For example, if the
@ -0,0 +55,4 @@
/* Iterate over ranges of 1s. For example, the the bits are 0b000111110001111000, the loop
* below requires two iterations. The worst case for this is when the there are very many small
* ranges of 1s (e.g. 0b10101010101). So far it seemed like the overhead of detecting such
seemed
->seems
Small thing, but usually this sort of thing is written in present tense@ -0,0 +111,4 @@
template<typename T, int64_t MaxRangesNum>
struct IndexRangesBuilderBuffer
: public std::array<T,
I've read that inheriting from types in the
std
namespace causes undefined behavior (https://stackoverflow.com/questions/2034916/is-it-okay-to-inherit-implementation-from-stl-containers-rather-than-delegate)I'm not confident in that, I find the whole thing confusing. But just double checking your confidence level here :P
@ -0,0 +17,4 @@
#ifndef NDEBUG
/* Assert that all bits are zero. This is often already the case and simplifies the code below.
* So it's good if the caller is responsible for zeroing the bits. */
Makes me think of the optimization to use
calloc
when allocating containers if the initial data is zero. Doing that in the future makes me feel better about this decision.Yeah, that would be good. Two notes:
BitVector
is a bit special in that sense, because it initializes the bits always currently, even if no explicit initializer is given. That's necessary to workaround annoying compiler and run-time warnings which can't handle the case when an integer is only partially initialized.or
into the bits. This way it may be more useful in other situations too.@ -452,0 +459,4 @@
memory,
[&](const IndexMaskSegment universe_segment, IndexRangesBuilder<int16_t> &builder) {
const int64_t segment_start = universe_segment[0];
if (unique_sorted_indices::non_empty_is_range(universe_segment.base_span())) {
How about
non_empty_as_range_try
?@ -452,0 +507,4 @@
int64_t next_range_to_process = 0;
int64_t skipped_indices_num = 0;
auto consolidate_skipped_ranges = [&](int64_t end_range_i) {
Could you add a comment above
auto consolidate_skipped_ranges
to mention what it does in some different words? Finding it a bit hard to understand at first@ -476,2 +592,2 @@
return IndexMask::from_predicate(
universe, GrainSize(1024), memory, [bools](const int64_t index) { return !bools[index]; });
BitVector bits(bools);
bits::invert(bits);
I guess this is leaving plenty of room for improvement!
Yeah, I optimized it a bit more to avoid performance regressions in a few cases, but I think there are more potential optimizations. For example, one could always process 64 bools at once: convert them into a
BitInt
and then process that. Currently we still iterate over the data twice, first to convert to bits and then to extract the indices.@blender-bot build