BLI: optimize constructing IndexMask from bits and bools #126888

Merged
Jacques Lucke merged 51 commits from JacquesLucke/blender:index-mask-from-bits into main 2024-08-29 12:15:49 +02:00
Member

This patch optimizes `IndexMask::from_bits` by making use of the fact that many bits can be processed at once; in many cases one does not have to look at every bit individually. Bits are stored as an array of `BitInt` (aka `uint64_t`), so we can process at least 64 bits at a time. On some platforms we can also make use of SIMD and process up to 128 bits at once. This can significantly improve performance if all bits are set/unset.

As a byproduct, this patch also optimizes `IndexMask::from_bools`, which is now implemented in terms of `IndexMask::from_bits`. The conversion from bools to bits has also been optimized significantly by using SIMD intrinsics.
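For illustration, a minimal sketch of the fast path described above (not the actual patch code; `extend_range` and `handle_mixed_word` are hypothetical helpers):

```cpp
#include <cstdint>

using BitInt = uint64_t;

void process_bits(const BitInt *ints, const int64_t ints_num)
{
  for (int64_t i = 0; i < ints_num; i++) {
    const BitInt word = ints[i];
    if (word == 0) {
      /* 64 unset bits: skip them all in a single step. */
      continue;
    }
    if (word == ~BitInt(0)) {
      /* 64 set bits: the whole word joins the current range at once
       * (hypothetical helper, for illustration only). */
      // extend_range(i * 64, 64);
      continue;
    }
    /* Mixed word: only here do individual bits have to be inspected. */
    // handle_mixed_word(word, i * 64);
  }
}
```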

These graphs show how the old and new implementations of `IndexMask::from_bits` perform. Some observations:

  • The new implementation is faster for all tested inputs.
  • The spike on the right happens because there the index mask has the most segments: many ranges are detected, but the ranges are not very long yet.
  • While performance in the middle of the graph is approximately the same, the performance differences at the edges are very significant. This is also what this optimization is mostly about: when the data is simple, the operation should be very fast; the more complex the data gets, the longer it may take. Converting bits to an index mask should "feel free" when the bits are all set or all unset.

![image](/attachments/ef4fb563-6810-49e8-9e1f-f655391aad56)
These are the same graphs as above, but zoomed into the beginning/end.
![image](/attachments/5674007d-1aa3-4c4b-868f-495d961ede05)
![image](/attachments/4c256fa2-e17c-4fb6-8c54-6ebb97625f49)

Jacques Lucke added 41 commits 2024-08-28 16:03:18 +02:00
cleanup
Some checks failed
36db2c16d6
Author
Member

@blender-bot build
Jacques Lucke added 1 commit 2024-08-28 16:17:05 +02:00
fixes
Some checks reported errors
d38b128819
Author
Member

@blender-bot build
Jacques Lucke added 2 commits 2024-08-28 16:28:02 +02:00
fix
All checks were successful
e9a94563f6
Author
Member

@blender-bot build
Jacques Lucke requested review from Hans Goudey 2024-08-28 19:02:02 +02:00
Iliya Katushenock reviewed 2024-08-28 21:50:05 +02:00
Iliya Katushenock left a comment
Member

I had some trouble while implementing a Potrace curves converter, since I started from generating the bit image from the bool field (I even had to dig into SWAR and the multiplication method to pack 8 bools into a char). This makes things much easier! Even the `IndexMask` is not really necessary, at least until this becomes some kind of field.

Another thing is that we now have an interface for implementing non-trivial predicates that work on ranges and can skip a lot of work while dealing with `IndexMask`.

Having a periodic offset span would also simplify some things a lot (my last use case was also the Potrace converter).

For the future, it might make sense to look into SWAR log2 iteration over the set bits instead of the for loop with a shift of the mask.
@ -0,0 +105,4 @@
#if BLI_HAVE_SSE2
for (; int_i + 1 < ints_to_check; int_i += 2) {
/* Loads the next 128 bit. */
const __m128i group = _mm_loadu_si128(reinterpret_cast<const __m128i *>(start + int_i));

I wonder if a popcount check for all-set-or-none-set would be faster in benchmarks.
Author
Member

Not sure how `popcount` could be faster than just comparing with zero.
JacquesLucke marked this conversation as resolved
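For context, this is how such a comparison-based check could look; a minimal sketch assuming SSE2, not the exact patch code:

```cpp
#include <emmintrin.h>

/* True if all 128 bits in the group are zero. */
bool group_is_all_zeros(const __m128i group)
{
  const __m128i eq = _mm_cmpeq_epi8(group, _mm_setzero_si128());
  /* The mask is 0xFFFF only if every byte compared equal. */
  return _mm_movemask_epi8(eq) == 0xFFFF;
}

/* True if all 128 bits in the group are one. */
bool group_is_all_ones(const __m128i group)
{
  const __m128i eq = _mm_cmpeq_epi8(group, _mm_set1_epi8(char(0xFF)));
  return _mm_movemask_epi8(eq) == 0xFFFF;
}
```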
@ -0,0 +22,4 @@
* \note This data structure has a pre-defined capacity and can not automatically grow once that
* capacity is reached. Use #IndexRangesBuilderBuffer to control the capacity.
*/
template<typename T> class IndexRangesBuilder : NonCopyable, NonMovable {

I wonder if this could have more integration with offset indices, as a kind of chessboard for them.
Author
Member

Not sure what you mean specifically. This seems somewhat unrelated to offset indices.
JacquesLucke marked this conversation as resolved
@ -0,0 +34,4 @@
/* Iterate over chunks of booleans. */
for (; bool_i + 16 < bools.size(); bool_i += 16) {
/* Load 16 bools at once. */
const __m128i group = _mm_loadu_si128(reinterpret_cast<const __m128i *>(bools_ + bool_i));

I think there has to be additional prefix handling here to match the arbitrary alignment of the bools.
Author
Member

The `u` in `_mm_loadu_si128` stands for unaligned. So that should be fine.

Ah, I did not notice the difference with `_mm_load_si128`.
JacquesLucke marked this conversation as resolved
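To spell out the distinction, a minimal sketch (assuming SSE2):

```cpp
#include <cstdint>
#include <emmintrin.h>

__m128i load_group(const uint64_t *ptr)
{
  /* `_mm_loadu_si128` accepts any address, so no alignment prefix handling
   * is needed. `_mm_load_si128` would require `ptr` to be 16-byte aligned
   * and is undefined behavior otherwise. */
  return _mm_loadu_si128(reinterpret_cast<const __m128i *>(ptr));
}
```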
@ -452,0 +491,4 @@
IndexRangesBuilder<int16_t> &builder)> batch_predicate,
Vector<IndexMaskSegment, 16> &r_segments)
{
IndexRangesBuilderBuffer<int16_t, max_segment_size> builder_buffer;

In the worst case, where the number of gaps equals the number of set bits and the distribution is perfectly uniform (1, 0, 1, 0, 1, 0, ...), the number of ranges equals the number of set bits and is the maximum possible (half of the possible set bits). So this seems like a 2x oversized buffer.

Probably `max_ranges_in_segment` could be another constant in the header of `IndexMask.hh`.
Author
Member

It doesn't really matter if this buffer is oversized. Everything in it that's not used is never touched.

That said, I did notice that a buffer larger than `max_segment_size/2` is probably never needed, because there can be at most that many ranges. Not sure if I'm off by one here.
JacquesLucke marked this conversation as resolved
@ -452,0 +503,4 @@
* masks with fewer segments can be build more efficiently, but when iterating over a mask it may
* be benefitial to have more ranges if that means that there are more ranges which can be
* processed more efficiently. This could be exposed to the caller in the future. */
const int64_t threshold = 64;

This can be `constexpr`.
JacquesLucke marked this conversation as resolved
@ -124,1 +124,4 @@
}
MINLINE int count_bits_uint64(const uint64_t a)
{
return count_bits_i((uint32_t)a) + count_bits_i((uint32_t)(a >> 32));

Note: the SWAR popcount can be scaled to 64, 128, 256, and even 512 bits; it only needs register, mask, shift, and add operations. SWAR popcount is a log2-asymptotic algorithm.
Author
Member

Yeah, thought so, will leave this to someone else. Ideally, the code here should not run in practice anyway.
JacquesLucke marked this conversation as resolved
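For reference, a minimal sketch of a 64-bit SWAR popcount (the classic bit-twiddling version; a constant number of mask/shift/add/multiply steps instead of a per-bit loop):

```cpp
#include <cstdint>

int count_bits_uint64_swar(uint64_t x)
{
  x = x - ((x >> 1) & 0x5555555555555555ull);                           /* 2-bit sums */
  x = (x & 0x3333333333333333ull) + ((x >> 2) & 0x3333333333333333ull); /* 4-bit sums */
  x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0full;                           /* 8-bit sums */
  return int((x * 0x0101010101010101ull) >> 56);                        /* add all bytes */
}
```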
Hans Goudey approved these changes 2024-08-28 22:20:17 +02:00
@ -171,3 +171,1 @@
inline void mix_into_first_expr(ExprFn &&expr,
const FirstBitSpanT &first_arg,
const BitSpanT &...args)
inline void mix_into_first_expr(ExprFn &&expr, FirstBitSpanT &&first_arg, const BitSpanT &...args)
Member

Might be nice to commit this separately with an explanation?
JacquesLucke marked this conversation as resolved
@ -0,0 +53,4 @@
}
const int64_t bit_i_to_output_offset = start - start_bit;
/* Iterate over ranges of 1s. For example, the the bits are 0b000111110001111000, the loop
Member

`For example, the` -> `For example, if the`
JacquesLucke marked this conversation as resolved
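For illustration, a minimal sketch (assuming C++20 `<bit>`, not the actual patch code) of the kind of loop the quoted comment describes, where each iteration extracts one consecutive run of 1s:

```cpp
#include <bit>
#include <cstdint>

template<typename Fn> void foreach_range_of_ones(uint64_t word, Fn &&fn)
{
  while (word != 0) {
    const int start = std::countr_zero(word);        /* index of the first set bit */
    const int size = std::countr_one(word >> start); /* length of the run of 1s */
    fn(start, size); /* e.g. report the range [start, start + size). */
    /* Clear the processed run; shifting a uint64_t by 64 would be UB. */
    const int end = start + size;
    word = (end == 64) ? 0 : ((word >> end) << end);
  }
}
```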
@ -0,0 +55,4 @@
/* Iterate over ranges of 1s. For example, the the bits are 0b000111110001111000, the loop
* below requires two iterations. The worst case for this is when the there are very many small
* ranges of 1s (e.g. 0b10101010101). So far it seemed like the overhead of detecting such
Member

`seemed` -> `seems`. Small thing, but usually this sort of thing is written in the present tense.
JacquesLucke marked this conversation as resolved
@ -0,0 +111,4 @@
template<typename T, int64_t MaxRangesNum>
struct IndexRangesBuilderBuffer
: public std::array<T,
Member

I've read that inheriting from types in the `std` namespace can cause undefined behavior (https://stackoverflow.com/questions/2034916/is-it-okay-to-inherit-implementation-from-stl-containers-rather-than-delegate).

I'm not confident in that; I find the whole thing confusing. But just double-checking your confidence level here :P
JacquesLucke marked this conversation as resolved
@ -0,0 +17,4 @@
#ifndef NDEBUG
/* Assert that all bits are zero. This is often already the case and simplifies the code below.
* So it's good if the caller is responsible for zeroing the bits. */
Member

Makes me think of the optimization to use `calloc` when allocating containers if the initial data is zero. Doing that in the future makes me feel better about this decision.
Author
Member

Yeah, that would be good. Two notes:

  • `BitVector` is a bit special in that sense, because it currently always initializes the bits, even if no explicit initializer is given. That's necessary to work around annoying compiler and run-time warnings which can't handle the case where an integer is only partially initialized.
  • I changed the semantics of this function to always `or` into the bits. This way it may be more useful in other situations too.
@ -452,0 +459,4 @@
memory,
[&](const IndexMaskSegment universe_segment, IndexRangesBuilder<int16_t> &builder) {
const int64_t segment_start = universe_segment[0];
if (unique_sorted_indices::non_empty_is_range(universe_segment.base_span())) {
Member

How about `non_empty_as_range_try`?
JacquesLucke marked this conversation as resolved
@ -452,0 +507,4 @@
int64_t next_range_to_process = 0;
int64_t skipped_indices_num = 0;
auto consolidate_skipped_ranges = [&](int64_t end_range_i) {
Member

Could you add a comment above `auto consolidate_skipped_ranges` to describe what it does in different words? I find it a bit hard to understand at first.
JacquesLucke marked this conversation as resolved
@ -476,2 +592,2 @@
return IndexMask::from_predicate(
universe, GrainSize(1024), memory, [bools](const int64_t index) { return !bools[index]; });
BitVector bits(bools);
bits::invert(bits);
Member

I guess this is leaving plenty of room for improvement!
Author
Member

Yeah, I optimized it a bit more to avoid performance regressions in a few cases, but I think there are more potential optimizations. For example, one could always process 64 bools at once: convert them into a `BitInt` and then process that. Currently we still iterate over the data twice, first to convert to bits and then to extract the indices.
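For illustration, a minimal sketch of that suggestion (assuming SSE2 and that `bool` is stored as a single byte holding 0 or 1): pack 16 bools into 16 bits per step, so 64 bools become one `BitInt` in four steps:

```cpp
#include <cstdint>
#include <emmintrin.h>

uint64_t bools_to_bit_int(const bool *bools /* 64 elements */)
{
  const __m128i zero = _mm_setzero_si128();
  uint64_t bits = 0;
  for (int i = 0; i < 64; i += 16) {
    /* Load 16 bools at once. */
    const __m128i group = _mm_loadu_si128(reinterpret_cast<const __m128i *>(bools + i));
    /* 0xFF for every true byte, 0x00 for every false byte. */
    const __m128i is_true = _mm_cmpgt_epi8(group, zero);
    /* Gather the high bit of each byte into a 16-bit mask. */
    bits |= uint64_t(uint16_t(_mm_movemask_epi8(is_true))) << i;
  }
  return bits;
}
```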
Jacques Lucke added 4 commits 2024-08-29 11:13:43 +02:00
Jacques Lucke added 3 commits 2024-08-29 11:31:21 +02:00
avoid deriving from std type
All checks were successful
0230c8f3a1
Author
Member

@blender-bot build
Jacques Lucke merged commit 66adedbd78 into main 2024-08-29 12:15:49 +02:00
Jacques Lucke deleted branch index-mask-from-bits 2024-08-29 12:15:52 +02:00
Reference: blender/blender#126888