BLI: optimize constructing IndexMask from bits and bools #126888

Merged
Jacques Lucke merged 51 commits from JacquesLucke/blender:index-mask-from-bits into main 2024-08-29 12:15:49 +02:00
Member

This patch optimizes `IndexMask::from_bits` by making use of the fact that many bits can be processed at once; in many cases one does not have to look at every bit individually. Bits are stored as an array of `BitInt` (aka `uint64_t`), so we can process at least 64 bits at a time. On some platforms we can also make use of SIMD and process up to 128 bits at once. This can significantly improve performance if all bits are set/unset.

As a byproduct, this patch also optimizes `IndexMask::from_bools`, which is now implemented in terms of `IndexMask::from_bits`. The conversion from bools to bits has also been optimized significantly by using SIMD intrinsics.
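For illustration, a minimal sketch of the fast path described above (not the actual patch code; `extend_range` and `handle_mixed_word` are hypothetical helpers):

```cpp
#include <cstdint>

using BitInt = uint64_t;

void process_bits(const BitInt *ints, const int64_t ints_num)
{
  for (int64_t i = 0; i < ints_num; i++) {
    const BitInt word = ints[i];
    if (word == 0) {
      /* 64 unset bits: skip them all in a single step. */
      continue;
    }
    if (word == ~BitInt(0)) {
      /* 64 set bits: the whole word joins the current range at once
       * (hypothetical helper, for illustration only). */
      // extend_range(i * 64, 64);
      continue;
    }
    /* Mixed word: only here do individual bits have to be inspected. */
    // handle_mixed_word(word, i * 64);
  }
}
```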

These graphs show how the old and new implementations of `IndexMask::from_bits` perform. Some observations:

  • The new implementation is faster for all tested inputs.
  • The spike on the right happens because there the index mask has the most segments: many ranges are detected, but the ranges are not very long yet.
  • While performance in the middle of the graph is approximately the same, the performance differences at the edges are very significant. This is also what this optimization is mostly about: when the data is simple, the operation should be very fast; the more complex the data gets, the longer it may take. Converting bits to an index mask should "feel free" when the bits are all set or all unset.

![image](/attachments/ef4fb563-6810-49e8-9e1f-f655391aad56)
These are the same graphs as above, but zoomed into the beginning/end.
![image](/attachments/5674007d-1aa3-4c4b-868f-495d961ede05)
![image](/attachments/4c256fa2-e17c-4fb6-8c54-6ebb97625f49)

Jacques Lucke added 41 commits 2024-08-28 16:03:18 +02:00
cleanup
Some checks failed
36db2c16d6
Author
Member

@blender-bot build
Jacques Lucke added 1 commit 2024-08-28 16:17:05 +02:00
fixes
Some checks reported errors
d38b128819
Author
Member

@blender-bot build
Jacques Lucke added 2 commits 2024-08-28 16:28:02 +02:00
fix
All checks were successful
e9a94563f6
Author
Member

@blender-bot build
Jacques Lucke requested review from Hans Goudey 2024-08-28 19:02:02 +02:00
Iliya Katushenock reviewed 2024-08-28 21:50:05 +02:00
Iliya Katushenock left a comment
Member

I had some trouble while implementing a Potrace curves converter, since I started from generating the bit image from the bool field (I even had to dig into SWAR and the multiplication method to pack 8 bools into a char). This makes things much easier! Even the `IndexMask` is not really necessary, at least until this becomes some kind of field.

Another thing is that we now have an interface for implementing non-trivial predicates that work on ranges and can skip a lot of work while dealing with `IndexMask`.

Having a periodic offset span would also simplify some things a lot (my last use case was also the Potrace converter).

For the future, it might make sense to look into SWAR log2 iteration over the set bits instead of the for loop with a shift of the mask.
@ -0,0 +105,4 @@
#if BLI_HAVE_SSE2
for (; int_i + 1 < ints_to_check; int_i += 2) {
/* Loads the next 128 bit. */
const __m128i group = _mm_loadu_si128(reinterpret_cast<const __m128i *>(start + int_i));

I wonder if a popcount check for all-set-or-none-set would be faster in benchmarks.
Author
Member

Not sure how `popcount` could be faster than just comparing with zero.
JacquesLucke marked this conversation as resolved
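For context, this is how such a comparison-based check could look; a minimal sketch assuming SSE2, not the exact patch code:

```cpp
#include <emmintrin.h>

/* True if all 128 bits in the group are zero. */
bool group_is_all_zeros(const __m128i group)
{
  const __m128i eq = _mm_cmpeq_epi8(group, _mm_setzero_si128());
  /* The mask is 0xFFFF only if every byte compared equal. */
  return _mm_movemask_epi8(eq) == 0xFFFF;
}

/* True if all 128 bits in the group are one. */
bool group_is_all_ones(const __m128i group)
{
  const __m128i eq = _mm_cmpeq_epi8(group, _mm_set1_epi8(char(0xFF)));
  return _mm_movemask_epi8(eq) == 0xFFFF;
}
```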
@ -0,0 +22,4 @@
* \note This data structure has a pre-defined capacity and can not automatically grow once that
* capacity is reached. Use #IndexRangesBuilderBuffer to control the capacity.
*/
template<typename T> class IndexRangesBuilder : NonCopyable, NonMovable {

I wonder if this could have more integration with offset indices, as a kind of chessboard for them.
Author
Member

Not sure what you mean specifically. This seems somewhat unrelated to offset indices.
JacquesLucke marked this conversation as resolved
@ -0,0 +34,4 @@
/* Iterate over chunks of booleans. */
for (; bool_i + 16 < bools.size(); bool_i += 16) {
/* Load 16 bools at once. */
const __m128i group = _mm_loadu_si128(reinterpret_cast<const __m128i *>(bools_ + bool_i));

I think there has to be additional prefix handling here to match the arbitrary alignment of the bools.
Author
Member

The `u` in `_mm_loadu_si128` stands for unaligned. So that should be fine.

Ah, I did not notice the difference with `_mm_load_si128`.
JacquesLucke marked this conversation as resolved
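To spell out the distinction, a minimal sketch (assuming SSE2):

```cpp
#include <cstdint>
#include <emmintrin.h>

__m128i load_group(const uint64_t *ptr)
{
  /* `_mm_loadu_si128` accepts any address, so no alignment prefix handling
   * is needed. `_mm_load_si128` would require `ptr` to be 16-byte aligned
   * and is undefined behavior otherwise. */
  return _mm_loadu_si128(reinterpret_cast<const __m128i *>(ptr));
}
```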
@ -452,0 +491,4 @@
IndexRangesBuilder<int16_t> &builder)> batch_predicate,
Vector<IndexMaskSegment, 16> &r_segments)
{
IndexRangesBuilderBuffer<int16_t, max_segment_size> builder_buffer;

In the worst case, where the number of gaps equals the number of set bits and the distribution is perfectly uniform (1, 0, 1, 0, 1, 0, ...), the number of ranges equals the number of set bits and is the maximum possible (half of the possible set bits). So this seems like a 2x oversized buffer.

Probably `max_ranges_in_segment` could be another constant in the header of `IndexMask.hh`.
Author
Member

It doesn't really matter if this buffer is oversized. Everything in it that's not used is never touched.

That said, I did notice that a buffer larger than `max_segment_size/2` is probably never needed, because there can be at most that many ranges. Not sure if I'm off by one here.
JacquesLucke marked this conversation as resolved
@ -452,0 +503,4 @@
* masks with fewer segments can be build more efficiently, but when iterating over a mask it may
* be benefitial to have more ranges if that means that there are more ranges which can be
* processed more efficiently. This could be exposed to the caller in the future. */
const int64_t threshold = 64;

This can be `constexpr`.
JacquesLucke marked this conversation as resolved
@ -124,1 +124,4 @@
}
MINLINE int count_bits_uint64(const uint64_t a)
{
return count_bits_i((uint32_t)a) + count_bits_i((uint32_t)(a >> 32));

Note: the SWAR popcount can be scaled to 64, 128, 256, and even 512 bits; it only needs register, mask, shift, and add operations. SWAR popcount is a log2-asymptotic algorithm.
Author
Member

Yeah, thought so, will leave this to someone else. Ideally, the code here should not run in practice anyway.
JacquesLucke marked this conversation as resolved
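For reference, a minimal sketch of a 64-bit SWAR popcount (the classic bit-twiddling version; a constant number of mask/shift/add/multiply steps instead of a per-bit loop):

```cpp
#include <cstdint>

int count_bits_uint64_swar(uint64_t x)
{
  x = x - ((x >> 1) & 0x5555555555555555ull);                           /* 2-bit sums */
  x = (x & 0x3333333333333333ull) + ((x >> 2) & 0x3333333333333333ull); /* 4-bit sums */
  x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0full;                           /* 8-bit sums */
  return int((x * 0x0101010101010101ull) >> 56);                        /* add all bytes */
}
```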
Hans Goudey approved these changes 2024-08-28 22:20:17 +02:00
@ -171,3 +171,1 @@
inline void mix_into_first_expr(ExprFn &&expr,
const FirstBitSpanT &first_arg,
const BitSpanT &...args)
inline void mix_into_first_expr(ExprFn &&expr, FirstBitSpanT &&first_arg, const BitSpanT &...args)
Member

Might be nice to commit this separately with an explanation?
JacquesLucke marked this conversation as resolved
@ -0,0 +53,4 @@
}
const int64_t bit_i_to_output_offset = start - start_bit;
/* Iterate over ranges of 1s. For example, the the bits are 0b000111110001111000, the loop
Member

`For example, the` -> `For example, if the`
JacquesLucke marked this conversation as resolved
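For illustration, a minimal sketch (assuming C++20 `<bit>`, not the actual patch code) of the kind of loop the quoted comment describes, where each iteration extracts one consecutive run of 1s:

```cpp
#include <bit>
#include <cstdint>

template<typename Fn> void foreach_range_of_ones(uint64_t word, Fn &&fn)
{
  while (word != 0) {
    const int start = std::countr_zero(word);        /* index of the first set bit */
    const int size = std::countr_one(word >> start); /* length of the run of 1s */
    fn(start, size); /* e.g. report the range [start, start + size). */
    /* Clear the processed run; shifting a uint64_t by 64 would be UB. */
    const int end = start + size;
    word = (end == 64) ? 0 : ((word >> end) << end);
  }
}
```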
@ -0,0 +55,4 @@
/* Iterate over ranges of 1s. For example, the the bits are 0b000111110001111000, the loop
* below requires two iterations. The worst case for this is when the there are very many small
* ranges of 1s (e.g. 0b10101010101). So far it seemed like the overhead of detecting such
Member

`seemed` -> `seems`. Small thing, but usually this sort of thing is written in the present tense.
JacquesLucke marked this conversation as resolved
@ -0,0 +111,4 @@
template<typename T, int64_t MaxRangesNum>
struct IndexRangesBuilderBuffer
: public std::array<T,
Member

I've read that inheriting from types in the `std` namespace can cause undefined behavior (https://stackoverflow.com/questions/2034916/is-it-okay-to-inherit-implementation-from-stl-containers-rather-than-delegate).

I'm not confident in that; I find the whole thing confusing. But just double-checking your confidence level here :P
JacquesLucke marked this conversation as resolved
@ -0,0 +17,4 @@
#ifndef NDEBUG
/* Assert that all bits are zero. This is often already the case and simplifies the code below.
* So it's good if the caller is responsible for zeroing the bits. */
Member

Makes me think of the optimization to use `calloc` when allocating containers if the initial data is zero. Doing that in the future makes me feel better about this decision.
Author
Member

Yeah, that would be good. Two notes:

  • `BitVector` is a bit special in that sense, because it currently always initializes the bits, even if no explicit initializer is given. That's necessary to work around annoying compiler and run-time warnings which can't handle the case where an integer is only partially initialized.
  • I changed the semantics of this function to always `or` into the bits. This way it may be more useful in other situations too.
@ -452,0 +459,4 @@
memory,
[&](const IndexMaskSegment universe_segment, IndexRangesBuilder<int16_t> &builder) {
const int64_t segment_start = universe_segment[0];
if (unique_sorted_indices::non_empty_is_range(universe_segment.base_span())) {
Member

How about `non_empty_as_range_try`?
JacquesLucke marked this conversation as resolved
@ -452,0 +507,4 @@
int64_t next_range_to_process = 0;
int64_t skipped_indices_num = 0;
auto consolidate_skipped_ranges = [&](int64_t end_range_i) {
Member

Could you add a comment above `auto consolidate_skipped_ranges` to describe what it does in different words? I find it a bit hard to understand at first.
JacquesLucke marked this conversation as resolved
@ -476,2 +592,2 @@
return IndexMask::from_predicate(
universe, GrainSize(1024), memory, [bools](const int64_t index) { return !bools[index]; });
BitVector bits(bools);
bits::invert(bits);
Member

I guess this is leaving plenty of room for improvement!
Author
Member

Yeah, I optimized it a bit more to avoid performance regressions in a few cases, but I think there are more potential optimizations. For example, one could always process 64 bools at once: convert them into a `BitInt` and then process that. Currently we still iterate over the data twice, first to convert to bits and then to extract the indices.
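For illustration, a minimal sketch of that suggestion (assuming SSE2 and that `bool` is stored as a single byte holding 0 or 1): pack 16 bools into 16 bits per step, so 64 bools become one `BitInt` in four steps:

```cpp
#include <cstdint>
#include <emmintrin.h>

uint64_t bools_to_bit_int(const bool *bools /* 64 elements */)
{
  const __m128i zero = _mm_setzero_si128();
  uint64_t bits = 0;
  for (int i = 0; i < 64; i += 16) {
    /* Load 16 bools at once. */
    const __m128i group = _mm_loadu_si128(reinterpret_cast<const __m128i *>(bools + i));
    /* 0xFF for every true byte, 0x00 for every false byte. */
    const __m128i is_true = _mm_cmpgt_epi8(group, zero);
    /* Gather the high bit of each byte into a 16-bit mask. */
    bits |= uint64_t(uint16_t(_mm_movemask_epi8(is_true))) << i;
  }
  return bits;
}
```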
Jacques Lucke added 4 commits 2024-08-29 11:13:43 +02:00
Jacques Lucke added 3 commits 2024-08-29 11:31:21 +02:00
avoid deriving from std type
All checks were successful
0230c8f3a1
Author
Member

@blender-bot build
Jacques Lucke merged commit 66adedbd78 into main 2024-08-29 12:15:49 +02:00
Jacques Lucke deleted branch index-mask-from-bits 2024-08-29 12:15:52 +02:00
Reference: blender/blender#126888