BLI: Optimize utility for index counting #109628

Hans Goudey · 2023-07-03T00:43:39+02:00

Hans Goudey commented

2023-07-03 00:43:39 +02:00

The utility counts the number of occurrences of each index in an array.
It's used to build offsets for mesh topology maps, or to count the
number of connected elements. Some users are geometry nodes,
the subdivision draw cache, and mesh to curve conversion.
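
For readers unfamiliar with the pattern, here is a minimal single-threaded sketch of what such a counting utility does and how the counts become offsets. It uses plain standard-library containers instead of Blender's own span types, and the names `count_indices_example` and `counts_to_offsets_example` are hypothetical, chosen only for illustration.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

/* Hypothetical helper: count how often each index in [0, domain_size) occurs. */
std::vector<int> count_indices_example(const std::vector<int> &indices,
                                       const int64_t domain_size)
{
  std::vector<int> counts(domain_size, 0);
  for (const int index : indices) {
    counts[index]++;
  }
  return counts;
}

/* Hypothetical helper: turn counts into offsets with a running sum, so that
 * element i owns the range [offsets[i], offsets[i + 1]). */
std::vector<int> counts_to_offsets_example(const std::vector<int> &counts)
{
  std::vector<int> offsets(counts.size() + 1, 0);
  std::partial_sum(counts.begin(), counts.end(), offsets.begin() + 1);
  return offsets;
}
```

For example, if the input holds the vertex index of every edge endpoint, the counts give the number of edges connected to each vertex, and the offsets give where each vertex's edges start in a flat vertex-to-edge map.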

This PR parallelizes the counting to take advantage of multiple
threads. On a Ryzen 7950x, when counting connected edges to
vertices, I observed an improvement from 10.2 to 3.0 ms.
This most likely makes the counting less efficient in terms of total
CPU work, but the wall-clock improvement is quite nice.

The new code was much slower for me with fewer than four threads,
so I added a check so that counting remains single-threaded in
that case.
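
A rough sketch of the approach described here, written with `std::thread` and `std::atomic` from the standard library rather than Blender's own threading and atomic utilities. The function name, the explicit `threads_num` parameter, and the even chunking are assumptions made for illustration. Only the "fewer than four threads" fallback comes from the description above.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

/* Hypothetical sketch: count index occurrences, in parallel when enough threads exist. */
std::vector<int> count_indices_parallel_example(const std::vector<int> &indices,
                                                const int64_t domain_size,
                                                const int threads_num)
{
  std::vector<int> result(domain_size, 0);
  if (threads_num < 4) {
    /* Single-threaded fallback: plain increments, no atomic overhead. */
    for (const int index : indices) {
      result[index]++;
    }
    return result;
  }

  std::vector<std::atomic<int>> counts(domain_size);
  for (std::atomic<int> &count : counts) {
    count.store(0, std::memory_order_relaxed);
  }

  /* Split the input into one contiguous chunk per thread. */
  const int64_t total = int64_t(indices.size());
  const int64_t chunk = (total + threads_num - 1) / threads_num;
  std::vector<std::thread> workers;
  for (int t = 0; t < threads_num; t++) {
    const int64_t begin = std::min<int64_t>(t * chunk, total);
    const int64_t end = std::min<int64_t>(begin + chunk, total);
    workers.emplace_back([&indices, &counts, begin, end]() {
      for (int64_t i = begin; i < end; i++) {
        /* Different threads may hit the same counter, so the increment must be atomic. */
        counts[indices[i]].fetch_add(1, std::memory_order_relaxed);
      }
    });
  }
  for (std::thread &worker : workers) {
    worker.join();
  }

  for (int64_t i = 0; i < domain_size; i++) {
    result[i] = counts[i].load(std::memory_order_relaxed);
  }
  return result;
}
```

Relaxed ordering is enough in this sketch because only the final totals matter and `join()` synchronizes with the end of each worker before the counts are read.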

Here you can see the change in assembly in [godbolt](https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGIAOykrgAyeAyYAHI%2BAEaYxBIAzKQADqgKhE4MHt6%2BASlpGQKh4VEssfFcSXaYDplCBEzEBNk%2BfoHVtQL1jQTFkTFxibYNTS257SO9Yf1lg5UAlLaoXsTI7BzmCWHI3lgA1CYJbgrJhofYJhoAgpvbu5gHR8gKBPiC55c31wD03wdmWyoewgWBo4XQEAA%2BpCAOIRORuaHzf4ANnMKL2YDAoOmEOhO0MwCR80%2BvwOAFYrJTgAwvCAQKhko4WHgAF6YCDmMwAWioXgYxE8tG5tFQjIUXJJ5IAIiYZRSqRYaXSGUy8Cz2ZyzGYAPIJSVy2Uy0l/TauPBUE17ABuqDw6D2aH5BEhYXwawUkKohmQAE8IGgGC89mECHsAFTQ4iYF7EPAOEMMd0x0hWvbpjOZrPZjOB4OhlGSSFht3xmOQ2ksVM/P45uv10MRqMxghxhNOwQKEk1g7%2BKw99P8YjAgtFkuPaV7DSHCwhx5uRPJz2Vmch6zWZEmPtp3PLTtyiylj0HvCGs%2BWdcJftXMnpreynv3z4Ha63IEgzBgzB42HwxGQzczDRIDMWxT9cShSECSMYlPgPA9lXpRlmTZDkuV5flBVoYVRXFA1jXleDqVpJC1Q1NDtT1fCjQfG4AXNIFPlte10w7F0j3LaImHSZAAwEYMXnQekTjOI5Q3ORcywUUg9kE4TTgYQ43HEhJsEdPcCC7ODt1fAELT2HFwUg6CiQAuCAWSYgmGAFgmD2TChXMhIGOfPYh2BPMS0EOcQEkj1Nx0q5MzYiVKVPY0LwvK9XKfV9/Foz4OEWWhOHJXg/A4LRSFQTg3HXSxZOWVYHk2HhSAITQksWABrEAEgSAA6ermpa1qUX0ThJHSyrss4XgFBADRysqxY4FgJA0BYZI6DichKEm6b6HiEyuBRDQhqwa0ywANTwTAAHcdWSRhODKmhaAIOIBogaIeuiMJGl9U7eHu5hiF9HVom0GoKu4XhJrYQQdQYWgnsy3gsGiLxgDcMRaAGv7SCwWyjHEcGkbwaNamtGMeswVQai8S7nvIQRMBS9HaDwaIrPejwsB61t1WexYqAMYAFF2g6jpOxH%2BEEEQxHYKQZEERQVHUdHdC4fRCRQfKbCp6IBsgRZkMyBHuT1PZuUEw5pVMSKLDMDQdds1YEH11kGBx4guuy224ywFWIEWDpHGcCBXDGPwZZCaZSnKPRUnSD2sk8Vpg4KMO%2BkDuZbHJn66kmH29Hd5OeljgYKmGHpU5ll5M4D7OJDdoq1lLjqODS0gMqynKOD2VQAA4UW5QtHQMIw9lWhqND74FcEIEh/gSLh5l4X6tHmGq6sa1qF%2Ba9qKftuveAb/rBuG8HRpgRAGSJ5Iibmvippm4gIlYdYW7bjvgGQZAe8asrv2Hp29H54RRHEEXP/FtQerS1IPtKyyQWZVxrmvXqHAdSHyJnsVAQIb7t0kJ3QkPcUR9wHhADwZ8lqj3HpPEas96pNUXgvKuq8eob1sHoKeVUq5mG6ujGh9CZ6kFtukZwkggA%3D%3D%3D)

The test case: ![image](/attachments/30483a16-d8e8-43b7-9fe0-d61340e7367b)


🎉 1

Hans Goudey added 1 commit 2023-07-03 00:43:47 +02:00

e05df28192 BLI: Add utility for index counting

The utility counts the number of occurrences of each index in an array.
This is used to build offsets for mesh topology maps, or to count the
number of connected elements. Some users are geometry nodes,
the subdivision draw cache, and mesh to curve conversion.

Now that the utility is in one place, it's reasonable to optimize it
with compiler flags. On GCC, unrolling the loop gave me a 1.9x
performance improvement: counting the number of corners for each
vertex in a 4 million vertex mesh went from 7.4 to 3.9 ms.

In a couple of places this improves code reuse, sharing the
implementation of the pattern where it was repeated before.

Iliya Katushenock commented

2023-07-03 01:43:24 +02:00

MSVC 23

Timer 'new': (Average: 13.7 ms, Min: 8.8 ms, Last: 12.8 ms)
Timer 'old': (Average: 13.8 ms, Min: 11.1 ms, Last: 14.7 ms)

Test (cube with subdiv):

![image](/attachments/05cfb714-f309-4ed3-bb3e-096b9432aba5)

test array utils ms.txt

1.3 KiB


👍 1

Hans Goudey added 1 commit 2023-07-03 04:18:06 +02:00

bdc1bf8048 Merge branch 'main' into bli-index-count

Hans Goudey added this to the Core Libraries project 2023-07-03 13:38:04 +02:00

Hans Goudey requested review from Jacques Lucke 2023-07-03 16:00:08 +02:00

Hans Goudey added 1 commit 2023-07-03 16:02:59 +02:00

c945c4cf51 Merge branch 'main' into bli-index-count

Jacques Lucke approved these changes 2023-07-03 19:35:56 +02:00

Jacques Lucke left a comment

I can reproduce the speedup. Noted some possible improvements inline.

source/blender/blenlib/intern/array_utils.cc Outdated

						
				@ -45,6 +45,27 @@ void gather(const GSpan src, const IndexMask &indices, GMutableSpan dst, const i

				  gather(GVArray::ForSpan(src), indices, dst, grain_size);

				}

				#if (defined(__GNUC__) && !defined(__clang__))

Jacques Lucke commented

2023-07-03 19:27:59 +02:00

At this point it may be nice to have a define for these optimization flags, what do you think?

source/blender/blenlib/intern/array_utils.cc Outdated

						
				@ -48,0 +57,4 @@

				#  pragma unroll

				#endif

				  for (int64_t i = 0; i < indices_num; i++) {

				    counts[indices[i]]++;

Jacques Lucke commented

2023-07-03 19:28:50 +02:00

Would be interesting to check if using parallelism with atomic increments can help improve performance here.

Hans Goudey commented

2023-07-03 19:38:14 +02:00

I don't remember the numbers, but I remember that being slower. I'll test it again though!

Hans Goudey commented

2023-07-04 01:53:21 +02:00

I'm glad I tested again; this was much faster! It's also simpler, since it doesn't require compiler-specific flags.

🎉 1

Hans Goudey referenced this issue from a commit

2023-07-04 00:47:17 +02:00

Cleanup: Extract utility for counting indices

Hans Goudey added 4 commits 2023-07-04 01:46:54 +02:00

f726797376 Merge branch 'main' into bli-index-count

fbca37ff01 Remove performance changes

848024900b Merge branch 'main' into bli-index-count

d9286288f2 Use atomic implementation

Hans Goudey changed title from ~~BLI: Add and optimize utility for index counting~~ to BLI: Optimize utility for index counting

2023-07-04 01:48:01 +02:00

Hans Goudey requested review from Jacques Lucke 2023-07-04 01:53:27 +02:00

Jacques Lucke commented

2023-07-04 10:11:05 +02:00

Can you say something about how the algorithm scales with more threads?

Jacques Lucke approved these changes 2023-07-04 10:11:17 +02:00

Hans Goudey commented

2023-07-05 15:46:30 +02:00

Good idea!

![image](/attachments/df2f2e1f-6283-4068-9e57-4a9734af16e2)

"Before" was at 10.2 ms BTW. Raw data: https://docs.google.com/spreadsheets/d/1a4IVNPM2Gud7yYLuaRJmz3viCb8PCAsnVyeZXY8pCqk/edit?usp=sharing

I was a little surprised by the slowdown at low thread counts, since I expected atomics to be cheap when there isn't much contention, but I think the higher thread counts just make up for the new overhead quite effectively.

I added a check so the code is still single-threaded if there are fewer than 4 threads. That seems like a good way to avoid the worst of the performance degradation, though I'm sure that heuristic will be less effective with different source data or hardware.


Hans Goudey added 2 commits 2023-07-05 15:47:38 +02:00

2b8d53f669 Merge branch 'main' into bli-index-count

5daa89666b Add check for thread count < 4

Hans Goudey merged commit 53416281bd into main

2023-07-05 16:05:22 +02:00

Hans Goudey referenced this issue from a commit

2023-07-05 16:05:23 +02:00

BLI: Optimize utility for index counting

Hans Goudey deleted branch bli-index-count

2023-07-05 16:05:23 +02:00