BLI: faster float<->half array conversions, use in Vulkan #127838

Merged
Aras Pranckevicius merged 5 commits from aras_p/blender:fp16_conv_batch into main 2024-09-22 17:40:13 +02:00

In addition to the float<->half functions that convert one number (#127708), add `float_to_half_array` and `half_to_float_array` functions:

  • On x64, this uses an SSE2 4-wide implementation to do the conversion (2x faster half->float, 4x faster float->half compared to scalar).
    • There is also an AVX2 codepath that uses the CPU's hardware F16C instructions (8-wide), to be used when/if the Blender codebase starts being built for AVX2 (today it is not yet).
  • On arm64, this uses NEON VCVT instructions to do the conversion.

Use these functions in the Vulkan buffer/texture conversion code. Time taken to convert a float->half texture while viewing an EXR file in the image space (22M numbers to convert): 39.7ms -> 10.1ms (would be 6.9ms if building for AVX2).

Aras Pranckevicius added 3 commits 2024-09-19 10:30:11 +02:00
So far only simple loop over data using scalar functions.
Still, converting 23M float->half numbers (viewing EXR image) for Vulkan
on Ryzen 5950X: 39.7ms -> 25.4ms
BLI: NEON VCVT path in half<->float array conversions
c31479b1e8
Aras Pranckevicius requested review from Sergey Sharybin 2024-09-19 11:29:46 +02:00
Aras Pranckevicius requested review from Jeroen Bakker 2024-09-19 11:29:53 +02:00
Jeroen Bakker reviewed 2024-09-19 11:59:13 +02:00
Jeroen Bakker left a comment
Member

I didn't test the code, so I only added some small comments.
@ -117,0 +246,4 @@
src += 4;
dst += 4;
}
#endif
Member

I would add a comment that this will convert the remaining elements.
aras_p marked this conversation as resolved
@ -1004,3 +1004,3 @@
case ConversionType::FLOAT_TO_HALF:
convert_per_component<F16, F32>(dst_memory, src_memory, buffer_size, device_format);
blender::math::float_to_half_array(static_cast<const float *>(src_memory),
Member

Can we remove

static void convert(F16 &dst, const F32 &src)
{
  dst.value = math::float_to_half(src.value);
}

static void convert(F32 &dst, const F16 &src)
{
  dst.value = math::half_to_float(src.value);
}

as those should not be used anymore.

aras_p marked this conversation as resolved
Aras Pranckevicius added 1 commit 2024-09-19 12:07:48 +02:00
Sergey Sharybin reviewed 2024-09-19 12:21:21 +02:00
@ -117,0 +222,4 @@
{
size_t i = 0;
#if defined(USE_HARDWARE_FP16_F16C) /* 8-wide loop using AVX2 F16C */
for (; i + 7 < length; i += 8) {

Not for this patch, but perhaps we should do a runtime check for intrinsics for such functions.
Author
Member

Yeah, I thought about that, but within Blender there's no way to query CPU capability bits right now, right? (Only very indirectly through e.g. "what does ffmpeg think our CPU caps are?" and so on.) Or if there is, where is it?

Check `intern/cycles/util/system.cpp`: `system_cpu_capabilities()`, `system_cpu_support_sse42()`, `system_cpu_support_avx2()`. We can copy this function to the Blender side.
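A minimal sketch of such a runtime check (hypothetical; in practice Blender would presumably copy Cycles' `system_cpu_capabilities()` rather than this). F16C support is reported in CPUID leaf 1, ECX bit 29; `<cpuid.h>` is the GCC/Clang interface, MSVC would use `__cpuid` from `<intrin.h>` instead.

```cpp
/* Hypothetical runtime F16C detection via CPUID (leaf 1, ECX bit 29).
 * Non-x86 builds simply report false. */
#if defined(__x86_64__) || defined(__i386__)
#  include <cpuid.h>
#endif

static bool cpu_supports_f16c()
{
#if defined(__x86_64__) || defined(__i386__)
  unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
    return false; /* CPUID leaf 1 not available. */
  }
  return (ecx & (1u << 29)) != 0;
#else
  return false;
#endif
}
```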
@ -117,0 +231,4 @@
}
#elif defined(USE_SSE2_FP16) /* 4-wide loop using SSE2 */
for (; i + 3 < length; i += 4) {
__m128 src4 = _mm_loadu_ps(src);

It is a bit annoying to do unaligned reads. Maybe it is beneficial to check `src` and `dst` alignment and have a dedicated code path for that case?
Author
Member

I checked on my PC (Ryzen 5950X) whether replacing unaligned loads with aligned ones brings any performance benefit, and as I expected... it is not faster at all; i.e. same performance.

Then I checked Agner Fog's CPU instruction latency/throughput tables, and basically these days there's no performance difference between unaligned load/store and aligned load/store (unaligned are slower in case your data crosses cacheline, but normally that does not happen). Both latency and throughput of unaligned vs aligned have been the same since Intel Ivy Bridge (2012) and AMD Bulldozer (2011).

aras_p marked this conversation as resolved
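For reference, a sketch of the load flavor under discussion: `_mm_load_ps` requires a 16-byte-aligned pointer (and faults otherwise), while `_mm_loadu_ps` accepts any address, with identical throughput on post-2011/2012 CPUs as long as the data stays within one cache line. The function name here is made up for illustration; a portable fallback is provided for non-SSE2 targets.

```cpp
#if defined(__SSE2__) || defined(_M_X64)
#  include <emmintrin.h>

/* Sum four floats starting at an arbitrary, possibly unaligned, address.
 * _mm_load_ps would require 16-byte alignment; _mm_loadu_ps does not. */
static float sum4(const float *p)
{
  __m128 v = _mm_loadu_ps(p); /* unaligned load: any address is fine */
  float tmp[4];
  _mm_storeu_ps(tmp, v);
  return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
#else
/* Portable fallback for non-SSE2 targets. */
static float sum4(const float *p)
{
  return p[0] + p[1] + p[2] + p[3];
}
#endif
```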
@ -117,0 +240,4 @@
}
#elif defined(USE_HARDWARE_FP16_NEON) /* 4-wide loop using NEON */
for (; i + 3 < length; i += 4) {
float32x4_t src4 = vld1q_f32(src);

Did you experiment with using more than one register for conversion? Something like an extra loop

  for (; i + 7 < length; i += 8) {
    float16x4_t src4_1 = vld1_f16((const float16_t *)src);
    float16x4_t src4_2 = vld1_f16((const float16_t *)src + 4);
    float32x4_t f4_1 = vcvt_f32_f16(src4_1);
    float32x4_t f4_2 = vcvt_f32_f16(src4_2);

    vst1q_f32(dst, f4_1);
    vst1q_f32(dst + 4, f4_2);
    src += 8;
    dst += 8;
  }

I didn't check details for this use-case, but for kernels like dot-product having multiple accumulators gives measurable speedup.

Author
Member

I can try, but the primary reason this helps accumulation/dot-product is that those loops are serial, and by doing 2 (or more) accumulations at once you allow the CPU to process more things in parallel. This half<->float loop does *not* have dependencies between loop iterations; the CPU can already schedule/process ahead as much as it can.

Sure. I am just on the skeptical side and like to double-check that the CPU actually does what you'd logically expect from it. But it is not something I'd call essential for this PR.
Author
Member

Yeah, just tried going 2x wider and 4x wider within one loop iteration. On Mac M1 (NEON path), does not bring any performance benefits at all.


Thanks for checking!

Sergey marked this conversation as resolved
@ -107,0 +171,4 @@
double t0 = BLI_time_now_seconds();
size_t sum = 0;
blender::math::half_to_float_array(src, dst, test_size);
for (int i = 0; i < test_size; i++) {

I don't think we should be including this loop in the timing.
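The pattern being suggested, sketched with `std::chrono` standing in for `BLI_time_now_seconds()` and a dummy widening loop standing in for the real array conversion (both names here are stand-ins, not the test's actual code): stop the clock before the checksum loop, which exists only to keep the result alive.

```cpp
#include <chrono>
#include <cstdint>

static double now_seconds()
{
  using clock = std::chrono::steady_clock;
  return std::chrono::duration<double>(clock::now().time_since_epoch()).count();
}

/* Returns the measured time; *r_sum receives a keep-alive checksum. */
static double time_conversion(const uint16_t *src, float *dst, size_t n, uint64_t *r_sum)
{
  double t0 = now_seconds();
  for (size_t i = 0; i < n; i++) {
    dst[i] = float(src[i]); /* dummy stand-in for the conversion under test */
  }
  double t1 = now_seconds(); /* stop timing BEFORE the checksum loop */
  uint64_t sum = 0;
  for (size_t i = 0; i < n; i++) {
    sum += uint64_t(dst[i]); /* keeps the result alive; excluded from timing */
  }
  *r_sum = sum;
  return t1 - t0;
}
```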
Sergey Sharybin approved these changes 2024-09-20 15:36:32 +02:00
Sergey Sharybin left a comment
Owner

Personally I'd put the end timing before the loop which sums the elements in the result. But if you have stronger feelings about it, just stick to the current code.
The rest seems fine, so marking it as green so it can go in without extra review iterations.
Aras Pranckevicius added 1 commit 2024-09-22 14:54:33 +02:00
Merge branch 'main' into fp16_conv_batch
a40d8c5dae
# Conflicts:
#	source/blender/blenlib/tests/BLI_math_half_test.cc
Author
Member

@blender-bot build

Aras Pranckevicius merged commit c6f5c89669 into main 2024-09-22 17:40:13 +02:00
Aras Pranckevicius deleted branch fp16_conv_batch 2024-09-22 17:40:16 +02:00
Reference: blender/blender#127838