This extracts the inner loops into a separate function. There are two main reasons for this:
* It allows using `__restrict` to indicate that no other parameter aliases the output array, which enables better optimization.
* It makes the generated assembly code easier to find, especially when the function is marked with `BLI_NOINLINE`.
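
A minimal sketch of the pattern, not the actual code from this change. The names `add_arrays`, `add_arrays_inner`, and the `EXAMPLE_NOINLINE` macro are hypothetical; the macro is only a stand-in for `BLI_NOINLINE` so the example compiles on its own.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for `BLI_NOINLINE`; this local macro is an assumption made so the
// sketch is self-contained.
#if defined(_MSC_VER)
#  define EXAMPLE_NOINLINE __declspec(noinline)
#else
#  define EXAMPLE_NOINLINE __attribute__((noinline))
#endif

// Hypothetical extracted inner loop. `__restrict` on `dst` promises the compiler
// that no other pointer parameter overlaps the output array, so it does not have
// to assume that a store to `dst` changes `a` or `b`.
EXAMPLE_NOINLINE static void add_arrays_inner(const float *a,
                                              const float *b,
                                              float *__restrict dst,
                                              const std::size_t size)
{
  for (std::size_t i = 0; i < size; i++) {
    dst[i] = a[i] + b[i];
  }
}

// Hypothetical caller: setup stays here, the hot loop lives in the function above.
std::vector<float> add_arrays(const std::vector<float> &a, const std::vector<float> &b)
{
  const std::size_t size = a.size() < b.size() ? a.size() : b.size();
  std::vector<float> result(size);
  add_arrays_inner(a.data(), b.data(), result.data(), size);
  return result;
}
```

Because the inner function is not inlined, its loop should appear under its own symbol (`add_arrays_inner` in this sketch) when inspecting the compiler output, instead of being merged into the caller.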