This implements two optimizations:
* If the duplication count is constant, the offsets array can be filled
  directly in parallel.
* Otherwise, extracting the counts from the virtual array is parallelized.
  A serial loop over all elements is still needed at the end to accumulate
  the counts into offsets.
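
A minimal sketch of the two cases, for illustration only and not the actual patch. `CountsVArray`, `offsets_from_constant` and `offsets_from_varray` are hypothetical names, the virtual array is modeled as a plain callback, and an OpenMP pragma stands in for whatever parallel loop the real code uses (without `-fopenmp` the pragma is ignored and the loops simply run serially).

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for a virtual array: a size plus a per-index getter.
struct CountsVArray {
  std::int64_t size;
  std::function<int(std::int64_t)> get;
};

// Constant count: offsets[i] = i * count. Every entry is independent of the
// others, so the whole array can be filled in parallel.
std::vector<std::int64_t> offsets_from_constant(const std::int64_t size, const int count)
{
  assert(size >= 0 && count >= 0);
  std::vector<std::int64_t> offsets(size + 1);
#pragma omp parallel for
  for (std::int64_t i = 0; i <= size; i++) {
    offsets[i] = i * count;
  }
  return offsets;
}

// Varying counts: reading the counts is parallel, accumulating them is not.
std::vector<std::int64_t> offsets_from_varray(const CountsVArray &counts)
{
  assert(counts.size >= 0);
  std::vector<std::int64_t> offsets(counts.size + 1);

  // Parallel part: evaluate the (possibly expensive) getter for each index.
#pragma omp parallel for
  for (std::int64_t i = 0; i < counts.size; i++) {
    offsets[i] = counts.get(i);
  }

  // Serial part: turn the counts into offsets in place (exclusive prefix sum).
  // Each entry depends on the sum of all previous counts.
  std::int64_t total = 0;
  for (std::int64_t i = 0; i < counts.size; i++) {
    const std::int64_t count = offsets[i];
    offsets[i] = total;
    total += count;
  }
  offsets[counts.size] = total;
  return offsets;
}
```

In both cases the result has `size + 1` entries, so element `i` owns the range `[offsets[i], offsets[i + 1])` and the last entry is the total number of duplicates.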