The GPU kernel needs to use atomics for accumulation since all offsets are processed in parallel, but on CPUs that's not the case, so we can disable them there for a considerable speedup.
14 KiB
14 KiB