Detected when testing mr_elephant on an Intel HD520. When copying
the velocity buffer using the copy shader, the number of scheduled
workgroups could exceed the maximum supported by the device.
This PR fixes this by splitting the copy pass into multiple smaller
passes, so the whole velocity buffer is still copied.
NOTE: I didn't go for the approach of adding a new workgroup dimension,
as that would lead to more overhead when using many smaller meshes. I
would assume these devices are more often used with scenes with
smaller geometry.
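
For illustration only, the splitting described above can be sketched
as follows. This is not the code from this PR: the function name, its
parameters and the numbers in the usage note are hypothetical, and it
only models the dispatch arithmetic, assuming a one-dimensional
dispatch where every workgroup copies group_size elements.

```python
from typing import Iterator, Tuple


def split_copy_dispatches(
    element_count: int, group_size: int, max_group_count: int
) -> Iterator[Tuple[int, int]]:
    """Yield (start_element, group_count) for each copy pass.

    Hypothetical helper: no pass schedules more than max_group_count
    workgroups, and the passes together cover every element.
    """
    if element_count < 0 or group_size <= 0 or max_group_count <= 0:
        raise ValueError("counts must be positive (element_count may be 0)")

    # Number of workgroups needed for the whole buffer (ceiling division).
    total_groups = -(-element_count // group_size)

    first_group = 0
    while first_group < total_groups:
        # Clamp this pass to the device limit.
        group_count = min(max_group_count, total_groups - first_group)
        # The start offset tells the shader where this pass begins copying.
        yield first_group * group_size, group_count
        first_group += group_count
```

With 1000 elements, a group size of 64 and an assumed limit of 4
workgroups per dispatch, this yields four passes of 4 workgroups
starting at elements 0, 256, 512 and 768. A buffer that fits within
the limit still results in a single pass, so small meshes pay no
extra cost.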