It's pretty simple, but threading it and making it write out a whole
pixel at a time (instead of one byte at a time) still makes it faster.
Timings at 4K resolution, five Color strips blended over each other,
playback on Windows/VS2022, Ryzen 5950X:
- Playback: 9.2 FPS -> 11.5 FPS
- do_solid_color for one effect, median time: 7.7 ms -> 3.8 ms
Additionally, on byte output the solid color effect was not doing
float->byte color rounding & clamping properly, and on float output it
was writing 255.0 into alpha instead of 1.0. Fix that too.
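
For illustration, a minimal sketch of the two correctness fixes and of the whole-pixel write. This is not the actual patch: `unit_float_to_byte`, `fill_solid_byte`, `fill_solid_float` and the pixel types are hypothetical names, and the threading part is left out.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

/* Hypothetical pixel types, four channels each (RGBA). */
using BytePixel = std::array<uint8_t, 4>;
using FloatPixel = std::array<float, 4>;

/* Hypothetical helper: convert a float channel to a byte,
 * clamping to [0, 1] first and rounding to nearest. */
static uint8_t unit_float_to_byte(float v)
{
  if (!(v > 0.0f)) {
    return 0; /* Negative values, zero and NaN. */
  }
  if (v >= 1.0f) {
    return 255;
  }
  return static_cast<uint8_t>(v * 255.0f + 0.5f);
}

/* Byte output: build the pixel once, then store whole pixels
 * instead of writing the four channels one byte at a time. */
static std::vector<BytePixel> fill_solid_byte(const float rgb[3], std::size_t pixel_count)
{
  const BytePixel pixel = {unit_float_to_byte(rgb[0]),
                           unit_float_to_byte(rgb[1]),
                           unit_float_to_byte(rgb[2]),
                           255};
  return std::vector<BytePixel>(pixel_count, pixel);
}

/* Float output: opaque alpha is 1.0, not 255.0. */
static std::vector<FloatPixel> fill_solid_float(const float rgb[3], std::size_t pixel_count)
{
  const FloatPixel pixel = {rgb[0], rgb[1], rgb[2], 1.0f};
  return std::vector<FloatPixel>(pixel_count, pixel);
}
```

With this sketch, a channel value of 0.5 maps to byte 128, and out-of-range values such as 1.5 or -0.2 map to 255 and 0.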