In a test producing 10 million vertices I observed a 3.6x improvement, from 470ms to 130ms. The largest improvement comes from calculating each mesh array on a separate thread. Besides that, the larger changes come from splitting the filling of corner and face arrays, and precalculating sines and cosines for each ring. Using `parallel_invoke` does gives some overhead. On a small 32x16 input, the time went up from 51us to 74us. It could be disabled for small outputs in the future. The reasoning for this parallelization method instead of more standard data-size-based parallelism is that the latter wouldn't be helpful except for very high resolution.