This time, with have over 300% speedup!
But no, this is not due to switch to BLI_task (which 'only' gives usal 15% speedup),
but to enhancement of the algorithm, flatten loop over covariance matrix items now allows
to compute (usually) all items in parallel, instead of having at most 3 or 4 working threads
(with unbalanced load even)...