This speeds up the node by about 20% in common cases, e.g. when only the X axis is used. The main gain comes from no longer writing to output memory that is never read afterwards. A hand-written "optimal" loop that only extracts the X axis was not faster for me, which indicates that the node is bottlenecked by memory bandwidth. That seems reasonable.
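
For illustration only, here is a minimal sketch of the idea and not the actual patch. The names `Vec3` and `separate_xyz` and the convention that a null pointer means "output unused" are assumptions made up for this example. The loop skips the stores for any output nobody reads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vector type; the real node's types may differ.
struct Vec3 {
  float x, y, z;
};

// Hypothetical stand-in for the node's inner loop.
// A null output pointer means that output is not used downstream,
// so nothing is written for it.
void separate_xyz(const std::vector<Vec3> &in, float *out_x, float *out_y, float *out_z)
{
  // At least one output should be requested, otherwise there is no work to do.
  assert(out_x != nullptr || out_y != nullptr || out_z != nullptr);

  const std::size_t n = in.size();
  for (std::size_t i = 0; i < n; i++) {
    const Vec3 &v = in[i];
    if (out_x != nullptr) {
      out_x[i] = v.x;
    }
    if (out_y != nullptr) {
      out_y[i] = v.y;
    }
    if (out_z != nullptr) {
      out_z[i] = v.z;
    }
  }
}
```

With only `out_x` requested, this sketch reads 12 bytes and writes 4 bytes per element instead of writing 12. It still reads every full vector, which fits the observation that a dedicated X-only loop was not faster.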