FBX Export: Base patch for numpy speedup #104447

Merged
Bastien Montagne merged 1 commit from Mysteryem/blender-addons:fbx_numpy_base_patch_pr into main 2023-02-28 18:03:14 +01:00
Member

This is a base patch that all other separate numpy patches to the FBX exporter rely on.

Add support for writing bytes from numpy arrays, alongside the already supported Python arrays.

Add numpy and helper function imports to fbx_utils.py and export_fbx_bin.py to simplify subsequent patches.

Add an `astype_view_signedness` utility function that views unsigned integer data as signed when the itemsizes match, avoiding the unnecessary array copies that `numpy.ndarray.astype(new_type, copy=False)` would otherwise make.
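For illustration, here is a minimal sketch of the idea; the function name and exact behaviour below are assumptions for the example, not the patch's actual helper:

```python
import numpy as np

def astype_view_signedness_sketch(arr, new_dtype):
    """Return arr with dtype new_dtype, reinterpreting unsigned integer data
    as signed via a zero-copy view when the itemsizes match."""
    new_dtype = np.dtype(new_dtype)
    if (arr.dtype.kind == 'u' and new_dtype.kind == 'i'
            and arr.dtype.itemsize == new_dtype.itemsize):
        # Same size in bytes, so the data can simply be reinterpreted.
        return arr.view(new_dtype)
    # Otherwise fall back to astype, which copies whenever the dtypes differ.
    return arr.astype(new_dtype, copy=False)

indices = np.arange(10, dtype=np.uint32)
signed = astype_view_signedness_sketch(indices, np.int32)  # no copy made
```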

Add numpy versions of the `vcos_transformed_gen` and `nors_transformed_gen` mesh transform helpers. 4d output is supported, following the comments in `nors_transformed_gen`, though it remains unused.
Given tests of 1000 to 200000 vectors:
The most common case, where the matrix is None (the exporter's `bake_space_transform` option is disabled, which is the default), is ~44-105 times faster.
When `bake_space_transform` is enabled, `geom_mat_co` is usually a matrix containing only scaling; this case is ~14-65 times faster.
When `bake_space_transform` is enabled, `geom_mat_no` is usually the identity matrix; this case is ~18-170 times faster.
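As a rough illustration, a numpy version of the vertex-coordinate transform could look something like the sketch below; the function name, the (4, 4) numpy matrix argument, and the handling of the None case are my assumptions for the example, not the patch's exact code:

```python
import numpy as np

def vcos_transformed_sketch(cos, mat4=None):
    """Transform an (N, 3) array of vertex coordinates by a 4x4 matrix."""
    if mat4 is None:
        # The common case: bake_space_transform is disabled, no transform needed.
        return cos
    # Rotate/scale with the 3x3 block, then add the translation column.
    return cos @ mat4[:3, :3].T + mat4[:3, 3]

cos = np.random.rand(200000, 3)
scale_only = np.diag((1.0, 1.0, -1.0, 1.0))  # e.g. a scaling-only matrix
transformed = vcos_transformed_sketch(cos, scale_only)
```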

Add helper functions for performing faster uniqueness along the first axis when a sorted result is not needed. The sorting inside `numpy.unique` is often what takes the most time in the subsequent patches. When the second axis of an array has more than one element, these helpers can run `numpy.unique` many times faster because each row is viewed as a single element of a larger dtype, so only one sort is needed.
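A sketch of the row-as-one-element trick, under the assumption of a C-contiguous 2d input; this illustrates the technique rather than the exporter's actual helper:

```python
import numpy as np

def unique_rows_unsorted_sketch(arr):
    """Find unique rows of a 2D array by viewing each row as one void element,
    so numpy.unique only has to perform a single sort."""
    arr = np.ascontiguousarray(arr)
    # One void element spans all of a row's bytes.
    row_dtype = np.dtype((np.void, arr.dtype.itemsize * arr.shape[1]))
    row_view = arr.view(row_dtype).ravel()
    unique_rows = np.unique(row_view)
    # Reinterpret the unique void elements back as rows of the original dtype.
    return unique_rows.view(arr.dtype).reshape(-1, arr.shape[1])

data = np.array([[1, 2], [3, 4], [1, 2]], dtype=np.int32)
print(unique_rows_unsorted_sketch(data))  # [[1 2] [3 4]], compared byte-wise
```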

This patch on its own makes no change to exported files.


Roughly this time last year I was trying to export, as FBX, a .blend file that I didn't think was particularly big or complex, but found it took around 30 seconds to export, which seemed rather long. Since then I've been working on and off on speeding up the exporter. Initially I tried to optimise the existing loops, but it soon became clear that more significant speedups would be achievable by using numpy.

A lot of the performance cost of the FBX export code comes from iterating through and processing every vertex/edge/etc. individually in Python code. By rewriting the iterated code to use numpy vectorized functions where possible, the iterations can be run in optimised C code instead. To facilitate the use of numpy arrays throughout the exporter, this patch adds support for exporting bytes from numpy arrays.

An additional, surprising performance cost came from the `bpy_prop_collection.foreach_get` function. When the second argument is a Python object that implements the buffer interface and the buffer's datatype matches the C datatype of the property being accessed, a direct `memcpy` into the buffer is performed, which is very fast. If the datatype doesn't match, the data is iterated and each element is cast to the buffer's type.
For some reason, `foreach_get` is quite slow when it has to cast the elements. Passing in a buffer matching the C datatype and then casting the entire buffer with numpy is about 20-30 times faster for large arrays. From my testing, when the C datatype doesn't match the desired array type, it's actually slightly faster to pass a Python list into `foreach_get` and then build a new Python array (or a numpy array via `np.fromiter`) of the desired type from that list than it is to pass an array of the desired type directly into `foreach_get`.
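For example, the pattern the later patches use looks roughly like the sketch below (run inside Blender with a mesh object active; assuming, as is the case for vertex coordinates, that the C data is single-precision float while the exporter wants doubles — this is illustrative, not the patches' exact code):

```python
import bpy
import numpy as np

mesh = bpy.context.object.data  # assumes the active object is a mesh

# The buffer's dtype matches the C data (float), so foreach_get does a fast memcpy.
co_f32 = np.empty(len(mesh.vertices) * 3, dtype=np.float32)
mesh.vertices.foreach_get("co", co_f32)

# Cast the whole buffer to the type the exporter wants in one vectorized step.
co_f64 = co_f32.astype(np.float64)
```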

This patch doesn't contain any code that uses `bpy_prop_collection.foreach_get`, but it's a common feature of the subsequent patches, so I've gone into more detail here. For reference, the C function is `foreach_getset` in bpy_rna.c: https://projects.blender.org/blender/blender/src/commit/4675ee3c7342c151311a1a2d74acacecd64c4545/source/blender/python/intern/bpy_rna.c#L5319

I couldn't find a way to inspect properties at runtime through Blender's Python API to determine their C type (the hard max/min values do not always match the min/max of the C types), so for the most part I relied on a custom Blender build I managed to cobble together that prints a warning whenever the buffer passed into `foreach_get`/`set` doesn't match the type of the C data being accessed.

The helper functions for performing fast uniqueness along the first axis could definitely do with being scrutinised, since viewing floats as types that compare by bytes is a bit hacky. I think I've covered the main issue of -0.0 and 0.0 having different byte representations, and I think it's acceptable that different NaNs are not collapsed into a single value, though it wouldn't take much to modify the code to do so.
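To make the -0.0/0.0 concern concrete (the `+ 0.0` normalisation shown is just one possible approach, not necessarily what the patch does):

```python
import numpy as np

# -0.0 and 0.0 compare equal as floats but have different byte patterns,
# so a bytes-based uniqueness pass would otherwise keep both.
a = np.array([0.0, -0.0], dtype=np.float32)
print(np.unique(a).size)                 # 1: float comparison collapses them
print(np.unique(a.view(np.int32)).size)  # 2: the byte patterns differ

# Adding 0.0 maps every -0.0 to +0.0 under IEEE 754 rules.
a = a + 0.0
print(np.unique(a.view(np.int32)).size)  # 1
```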

I have not been able to test the code in encode_bin.py on a big-endian system; I've only simulated the behaviour on a copy of the code with `_IS_BIG_ENDIAN` replaced with True.
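For context, this is roughly the kind of handling involved (my own approximation for illustration, not the actual encode_bin.py code):

```python
import sys
import numpy as np

_IS_BIG_ENDIAN = sys.byteorder != 'little'

def numpy_array_to_fbx_bytes_sketch(arr):
    """Serialise a numpy array for the little-endian FBX binary format."""
    if _IS_BIG_ENDIAN:
        # byteswap() returns a copy with each element's bytes reversed, so
        # native big-endian data gets written out little-endian.
        arr = arr.byteswap()
    return arr.tobytes()

assert len(numpy_array_to_fbx_bytes_sketch(np.arange(4, dtype=np.int32))) == 16
```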

Thomas Barlow requested review from Bastien Montagne 2023-02-28 01:20:39 +01:00
Bastien Montagne approved these changes 2023-02-28 18:02:12 +01:00
Bastien Montagne left a comment
Owner

LGTM. I think I will first commit this, then you can rebase the `vcos` PR on main and commit that one. Then I would let a few extra days of initial validation pass before merging the other PRs.

Also kudos for the detailed doc in code, always great to see!

> The helper functions for performing fast uniqueness along the first axis could definitely do with being scrutinised since viewing floats as types that compare by bytes is a bit hacky. I think I've covered the main issue of -0.0 and 0.0 having different representations as bytes and I think it's ok to not care that different NaNs are not collapsed into a single value, though it wouldn't be too much of an issue to modify the code to collapse the different NaNs into one.

Those sound like valid assumptions to me; NaN values in Blender data should be considered invalid anyway...

Bastien Montagne force-pushed fbx_numpy_base_patch_pr from 323182c873 to 388f48cb09 2023-02-28 18:02:37 +01:00
Bastien Montagne merged commit 994c4d9175 into main 2023-02-28 18:03:14 +01:00