FBX Export: Base patch for numpy speedup #104447

Merged
Bastien Montagne merged 1 commit from Mysteryem/blender-addons:fbx_numpy_base_patch_pr into main 2023-02-28 18:03:14 +01:00

This is a base patch that all other separate numpy patches to the FBX exporter rely on.

Add support for writing bytes from numpy arrays like the already supported Python arrays.
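The general idea for the numpy case is roughly the following (a minimal sketch, not the actual encode_bin.py code; FBX binary files store array data little-endian, hence the byteswap on big-endian hosts):

```python
import sys
import numpy as np

_IS_BIG_ENDIAN = sys.byteorder != "little"

def numpy_to_fbx_bytes_sketch(a):
    # FBX binary data is little-endian on disk, so on a big-endian host
    # the element bytes need swapping before serialising.
    if _IS_BIG_ENDIAN:
        a = a.byteswap()
    return a.tobytes()

print(numpy_to_fbx_bytes_sketch(np.arange(3, dtype=np.int32)))
```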

Add numpy and helper function imports to fbx_utils.py and export_fbx_bin.py to simplify subsequent patches.

Add an astype_view_signedness utility function for viewing unsigned integer data as signed only when the itemsizes match, avoiding the unnecessary copies that `numpy.ndarray.astype(new_type, copy=False)` would make when only the signedness changes.
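In essence (a hypothetical sketch, not the helper's actual code):

```python
import numpy as np

def astype_view_signedness_sketch(a, new_dtype):
    new_dtype = np.dtype(new_dtype)
    # When only the signedness differs and the itemsizes match, the data can
    # be reinterpreted with a zero-copy view.
    if (a.dtype.kind in "iu" and new_dtype.kind in "iu"
            and a.dtype.itemsize == new_dtype.itemsize):
        return a.view(new_dtype)
    # Otherwise fall back to astype, which may have to copy.
    return a.astype(new_dtype, copy=False)
```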

Add numpy versions of the vcos_transformed_gen and nors_transformed_gen mesh transform helpers. 4d output is supported, following comments in nors_transformed_gen, though it remains unused.
Given tests of 1000 to 200000 vectors:
- The most common use case is where the matrix is None (when the exporter's bake_space_transform option is disabled, which is the default); this is ~44-105 times faster.
- When bake_space_transform is enabled, geom_mat_co is usually a matrix containing only scaling; this is ~14-65 times faster.
- When bake_space_transform is enabled, geom_mat_no is usually the identity matrix; this is ~18-170 times faster.
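For illustration, the vectorised transform is roughly the following (a simplified sketch with illustrative names, not the exact helper; `matrix` stands in for e.g. geom_mat_co as a 4x4 mathutils.Matrix):

```python
import numpy as np

def vcos_transformed_sketch(cos, matrix=None, dtype=np.float64):
    """Transform an (n, 3) array of coordinates by a 4x4 matrix in one go."""
    cos = np.asarray(cos, dtype=dtype)
    if matrix is None:
        # The common default case (bake_space_transform disabled): no transform.
        return cos
    m = np.array(matrix, dtype=dtype)
    # Apply rotation/scale to every row at once, then add the translation.
    return cos @ m[:3, :3].T + m[:3, 3]
```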

Add helper functions for performing faster uniqueness along the first axis when a sorted result is not needed. The sorting part of numpy.unique is often what takes the most time in the subsequent patches, and these helpers can run numpy.unique many times faster when the second axis of an array has more than one element, because each row is treated as a single element of a larger dtype, meaning only one sort is needed.
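The approach is roughly the following (a simplified sketch, not the exact helper; it assumes a 2d C-contiguous input, and float inputs would first need -0.0 normalised to 0.0, e.g. by adding 0.0, because rows are compared by their raw bytes):

```python
import numpy as np

def unique_rows_unsorted_sketch(a):
    """Return the unique rows of a 2d array; the result is not sorted by value."""
    a = np.ascontiguousarray(a)
    # View each row as one element of a single larger (void) dtype spanning the
    # whole row, so np.unique performs one sort over n elements rather than
    # comparing column by column.
    row_dtype = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    unique_rows = np.unique(a.view(row_dtype).ravel())
    # View the result back as the original dtype and restore the row shape.
    return unique_rows.view(a.dtype).reshape(-1, a.shape[1])
```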

This patch on its own makes no change to exported files.


Roughly this time last year I was trying to export a .blend file to FBX that I didn't think was that big or complex, but found it was taking 30 seconds to export, which seemed rather long. Since then I've been working on and off on speeding up the exporter. Initially I started by trying to optimise the existing loops, but it soon became clear that more significant speedups would be achievable by using numpy.

A lot of the performance cost of the FBX export code comes from iterating through and processing every vertex/edge/etc. individually in Python code. By rewriting the iterated code to use numpy vectorized functions where possible, the iterations can be run in optimised C code instead. To facilitate the use of numpy arrays throughout the exporter, this patch adds support for exporting bytes from numpy arrays.

An additional, surprising performance cost came from the `bpy_prop_collection.foreach_get` function. When its second argument is a Python object that implements the buffer interface and the buffer's datatype matches the C datatype of the property being accessed, a direct `memcpy` into the buffer is performed, which is very fast. If the datatypes don't match, however, the data is iterated and each element is cast to the buffer's type individually.
For some reason, foreach_get is quite slow when it has to cast the elements. Passing in a buffer matching the C datatype and then casting the entire buffer with numpy is about 20-30 times faster for large arrays. From my testing, when the C datatype doesn't match the desired array type, it's actually even slightly faster to pass a Python list into foreach_get and then build a new Python array (or a numpy array using `np.fromiter`) of the desired type from that list than it is to pass an array of the desired type directly into foreach_get.
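As a small illustration of the pattern the later patches use (a sketch only; `mesh` is assumed to be a bpy.types.Mesh, and vertex coordinates are stored as single-precision C floats):

```python
import numpy as np

def mesh_vertex_cos_sketch(mesh):
    """Fetch all vertex coordinates as an (n, 3) float64 array."""
    num_verts = len(mesh.vertices)
    # A float32 buffer matches the C type of MeshVertex.co, so foreach_get
    # can memcpy straight into it instead of casting element by element.
    cos = np.empty(num_verts * 3, dtype=np.float32)
    mesh.vertices.foreach_get("co", cos)
    # Casting the whole buffer afterwards with numpy is far cheaper than
    # letting foreach_get do the per-element casts.
    return cos.astype(np.float64).reshape(num_verts, 3)
```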

This patch doesn't contain any code that uses `bpy_prop_collection.foreach_get`, but it's a common feature of the subsequent patches, so I've gone into more detail here. For reference, the C function is `foreach_getset` in bpy_rna.c: https://projects.blender.org/blender/blender/src/commit/4675ee3c7342c151311a1a2d74acacecd64c4545/source/blender/python/intern/bpy_rna.c#L5319

static PyObject *foreach_getset(BPy_PropertyRNA *self, PyObject *args, int set)

I couldn't find a way to inspect properties at runtime through Blender's Python API to determine their C type (the hard max/min values do not always match the min/max of the C types), so for the most part I relied on a custom Blender build I managed to cobble together that prints a warning whenever the buffer passed into foreach_get/set doesn't match the type of the C data being accessed.

The helper functions for performing fast uniqueness along the first axis could definitely do with being scrutinised, since viewing floats as types that compare by bytes is a bit hacky. I think I've covered the main issue of -0.0 and 0.0 having different byte representations, and I think it's acceptable not to care that different NaNs are not collapsed into a single value, though it wouldn't be much trouble to modify the code to collapse them into one.

I have not been able to test the code in encode_bin.py on a big-endian system; I have only simulated the behaviour on a copy of the code with `_IS_BIG_ENDIAN` replaced with True.

Thomas Barlow requested review from Bastien Montagne 2023-02-28 01:20:39 +01:00
Bastien Montagne approved these changes 2023-02-28 18:02:12 +01:00
Bastien Montagne left a comment
Owner

LGTM. Think I will first commit this, then you can rebase the vcos PR on main, and commit that one. Then would let a few extra days of initial validation before merging the other PRs.

Also kudos for the detailed doc in code, always great to see!

> The helper functions for performing fast uniqueness along the first axis could definitely do with being scrutinised since viewing floats as types that compare by bytes is a bit hacky. I think I've covered the main issue of -0.0 and 0.0 having different representations as bytes and I think it's ok to not care that different NaNs are not collapsed into a single value, though it wouldn't be too much of an issue to modify the code to collapse the different NaNs into one.

Those sound like valid assumptions to me; NaN values in Blender data should be considered invalid anyway...

Bastien Montagne force-pushed fbx_numpy_base_patch_pr from 323182c873 to 388f48cb09 2023-02-28 18:02:37 +01:00
Bastien Montagne merged commit 994c4d9175 into main 2023-02-28 18:03:14 +01:00