Improved Surface Deform performance #116545
No reviewers
Labels
No Label
Interest
Alembic
Interest
Animation & Rigging
Interest
Asset Browser
Interest
Asset Browser Project Overview
Interest
Audio
Interest
Automated Testing
Interest
Blender Asset Bundle
Interest
BlendFile
Interest
Collada
Interest
Compatibility
Interest
Compositing
Interest
Core
Interest
Cycles
Interest
Dependency Graph
Interest
Development Management
Interest
EEVEE
Interest
EEVEE & Viewport
Interest
Freestyle
Interest
Geometry Nodes
Interest
Grease Pencil
Interest
ID Management
Interest
Images & Movies
Interest
Import Export
Interest
Line Art
Interest
Masking
Interest
Metal
Interest
Modeling
Interest
Modifiers
Interest
Motion Tracking
Interest
Nodes & Physics
Interest
OpenGL
Interest
Overlay
Interest
Overrides
Interest
Performance
Interest
Physics
Interest
Pipeline, Assets & IO
Interest
Platforms, Builds & Tests
Interest
Python API
Interest
Render & Cycles
Interest
Render Pipeline
Interest
Sculpt, Paint & Texture
Interest
Text Editor
Interest
Translations
Interest
Triaging
Interest
Undo
Interest
USD
Interest
User Interface
Interest
UV Editing
Interest
VFX & Video
Interest
Video Sequencer
Interest
Virtual Reality
Interest
Vulkan
Interest
Wayland
Interest
Workbench
Interest: X11
Legacy
Blender 2.8 Project
Legacy
Milestone 1: Basic, Local Asset Browser
Legacy
OpenGL Error
Meta
Good First Issue
Meta
Papercut
Meta
Retrospective
Meta
Security
Module
Animation & Rigging
Module
Core
Module
Development Management
Module
EEVEE & Viewport
Module
Grease Pencil
Module
Modeling
Module
Nodes & Physics
Module
Pipeline, Assets & IO
Module
Platforms, Builds & Tests
Module
Python API
Module
Render & Cycles
Module
Sculpt, Paint & Texture
Module
Triaging
Module
User Interface
Module
VFX & Video
Platform
FreeBSD
Platform
Linux
Platform
macOS
Platform
Windows
Priority
High
Priority
Low
Priority
Normal
Priority
Unbreak Now!
Status
Archived
Status
Confirmed
Status
Duplicate
Status
Needs Info from Developers
Status
Needs Information from User
Status
Needs Triage
Status
Resolved
Type
Bug
Type
Design
Type
Known Issue
Type
Patch
Type
Report
Type
To Do
No Milestone
No project
No Assignees
4 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: blender/blender#116545
Loading…
Reference in New Issue
No description provided.
Delete Branch "Eugene-Kuznetsov/blender:ek_surface_deform"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Open the attached blend file. Go into geonodes trees on both hair objects and click "Bake" in their respective bake nodes. Hit "play".
With unpatched Blender, I see around 1 fps (Threadripper 5955WX).
This patch takes me from 1 fps to 7 fps. (With 'viewport thinning' set to 0.1, the effect is even stronger, from the same 1 fps to 9-10 fps.)
25163d7d78
to429689b1bf
Thanks for this path to implement/optimize UV sampler.
Strongly sure this node can be improved much more if reimplement this as parallel logarithmic search tree, instead of current hash map (see: #106713). Or, at least, to built this has map more parallel/exact for strips representation.
Not sure about point of this PR about using 2d array, this looks even worst for some cases with sparse uv data (for memory usage ?).
Would be better to add more different tests with measured numbers and attached .blend files to simplify checking on different machines. Technical details also would be helpful in description.
Just added some code-style comments.
@ -720,3 +720,3 @@
GMutableSpan dst);
void calculate_basis(const float parameter, float4 &r_weights);
inline void calculate_basis(const float parameter, float4 &r_weights)
Weights can be returned value, if this is refactoring.
@ -1326,6 +1326,21 @@ MINLINE bool is_zero_v4_db(const double v[4])
return (v[0] == 0.0 && v[1] == 0.0 && v[2] == 0.0 && v[3] == 0.0);
}
MINLINE bool is_finite_v2(const float v[2])
Inlining is not the best thing to do. You should be able to split this in to separate PR, and measure performance impact (strong) to convince of the need for this.
@ -26,0 +24,4 @@
int2 resolution_;
int2 span_;
std::vector<std::vector<int> > corner_tris_by_cell_;
You should not use
std::vector
instead ofblender::Vector
without strong reason.You should not use
Vector<Vector<A>>
instead ofArray<int> offsets; Array<A> data; GroupedSpan<A> values(offsets, data);
without strong reason (see as examples: #110707. #115142, #111081, #114683).I've switched to blender::Vector. Array/GroupedSpan can't be used here.
@ -15,2 +14,3 @@
static int2 uv_to_cell_key(const float2 &uv, const float2 &uv_base, const int2 &resolution, int2 span)
{
return int2{uv * resolution};
auto key = int2{int((uv.x-uv_base.x)*resolution.x), int((uv.y-uv_base.y)*resolution.y)};
There is missed default formatting, run
make format
in root folder of your blender fork.@ -15,2 +15,3 @@
{
return int2{uv * resolution};
auto key = int2{int((uv.x-uv_base.x)*resolution.x), int((uv.y-uv_base.y)*resolution.y)};
return int2{std::max(0,std::min(span.x-1,key.x)), std::max(0,std::min(span.y-1,key.y))};
Magic numbers.
@ -23,0 +30,4 @@
min_uv = max_uv = uv_0;
int live_count = 0;
std::vector<uint8_t> mask(corner_tris.size());
uint8_t
->int8_t
(maybe evenbool
?)https://wiki.blender.org/wiki/Style_Guide/C_Cpp#Integer_Types
@ -30,3 +57,1 @@
const int2 key_0 = uv_to_cell_key(uv_0, resolution_);
const int2 key_1 = uv_to_cell_key(uv_1, resolution_);
const int2 key_2 = uv_to_cell_key(uv_2, resolution_);
resolution_.x = std::max<int>(1, mult / (max_uv[0]-min_uv[0] + 1e-6));
math::max<int>
The 2d array gives a good balance between construction time and sampling time. Notice that sampling is threaded but construction is not, and the sampler object is constructed every frame (actually, twice per frame for every Surface Deform node). Much of the performance gain comes from replacing the old complicated MultiMap structure with a simple array. The code is now at the point where ReverseUVSampler is not a significant bottleneck, and further optimizations are not going to yield as much. Reusing samplers would be a more promising direction.
I've benchmarked just the changes for barycentric_coords_v2 and is_finite_v2/v3/v4, I estimate that inlining results in a 10% to 15% reduction in execution time of ReverseUVSampler::sample_many.
Here's a blend file with two more test meshes. I see at least 5x speedup with 40k strands per mesh, it gets progressively smaller with more strands, it's as low as 10-20% with 2M strands/mesh. (At that point, the bottleneck is https://projects.blender.org/blender/blender/src/branch/main/source/blender/draw/intern/draw_cache_impl_curves.cc#L432-L436.)
429689b1bf
toc51a967a0a
It is hard to do benchmarks without some file with just one node. Usually, you should use
BLI_timeit.hh
header, instead of FPS, to do not measure impact of other system of blender.And file should be as simple as possible, to do not measure impact of all other nodes in geometry nodes setup.
I have simplify your file (few) by hiding all objects except horses. Here is my measurements:
Timer 'ReverseUVSampler::ReverseUVSampler': (Average: 182.04 ms, Min: 182.04 ms, Last: 182.04 ms)
Timer 'ReverseUVSampler::ReverseUVSampler': (Average: 157.20 ms, Min: 157.20 ms, Last: 157.20 ms)
Timer 'ReverseUVSampler::sample_many': (Average: 400 ns, Min: 400 ns, Last: 400 ns)
Timer 'ReverseUVSampler::sample_many': (Average: 300 ns, Min: 200 ns, Last: 200 ns)
Timer 'ReverseUVSampler::ReverseUVSampler': (Average: 168.13 ms, Min: 154.23 ms, Last: 154.23 ms)
Timer 'ReverseUVSampler::ReverseUVSampler': (Average: 173.34 ms, Min: 157.20 ms, Last: 189.49 ms)
Timer 'ReverseUVSampler::sample_many': (Average: 39.51 ms, Min: 200 ns, Last: 118.54 ms)
Timer 'ReverseUVSampler::sample_many': (Average: 49.56 ms, Min: 200 ns, Last: 79.69 ms)
Timer 'ReverseUVSampler::ReverseUVSampler': (Average: 243.12 ms, Min: 154.23 ms, Last: 393.11 ms)
Timer 'ReverseUVSampler::ReverseUVSampler': (Average: 217.38 ms, Min: 157.20 ms, Last: 305.46 ms)
Timer 'ReverseUVSampler::sample_many': (Average: 39.92 ms, Min: 200 ns, Last: 1.36 ms)
Timer 'ReverseUVSampler::sample_many': (Average: 33.52 ms, Min: 200 ns, Last: 1.55 ms)
Timer 'ReverseUVSampler::ReverseUVSampler': (Average: 5.07 ms, Min: 5.07 ms, Last: 5.07 ms)
Timer 'ReverseUVSampler::ReverseUVSampler': (Average: 5.73 ms, Min: 5.73 ms, Last: 5.73 ms)
Timer 'ReverseUVSampler::sample_many': (Average: 400 ns, Min: 400 ns, Last: 400 ns)
Timer 'ReverseUVSampler::sample_many': (Average: 300 ns, Min: 200 ns, Last: 200 ns)
Timer 'ReverseUVSampler::ReverseUVSampler': (Average: 2.73 ms, Min: 0.38 ms, Last: 0.38 ms)
Timer 'ReverseUVSampler::ReverseUVSampler': (Average: 3.11 ms, Min: 0.48 ms, Last: 0.48 ms)
Timer 'ReverseUVSampler::sample_many': (Average: 217.30 ms, Min: 200 ns, Last: 651.90 ms)
Timer 'ReverseUVSampler::sample_many': (Average: 281.37 ms, Min: 200 ns, Last: 473.56 ms)
Timer 'ReverseUVSampler::ReverseUVSampler': (Average: 1.88 ms, Min: 0.20 ms, Last: 0.20 ms)
Timer 'ReverseUVSampler::ReverseUVSampler': (Average: 2.12 ms, Min: 0.16 ms, Last: 0.16 ms)
Timer 'ReverseUVSampler::sample_many': (Average: 231.22 ms, Min: 200 ns, Last: 30.63 ms)
Timer 'ReverseUVSampler::sample_many': (Average: 198.93 ms, Min: 200 ns, Last: 37.49 ms)
As i can see, this PR change speed of both constructor and sample methods.
sample_many seems to be much slower on your end (what hardware is this?) Here's what I'm getting with your blend file:
main:
Timer 'ReverseUVSampler::ReverseUVSampler': (Average: 184.07 ms, Min: 0.44 ms, Last: 45.51 ms)
Timer 'ReverseUVSampler::ReverseUVSampler': (Average: 182.58 ms, Min: 0.44 ms, Last: 46.06 ms)
Timer 'ReverseUVSampler::sample_many': (Average: 1.06 ms, Min: 20 ns, Last: 60 ns)
Timer 'ReverseUVSampler::sample_many': (Average: 1.05 ms, Min: 20 ns, Last: 30 ns)
PR:
Timer 'ReverseUVSampler::ReverseUVSampler': (Average: 0.80 ms, Min: 94541 ns, Last: 0.12 ms)
Timer 'ReverseUVSampler::ReverseUVSampler': (Average: 0.79 ms, Min: 94541 ns, Last: 0.30 ms)
Timer 'ReverseUVSampler::sample_many': (Average: 5.34 ms, Min: 20 ns, Last: 6.23 ms)
Timer 'ReverseUVSampler::sample_many': (Average: 5.37 ms, Min: 20 ns, Last: 9.69 ms)
There is a magic number 0.25 here
c51a967a0a/source/blender/geometry/intern/reverse_uv_sampler.cc (L60)
that controls the balance between constructor and sampler. It probably needs to be tuned. Try to set it to 1.0.I'll try to benchmark on a couple more systems.
Also, in that blend file, most objects are disabled in viewport, but ReverseUVSampler still runs for all of them. That's why "Min", "Last" and "Average" times are all different - you're getting the "Average" for all objects, and the "Last" for one of several objects. And it's not even doing sample_all on the horse, because the modifier geonode that generates horse hair is disabled.
To get clean numbers, I had to do this:
38e34e82d6
bde05142c8
(PR branch)6e1e74ab08
(main branch)"Exclude from View Layer" all collections except Collection 3
reenable all modifiers on "Horse hair", click "Bake" on the bake node.
Improved Surface Deform performanceto WIP: Improved Surface Deform performancec51a967a0a
tode4c975949
de4c975949
tob8d176fb2d
b8d176fb2d
tof170c9cc88
f170c9cc88
to52f5d04fca
52f5d04fca
to4f91c770ce
b8a86c3eac
+39b4d939ea
Test system 1: AMD Threadripper Pro 5955WX, Radeon RX7900, Linux
Test system 2: AMD Ryzen 7 5800X, Radeon RX6900, Windows 10
Would you be interested in moving the
preserve
arguments for attribute access functions to a separate PR? Those seem easily separated from the other changes here, and I'd be interested how much of a difference they make.Done (#117179)
4f91c770ce
to6c11389b8a
WIP: Improved Surface Deform performanceto Improved Surface Deform performance6c11389b8a
to00a785c977
Comparing against 2024/03/04 trunk (which includes the improvements from #118772):
Playback:
Directly timing DeformCurvesOnSurface (second row #tris - original mesh, #tris - eval mesh, #curves)
Thanks for the updated comparison. It's nice that we can still get a good speedup. Unfortunately, this patch is somewhat hard to act on because it contains so many different changes (at least it feels like it) and the patch description also doesn't really say what is changed specifically.
For the reverse uv sampler specifically, I'd strongly recommend to not put the different phases into a single loop with many
if (phase == X)
checks. That's very hard to follow.00a785c977
toaa8cecf53b
Thanks. I've addressed the phases loop. One part of the PR is broken out as #117179, I am somewhat reluctant to spend any more time refactoring or documenting it, in view of this PR's age and history.
Yeah fair. I can understand that it's a lot of work. Thanks for working on this. I'm sure some of the optimizations will slowly trickle into Blender over time.
It's not that it is a lot of work, it's just I don't want to do all the requested changes and then watch the PR sit in limbo for another 3 months.
I am going to break it out further if and when #117179 gets merged.
Some changes in this PR already skeptical, one of the reason to select all of them is to be able to discuss about each of them.
aa8cecf53b
toc31d11220a
Checkout
From your project repository, check out a new branch and test the changes.