Sculpt: Improve Expand performance #120125
Improved performance of the geodesic expand operator initialization
Implemented a parallel Breadth-First Search to calculate surface
geodesics instead of a serial BFS. The performance is significantly
improved, and the difference becomes more evident the denser the mesh is.
Novel in this approach is a fast lock-free, wait-free parallel storage scheme using open addressing
with linear probing in a vector; it surpasses tbb::concurrent_vector's
performance, which matters since this storage is a critical component for performance.
Resolves #102586
For some numerical comparisons:
In monoHD.blend with 2016578 vertices on an MBA M1 with 8 GB RAM.
Initialization time:
- Before: 6.56252 seconds
- After: 2.63308 seconds
Performance now scales better with the number of cores and topological mesh density.
I will soon post comparison numbers from before and after the change.
Changed title from WIP: Fix ##102586 Sculpt Mode: Improve Expand performance to WIP: Fix #102586 Sculpt Mode: Improve Expand performance
Changed title from WIP: Fix #102586 Sculpt Mode: Improve Expand performance to Fix #102586 Sculpt Mode: Improve Expand performance
Disregard the many intermediate commits I've made; most are partial cloud backups and provisional approaches that will be replaced later, but I wanted to keep a history of them.
This will be the first of a series of partial PRs to improve the performance of the different algorithms in the expand operator, to keep each PR small and manageable.
The expand operator will also likely require a deeper refactor to be truly scalable with mesh size, with a more dynamic or lazy initialization that avoids calculating a field value for every vertex even when unused. Adding multithreading to the current implementation is a step in the right direction for squeezing out more performance now.
@ -1497,10 +1480,12 @@ static void sculpt_expand_update_for_vertex(bContext *C, Object *ob, const PBVHV
sculpt_expand_mask_update_task(ss, mask_write, expand_cache->nodes[i]);
}
});
SCULPT_flush_update_step(C, SCULPT_UPDATE_MASK);
Unrelated change
@ -2140,6 +2124,11 @@ static int sculpt_expand_invoke(bContext *C, wmOperator *op, const wmEvent *even
if (!BKE_base_is_visible(v3d, base)) {
return OPERATOR_CANCELLED;
}
/* Do nothing when the mesh has 0 vertices. */
Unrelated change?
@ -2222,7 +2204,6 @@ static int sculpt_expand_invoke(bContext *C, wmOperator *op, const wmEvent *even
ss->expand_cache, ob, ss->expand_cache->initial_active_vertex, falloff_type);
sculpt_expand_check_topology_islands(ob, falloff_type);
Unrelated change
@ -8,6 +8,7 @@
#include <cmath>
#include <cstdlib>
#include <thread>
Unnecessary include?
@ -77,6 +78,7 @@ static bool sculpt_geodesic_mesh_test_dist_add(Span<float3> vert_positions,
return false;
}
#define UNASSIGNED -1
Typically it's preferred to use const int declared inside the function or so.
Sorry, but now this line is a different issue; now this is UB.
@ -146,0 +137,4 @@
threading::parallel_for(IndexRange(0, totvert), 4096, [&](IndexRange range) {
for (const int i : range) {
if (len_squared_v3v3(v_co, vert_positions[i]) <= limit_radius_sq) {
affected_vert[i].set();
Manipulation of a bitmap is not threadsafe because it's a read-modify-write operation. It should be converted to a boolean array, or you can use threading::parallel_for_aligned and bits::BitsPerInt to align the task boundaries on integer boundaries.
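A minimal sketch of the aligned variant being suggested, assuming the threading::parallel_for_aligned overload from BLI_task.hh and the bits::BitsPerInt constant; this is illustrative, not the patch:

/* Sketch only: with task boundaries aligned to bits::BitsPerInt, no two tasks
 * write into the same underlying integer of the bit vector, so the
 * read-modify-write of individual bits stays confined to one thread per word. */
threading::parallel_for_aligned(
    IndexRange(totvert), 4096, bits::BitsPerInt, [&](const IndexRange range) {
      for (const int i : range) {
        if (len_squared_v3v3(v_co, vert_positions[i]) <= limit_radius_sq) {
          affected_vert[i].set();
        }
      }
    });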
@ -147,0 +144,4 @@
}
}
std::vector<int> queue, queue_next;
In Blender code, Vector is preferred over std::vector. That's documented more in the headers and here: https://developer.blender.org/docs/features/core/blenlib/containers/
@ -147,0 +148,4 @@
queue.reserve(totedge);
/* Add edges adjacent to an initial vertex to the queue.
* Since initial vertex are few only, iterating over its neighbour edges
typo/grammar
Suggest "Since there are typically few initial vertices"
@ -147,0 +149,4 @@
/* Add edges adjacent to an initial vertex to the queue.
* Since initial vertex are few only, iterating over its neighbour edges
* instead of over all edges scales better as mesh edge count increases */
Missing period at the end
@ -147,0 +153,4 @@
GSetIterator gs_iter;
GSET_ITER (gs_iter, initial_verts) {
const int seed_vert = POINTER_AS_INT(BLI_gsetIterator_getKey(&gs_iter));
for (const int e : ss->vert_to_edge_map[seed_vert]) {
Typically in mesh code I'd prefer to see longer variable names like edge or edge_index.
@ -180,3 +191,3 @@
}
for (const int v_other : corner_verts.slice(faces[face])) {
if (ELEM(v_other, v1, v2)) {
if (ELEM(v_other, v1, v2) || !sculpt_geodesic_mesh_test_dist_add(
Keep these as separate if statements; it's more readable that way. It's really much simpler to review if the changes are kept as small as necessary to achieve the speedup.
@ -202,2 +199,2 @@
BLI_LINKSTACK_PUSH(queue_next, POINTER_FROM_INT(e_other));
}
for (const int e_other : ss->vert_to_edge_map[v_other]) {
const int ev_other = (edges[e_other][0] == v_other) ? edges[e_other][1] :
Use bke::mesh::edge_other_vert.
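A sketch of that suggestion, assuming the bke::mesh::edge_other_vert helper from BKE_mesh.hh:

/* Returns the vertex of the edge that is not v_other. */
const int ev_other = bke::mesh::edge_other_vert(edges[e_other], v_other);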
@ -204,0 +213,4 @@
* collission in hash tables can also be used to avoid
* parallel data storage collisions */
size_t idx = e_other % new_size;
while (queue_next[idx] != UNASSIGNED) {
I'm skeptical that this will give deterministic behavior. That's typically required here, otherwise things get very hard to debug.
For the initial falloff field creation (one falloff value per vertex), the order in which the value is calculated for each vertex does not matter; at the end of the algorithm, every vertex will have its falloff calculated. Whether it goes vertex[0], vertex[1], ..., vertex[n] or vertex[2], vertex[0], ..., vertex[1], vertex[n] is irrelevant here. It is deterministic in the sense that the results are the same regardless of the order of calculation. It is well worth the speed gain for the user; this code will be executed far more often than it will be debugged.
If geometry nodes lets you view randomly selected vertices (by using the vertex index), shouldn't this cause vertex flickering?
Just going to mark this as "not resolved" since there was no change here. Better that the conversation is visible for someone reading the diff.
I don't think so, this is an initialization step that happens only when the falloff field is created and is used only in the expand operator.
@ -303,3 +321,3 @@
}
} // namespace blender::ed::sculpt_paint::geodesic
} /* namespace blender::ed::sculpt_paint::geodesic */
Unrelated change
@ -1 +1 @@
Subproject commit 612e310bf7ee505a79dad7ea925916ec9a812995
Subproject commit ce850ef056c57d0e4127eda48a514b2c4a651435
Unrelated change
I don't know how that file got changed, maybe a merge from main?
Either way, it shouldn't be part of this PR.
I invoke the Git Gurus to help me get rid of this XD
fixed
Changed title from Fix #102586 Sculpt Mode: Improve Expand performance to Sculpt: Improve Expand performance
@ -1 +1 @@
Subproject commit 612e310bf7ee505a79dad7ea925916ec9a812995
Subproject commit 42979e6ec67f657c7efce9cb672fe07b3f8fff9d
Applied git reset origin/main tests/data but it still shows a file changed.
@ -610,3 +610,2 @@
const float *co = SCULPT_vertex_co_get(ss, symm_vertex);
for (int i = 0; i < totvert; i++) {
PBVHVertRef vertex = BKE_pbvh_index_to_vertex(ss->pbvh, i);
threading::parallel_for(IndexRange(0, totvert), 4096, [&](IndexRange range) {
IndexRange(0, totvert) -> IndexRange(totvert)
I think it would be a good idea to split this off into a separate PR, as it will make it easier to present and review the separate modes of the Expand operator.
For example, if this code is split off into a separate PR it gives an opportunity:
@ -79,14 +79,13 @@ static bool sculpt_geodesic_mesh_test_dist_add(Span<float3> vert_positions,
static float *geodesic_mesh_create(Object *ob, GSet *initial_verts, const float limit_radius)
{
const int UNASSIGNED = -1;
const int UNASSIGNED = -1; -> constexpr const int non_checked = -1;
@ -120,0 +111,4 @@
dists[i] = 0.0f;
}
else {
dists[i] = FLT_MAX;
FLT_MAX -> std::numeric_limits<float>::max()
@ -147,0 +143,4 @@
}
}
Vector<int> queue, queue_next;
Vector<int> queue, queue_next; -> Vector<int> queue;
@ -161,1 +167,3 @@
}
/* queue and queue_next should not differ by a huge amount of size
* since they are an advancing front in the mesh hence no need
* to allocate much more memmory and keep the iteration range smaller */
Vector<int> queue_next;
@ -167,1 +173,3 @@
int v2 = edges[e][1];
BitVector<> edge_tag(totedge);
while (!queue.is_empty()) {
threading::parallel_for_each(IndexRange(0, queue.size()), [&](const int val) {
const int val -> const int index_of_... (val -> value?)
Are you sure this loop should be parallel_for_each, not parallel_for?
@ -168,0 +174,4 @@
while (!queue.is_empty()) {
threading::parallel_for_each(IndexRange(0, queue.size()), [&](const int val) {
const int edge_index = queue[val];
int v1 = edges[edge_index][0];
int v1 = edges[edge_index][0]; -> int edge_vert_a = edges[edge_index][0];
This will add unnecessary verbosity in a context where it is clear what v1 and v2 mean.
@ -204,0 +216,4 @@
/* Open addressing with linear probing that minimizes
* collission in hash tables can also be used to avoid
* parallel data storage collisions */
size_t idx = edge_other % new_size;
Please just use blender::Set instead of all this C-like (usually unreadable) math. Or, at least, move this into a separate function; it is hard to dig into the logic of a hash table while checking a geometric algorithm. Also, still: size_t -> const int64_t (or just int...).
You can check this as an example: #120224
This will negate the performance gain; Blender's Set is really slow, and this technique works really well. I've spent several days testing and refining it.
@JacquesLucke For this technique I use a dual advancing front, which a Set/hash table does not quite map onto. It is the core of this PR.
I haven't analysed this case in detail yet. Multi-threaded write access to a hash table like it's done here doesn't feel safe at first glance, but I will need to check that again.
I don't know where the speedup here comes from specifically. If it is the linear probing, then that can also be done with Set. The probing strategy is customizable as a template parameter; one can just pass in LinearProbingStrategy instead of PythonProbingStrategy. I also recommend reading the descriptions of the probing strategies.
There might be other reasons why a custom hash table is better here, but those need to be justified a bit more.
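For reference, a minimal sketch of passing the probing strategy; the template parameter order of Set (key type, inline buffer capacity, probing strategy) is assumed here, and this says nothing about thread safety:

/* Sketch only: request linear probing without a custom hash table. Set must
 * still not be written to from multiple threads concurrently. */
blender::Set<int, 4, blender::LinearProbingStrategy> queue_next_set;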
Honestly, I was just reading the patch description now. Will have a closer look at how your implementation works soon.
The main reason for using a preallocated vector/array is to allow independent writes to each bucket in a performant way. Does Blender's Set allow preallocation and independent writes?
I just tried using VectorSet and Set for this and it didn't work. It crashes Blender.
You cannot write to these data structures from multiple threads.
Hence they are not a suitable recommendation. I've refactored that part into a separate function, so everyone should be happy now ;)
This is still not the best if Blender has 2 separate hash tables though. Glad to know if this is okay.
If we had a performant lock-free and wait-free hash table in Blender, I could substitute it in.
I suggest reviewers actually compile and test the branch to clear any doubt about its correctness.
Then in another PR I could use it to implement a lock-free and wait-free Vector or hash table.
@ -80,0 +80,4 @@
/* Open addressing with linear probing that minimizes
* collission in hash tables can also be used to avoid
* parallel data storage collisions */
static void lock_free_emplace(const int edge_other, const int non_checked, Vector<int> &queue_next)
Thanks for separating this out into a small function.
If you don't want to push/drop elements: Vector<int> &queue_next -> MutableSpan<int> queue_next
Nice tip, thanks!
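A sketch of the suggested signature change, reconstructed from the snippets quoted in this review rather than the exact patch; the helper only writes into pre-sized storage, so MutableSpan communicates that no push or reallocation happens:

/* Open addressing with linear probing over pre-allocated slots. Note: as
 * discussed below, the unsynchronized check-then-write is still a data race
 * when called from multiple threads. */
static void lock_free_emplace(const int edge_other,
                              const int non_checked,
                              MutableSpan<int> queue_next)
{
  int64_t idx = edge_other % queue_next.size();
  while (queue_next[idx] != non_checked) {
    idx++;
    if (idx == queue_next.size()) {
      idx = 0;
    }
  }
  queue_next[idx] = edge_other;
}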
@ -147,0 +160,4 @@
}
Vector<int> queue;
Vector<int> queue_next;
If the next 10 lines of code are not related to queue_next, it's probably better to just move this further down.
@ -209,0 +245,4 @@
});
queue.clear();
for (int i = 0; i < new_size; ++i) {
-> const int edge_i : IndexRange(new_size)
Can you add a table of benchmarks for the different parts of the code to the PR description?
Put simply, I don't like the idea of using threads in a case where you can use a log(N) search; I'm talking about the N^2 nearest search.
Or at least, I'd like to see this as a separate change, with separate benchmarks for different sizes of data.
Next one: lock free. There is no lock free here. Currently this PR contains only UB and data races rather than lock-free code.
I see how to fix the hashmap case, but I am not sure about sculpt_geodesic_mesh_test_dist_add. Please check std::atomic<int> for the hashmap, and prove why sculpt_geodesic_mesh_test_dist_add is acceptable inside a parallel loop.
See as additional resources: https://stackoverflow.com/questions/39393850/is-incrementing-an-int-effectively-atomic-in-specific-cases/39396999#39396999
And: https://en.cppreference.com/w/cpp/atomic/atomic/compare_exchange
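As an illustration of the compare_exchange suggestion (the helper name is hypothetical, not from the patch), claiming a slot could look roughly like this:

#include <atomic>

/* Only the thread whose compare-exchange succeeds owns the slot; a false
 * return means another thread claimed it first and the caller should probe
 * the next index. */
static bool try_claim_slot(std::atomic<int> &slot, const int non_checked, const int edge)
{
  int expected = non_checked;
  return slot.compare_exchange_strong(expected, edge, std::memory_order_relaxed);
}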
@ -80,0 +92,4 @@
idx = 0;
}
}
queue_next[idx] = edge_other;
This is a data race. Any other thread can change queue_next[idx] after the queue_next[idx] != non_checked check.
That's an assumption that there's a data race; in the actual running code, even with heavy meshes, a data race would show up as unmasked vertices surrounded by masked vertices. I noticed that with previous algorithms, where I could reliably trigger a data race.
I couldn't find any error in the masking currently.
Again, I suggest reviewers apply this patch to Blender and test it.
If you add an assert(queue_next[idx] == non_checked) and run in debug mode you'll hit it. The reason it might not show up in the final mask is probably that vertices/edges are reachable from separate "paths", so the chance of "missing" a scheduled edge is lower because of that, but it is not zero.
@ -172,0 +204,4 @@
if (dists[edge_vert_a] > dists[edge_vert_b]) {
std::swap(edge_vert_a, edge_vert_b);
}
sculpt_geodesic_mesh_test_dist_add(vert_positions,
This is a data race if you call this function in a parallel loop: sculpt_geodesic_mesh_test_dist_add performs read/modify operations on dists in a random-access manner.
Ok, let's drop this PR and leave the old slow code then, since that is what people want.
I don't want to say that I don't want to improve the speed, but could you fix the issues I pointed out?
Thanks for the links! I did try atomics and a guarded threading approach (it's in my commit history), but the overhead makes it not worth the change; it eats the performance away to the point of being almost as slow as before.
This was intended to be a stop-gap change to gain some performance while a deeper refactor, one that eliminates the need to pre-calculate a falloff field for the whole mesh, lands. But now I'm even more concerned about whether such refactors are even possible anymore for normal contributors.
Thanks everyone for their reviews and opinions.
First of all: a data race is a bug that nobody would love to fix. And more importantly, how is such a data race ever going to be confirmed when it gets reported?
Do you plan to be the cause of some bug report that cannot be intentionally confirmed in most cases? Each month some new user will go into that report to leave their mark and say "blender is bugged".
Even if the error probability is low, the error can still happen. And if there is no rule that always defines the correct result, the error might occur.
I'm glad that you can't find the issue. It's a good way to force me to spend a day actually reproducing the bug (a data race! The last one I had was CoW disease, and I spent 4 hours just to simplify the bug file; not to trigger it, just to simplify it! Hmm... okay, actually the last one was in disjoint... it was still a very corner case).
But... I did that.
Here is the diff of the 2 calls of geodesic_mesh_create:
As we can already see, this is not a deterministic algorithm.
And the fact that I can move the mouse and the selection will be different actually points to there being a gradient in the difference.
And as you can see, the texture of the difference looks like an attractor; the more distant it is, the larger it gets, I think.
You can see the diff in the attached file sculpt_topo_diff_of_diff.txt.
And now... here is how it looks in the case I made:
File sculpt_plane_bug.blend
And how to reproduce:
The mesh is generated by the geometry nodes group.
And now: I cannot reproduce the issue in main, though.
I see 2 fair ways to do things:
But I don't see the point of changing the code to improve the speed while swapping the old bug for a new one.
Even more, I don't see how this can be accepted if you don't want to use real atomic types to protect at least the hash table and instead just choose a plain int. Yes, exactly: that is the cost of correctness.
And the last thing I noticed: you still have not fixed the BitVector data race that was pointed out by Hans. So there are 3 data races in just one PR.
In the last half year I have made several PRs related to the Shortest Edge Path node:
Honestly, I am not really deep into the logic of geodesic path computation, so I can't really describe things, but I can see some relation to A*.
I got to the point where memory bandwidth is the speed limitation; the next improvements can only come from multithreading or assembler. The second is not really realistic, so the first has the most potential.
But this is a wall for me: getting a multithreaded A* without a lot of memory usage, stable, and without overhead.
If the geodesic field can be solved the same way, this will probably be the fastest implementation.
For a start, you can try to reduce memory bandwidth in the hot loop and move thread-unrelated operations out of the loop (for example, prepare some data in a parallel loop before the sequential one).
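One hedged example of that last suggestion (names follow the surrounding diff but this is illustrative, not part of the patch): per-edge lengths are independent of each other, so they could be computed in a parallel loop up front and only read inside the serial hot loop.

/* Illustrative sketch: precompute squared edge lengths in parallel so the
 * sequential BFS only reads them. */
Array<float> edge_lengths_sq(edges.size());
threading::parallel_for(edges.index_range(), 4096, [&](const IndexRange range) {
  for (const int i : range) {
    edge_lengths_sq[i] = math::distance_squared(vert_positions[edges[i][0]],
                                                vert_positions[edges[i][1]]);
  }
});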
@mod_moder Let me start by thanking you. This is a really convincing explanation with facts I can use to improve on, and not the battle of egos that PR reviews tend to degenerate into.
I will reopen the PR and will address the open issues.
I was testing this change on a Mac Studio with M2 Ultra (24 cores). The time on monoHD.blend before this patch is 5.8 seconds to execute geodesic_mesh_create(). With ideal threading code we should be able to lower it down to 240 ms (5.8 / 24).
With the current state of the PR, the timing went down to approximately 800 ms. It is a big change, although it still feels like there is room for improvement.
For the approach itself, I do not think we should be allowing algorithms which rely on race conditions. Ideally, all algorithms we implement in Blender are deterministic, otherwise troubleshooting will become a pure nightmare. There are some exceptions, like GPU rendering, which is not strictly deterministic because samples could be accumulated in a different order, but that was a conscious decision.
I've tried a couple of different approaches here, with the general idea of letting threads operate purely on local data, and then somehow accumulating the results of their work. In this case the result of each thread was a Set of new edges it wants to append to the queue.
I've tried to use parallel_reduce, which combines Sets. I forget the exact speed, but it was way slower than the proposed PR, possibly because of a lot of allocations or other memory access of temporary Sets. Maybe there is a different way of getting it to work; I didn't have much time to try harder.
What I found works the best is making threads use a Set from TLS, and combining the TLS "manually" after the parallel loop. This seems to avoid most of the overhead, and it actually lowered the overall time of geodesic_mesh_create() to about 300 ms. This is way closer to the ideal scaling, but also a bit suspicious, so a second pair of eyes would be very much welcome.
I've attached the modified file. It is in a bit of a messy state with both the PR and my modifications side-by-side. It should be relatively easy to clean it up and verify correctness.
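A sketch of the TLS idea described above, assuming Blender's threading::EnumerableThreadSpecific; the advance_front name and the gather_next_edges callback are illustrative, not the attached patch:

#include "BLI_enumerable_thread_specific.hh"
#include "BLI_function_ref.hh"
#include "BLI_set.hh"
#include "BLI_span.hh"
#include "BLI_task.hh"
#include "BLI_vector.hh"

using namespace blender;

/* Each thread fills its own Set from thread-local storage; the per-thread sets
 * are merged serially afterwards, so no shared container is written concurrently. */
static Vector<int> advance_front(const Span<int> queue,
                                 const FunctionRef<void(int edge, Set<int> &r_next)> gather_next_edges)
{
  threading::EnumerableThreadSpecific<Set<int>> next_tls;
  threading::parallel_for(queue.index_range(), 1024, [&](const IndexRange range) {
    Set<int> &local_next = next_tls.local();
    for (const int i : range) {
      gather_next_edges(queue[i], local_next);
    }
  });
  /* Serial merge of the per-thread results into the next front. */
  Vector<int> queue_next;
  for (const Set<int> &local_next : next_tls) {
    for (const int edge : local_next) {
      queue_next.append(edge);
    }
  }
  return queue_next;
}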
@ -2141,6 +2143,11 @@ static int sculpt_expand_invoke(bContext *C, wmOperator *op, const wmEvent *even
return OPERATOR_CANCELLED;
}
const int totvert = SCULPT_vertex_count_get(ss);
I think this could also be split off into a separate PR, or maybe even landed directly, as it is a very minor and obvious cleanup.
@ -50,3 +50,3 @@
}
BLI_assert(dists[v1] != FLT_MAX);
BLI_assert(dists[v1] != std::numeric_limits<float>::max());
I am not sure how I feel about it. The code becomes much longer, and when we know the type at compile time I am not sure we are benefitting from numeric_limits. If the type was templated then sure, we'd have to use numeric_limits.
To me FLT_MAX reads much, much easier. But I am also curious what others think of it.
@ -77,16 +77,33 @@ static bool sculpt_geodesic_mesh_test_dist_add(Span<float3> vert_positions,
return false;
}
/* Open addressing with linear probing that minimizes
We do have a line limit of 99; there is no need to wrap lines so early. I think using all the available horizontal space makes the comment easier to follow.
collission -> collision
@ -161,1 +186,3 @@
}
/* queue and queue_next should not differ by a huge amount of size
* since they are an advancing front in the mesh hence no need
* to allocate much more memmory and keep the iteration range smaller */
memmory -> memory
Thanks @Sergey, I just pushed my changes before seeing your review and code. What I found in my thread-safe implementation is that it is actually slower than the single-threaded one :( Adding atomics for sculpt_geodesic_mesh_test_dist_add() and using tbb::concurrent_vector imply an overhead that erases the good speed the unsafe code had 😅.
Will check your implementation now.
For comparison on an M1 MBA
ST
Mono
ST Bottleneck: time: 0.225167 seconds
ST Bottleneck: time: 0.203988 seconds
MonoHD
ST Bottleneck: time: 6.95426 seconds <----
Plane
ST Bottleneck: time: 0.252023 seconds
SMT
Mono
SMT Bottleneck: time: 0.271475 seconds
SMT Bottleneck: time: 0.224629 seconds
MonoHD
SMT Bottleneck: time: 10.3282 seconds <------
Plane
SMT Bottleneck: time: 0.281746 seconds
@ -52,2 +65,2 @@
BLI_assert(dists[v1] != FLT_MAX);
if (dists[v0] <= dists[v1]) {
BLI_assert(dists[v1].load() != std::numeric_limits<float>::max());
if (dists[v0].load() <= dists[v1].load()) {
Before you start moving forward with this, I'd like to describe why this doesn't work:
std::atomic::load is not something magic. The only thing that is protected now is that the compiler and CPU will not reorder the loads of both floats (for example). But someone else can still write something into dists[v1] between the assert and the condition of the if statement. This is why I linked atomic/compare_exchange in my second review. So you have to check whether the value is what you expect and try to rewrite it; if the result is false, you have to try again.
And even this is not the full fix of the issue. Some other thread might find a local minimum and write it into dists[v0], and the current thread, having a smaller local minimum, should change that value. So now you have to rewrite them. This is the additional overhead I talked about in my last review.
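A sketch of the retry loop being described, assuming dists is actually declared as an array of std::atomic<float> rather than obtained through a cast; the helper name is illustrative:

#include <atomic>

/* Keep attempting the exchange until either the stored distance is already
 * smaller or our smaller value wins the race; a failed compare_exchange_weak
 * refreshes `current` with the latest stored value. */
static void atomic_min_dist(std::atomic<float> &dist, const float new_dist)
{
  float current = dist.load(std::memory_order_relaxed);
  while (new_dist < current &&
         !dist.compare_exchange_weak(current, new_dist, std::memory_order_relaxed)) {
    /* Retry with the refreshed `current`. */
  }
}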
@mod_moder You're right; if it's already getting slower than the single-threaded version, adding more thread safety will only increase the overhead. When I closed the PR on Saturday it was mainly because I realized the performance gain was at the expense of thread safety, and I had been lucky so far not to encounter a test that showed the error (even though it looked like a tantrum 😅).
I reopened it later because maybe it could be saved with other data containers Blender already has that I may not be aware of. But I'm getting more convinced that only an algorithmic refactor of the expand operator will bring a true and safe performance gain (and scalability), and that this route of just adding multithreading on top of the existing algorithm is a dead end.
The geodesic calculation is just too dependent on previous state and not heavy enough to offset the overhead from thread safety.
If all agree, I can close it as it is and attempt a better refactor, or try a few more rounds with @Sergey's patch.
@Sergey your code has a bug unfortunately: instead of is_edge_scheduled_map.clear() you need something like is_edge_scheduled_map.fill(false).
I did a little test myself: https://projects.blender.org/blender/blender/compare/main...JacquesLucke:geodesic-speedup-test
It speeds up geodesic_mesh_create from 10.3 ms to 1.5 ms for me (12 cores, 24 threads).
Yup, will give @Sergey's code a shot; if it is fully thread safe then it will be very much worth it to save this PR ;)
Hmm, true. Just noticed that too. I don't really have a good solution here. In my previous comment I was also only focused on speeding up the queue. sculpt_geodesic_mesh_test_dist_add seems to be a bigger issue. Would need to look into the entire algorithm more to better understand if and how this can be solved.
@JacquesLucke Ah, true, the clear() was doing something else. So on the not-so-much-algorithmic-changes side my change indeed "failed".
However, why would we want to schedule an edge again? Intuitively, it leads to extra calculations of things which are already calculated, with some possibility of early output due to things like dists[ev_other] != FLT_MAX. So if we do not clear the bitmap, we avoid those extra calculations, lowering the overall time?
And yeah, sculpt_geodesic_mesh_test_dist_add I did not notice before you mentioned it. Maybe some purely technical exercise of adding some atomics could work there, or maybe we'd need to find a different formulation of the entire algorithm. Not sure yet.
I don't know for sure. I assume that's part of the algorithm. The shortest path based on the topology may be longer than the shortest geodesic path. In this case it feels like the same edge may need to be processed more than once.
I think your previous solution actually implemented the behavior where the bits were not cleared. When I tried it, it worked well in some cases, but just didn't work on others.
Ok, based on @Sergey's modified code I've added a safeguard for updating the dists array, and it still comes out ahead in performance compared to the serial implementation.
@mod_moder please take a look, and if we all agree I can start pushing for final reviews.
@ -74,0 +80,4 @@
* the actual order of update is irrelevant here */
std::atomic<float> &atomic_dist_v0 = reinterpret_cast<std::atomic<float> &>(dists[v0]);
if (atomic_dist_v0.load() > dist0) {
atomic_dist_v0.store(dist0);
Note that there is still a race condition here. Another thread may change atomic_dist_v0 between the load and the store.
Also, there is undefined behavior in every access of the dists array (one thread may read a value from it that another thread is writing at the same time). You might be able to get away with std::memory_order_relaxed in a few places though, I haven't checked in detail yet.
As I said in #120125 (comment), this is UB simply because there is a cast of float to atomic float.
It is, but that's not really the core of the problem here currently imo.
Well, that's true. I'm starting to think that this is not the best direction to go. It might be better to start by researching a scientific paper on multithreaded geodesic distance fields, and stop trying to just put yet another synchronization primitive in a new place.
Closing this PR since these attempts are a dead end, as I previously mentioned. There is no point spending more time on research since the approach of calculating a field for the whole mesh does not scale well anyway, even with multithreading.
The expand operator needs a deeper refactor than that.
Thanks everyone for their feedback.
If needed, we can start from scratch in a fresh PR if worthwhile research is found.
Pull request closed