Sculpt: Improve Expand performance #120125

Closed
Raul Fernandez Hernandez wants to merge 181 commits from farsthary/blender:improve-expand-perf into main


Improved performance of the geodesic expand operator initialization

Implemented a parallel Breadth-First Search (BFS) to calculate surface
geodesics instead of a serial BFS. Performance is significantly
improved, and the difference becomes more evident the denser the mesh is.

Novel in this approach is fast lock-free, wait-free parallel storage:
open addressing with linear probing in a vector, which surpasses
tbb::concurrent_vector's performance. This matters because the storage
is a critical component for performance.

Resolves #102586

For some numerical comparisons:

In monoHD.blend with 2,016,578 vertices on a MacBook Air M1 (8 GB RAM),
initialization time:
- Before: 6.56252 seconds
- After: 2.63308 seconds

Raul Fernandez Hernandez added 18 commits 2024-04-01 03:01:17 +02:00
Author
Member

Performance scales better now with the number of cores and with topological mesh density.
Comparison numbers before and after the change will be posted soon.

Raul Fernandez Hernandez self-assigned this 2024-04-01 03:03:55 +02:00
Raul Fernandez Hernandez added this to the Sculpt, Paint & Texture project 2024-04-01 03:04:06 +02:00
Raul Fernandez Hernandez added the
Module
Sculpt, Paint & Texture
label 2024-04-01 03:04:21 +02:00
Raul Fernandez Hernandez changed title from WIP: Fix ##102586 Sculpt Mode: Improve Expand performance to WIP: Fix #102586 Sculpt Mode: Improve Expand performance 2024-04-01 03:07:50 +02:00
Raul Fernandez Hernandez added 1 commit 2024-04-01 03:24:18 +02:00
Raul Fernandez Hernandez added 1 commit 2024-04-01 20:51:13 +02:00
Raul Fernandez Hernandez added 1 commit 2024-04-01 22:36:06 +02:00
Raul Fernandez Hernandez added 1 commit 2024-04-04 01:26:28 +02:00
Raul Fernandez Hernandez added 1 commit 2024-04-04 04:30:32 +02:00
Raul Fernandez Hernandez added 1 commit 2024-04-04 22:05:11 +02:00
Raul Fernandez Hernandez added 1 commit 2024-04-05 01:20:12 +02:00
Raul Fernandez Hernandez added 1 commit 2024-04-05 04:09:46 +02:00
Raul Fernandez Hernandez added 1 commit 2024-04-05 04:20:39 +02:00
Raul Fernandez Hernandez added 1 commit 2024-04-05 04:22:55 +02:00
Raul Fernandez Hernandez changed title from WIP: Fix #102586 Sculpt Mode: Improve Expand performance to Fix #102586 Sculpt Mode: Improve Expand performance 2024-04-05 04:52:18 +02:00
Raul Fernandez Hernandez requested review from Brecht Van Lommel 2024-04-05 04:52:46 +02:00
Raul Fernandez Hernandez requested review from Sergey Sharybin 2024-04-05 04:52:46 +02:00
Raul Fernandez Hernandez requested review from Hans Goudey 2024-04-05 04:52:47 +02:00
Author
Member

Disregard the many intermediate commits I've made; most are partial cloud backups and provisional approaches that will be replaced later, but I wanted to keep a history of them.

Author
Member

This will be the first of a series of partial PRs to improve the performance of the different algorithms in the expand operator, keeping each PR small and manageable.

Also, the expand operator will likely require a deeper refactor to be truly scalable with mesh size, with a more dynamic or lazy initialization that avoids calculating a field value for every vertex even when unused. Adding multithreading to the current implementation is a step in the right direction for squeezing out more performance now.

Raul Fernandez Hernandez added 1 commit 2024-04-05 15:28:29 +02:00
Raul Fernandez Hernandez added 1 commit 2024-04-08 03:59:43 +02:00
Raul Fernandez Hernandez added 1 commit 2024-04-08 17:56:11 +02:00
Hans Goudey requested changes 2024-04-10 20:37:00 +02:00
@ -1497,10 +1480,12 @@ static void sculpt_expand_update_for_vertex(bContext *C, Object *ob, const PBVHV
sculpt_expand_mask_update_task(ss, mask_write, expand_cache->nodes[i]);
}
});
SCULPT_flush_update_step(C, SCULPT_UPDATE_MASK);
Member

Unrelated change

farsthary marked this conversation as resolved
@ -2140,6 +2124,11 @@ static int sculpt_expand_invoke(bContext *C, wmOperator *op, const wmEvent *even
if (!BKE_base_is_visible(v3d, base)) {
return OPERATOR_CANCELLED;
}
/* Do nothing when the mesh has 0 vertices. */
Member

Unrelated change?

farsthary marked this conversation as resolved
@ -2222,7 +2204,6 @@ static int sculpt_expand_invoke(bContext *C, wmOperator *op, const wmEvent *even
ss->expand_cache, ob, ss->expand_cache->initial_active_vertex, falloff_type);
sculpt_expand_check_topology_islands(ob, falloff_type);
Member

Unrelated change

farsthary marked this conversation as resolved
@ -8,6 +8,7 @@
#include <cmath>
#include <cstdlib>
#include <thread>
Member

Unnecessary include?

farsthary marked this conversation as resolved
@ -77,6 +78,7 @@ static bool sculpt_geodesic_mesh_test_dist_add(Span<float3> vert_positions,
return false;
}
#define UNASSIGNED -1
Member


Typically it's preferred to use `const int` declared inside the function or so

Sorry, but now this line has a different issue: this is UB.

@ -146,0 +137,4 @@
threading::parallel_for(IndexRange(0, totvert), 4096, [&](IndexRange range) {
for (const int i : range) {
if (len_squared_v3v3(v_co, vert_positions[i]) <= limit_radius_sq) {
affected_vert[i].set();
Member


Manipulation of a bitmap is not threadsafe because it's a read-modify-write operation. It should be converted to a boolean array, or you can use `threading::parallel_for_aligned` and `bits::BitsPerInt` to align the task boundaries on integer boundaries.
farsthary marked this conversation as resolved
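The boolean-array alternative the review suggests can be sketched as follows (a minimal illustration using `std::thread` in place of Blender's `threading::parallel_for`; the function name is hypothetical). One byte per vertex is safe because each index is its own memory location, whereas setting bits of a shared bitmap word from two threads is a read-modify-write race:

```cpp
#include <cstdint>
#include <thread>
#include <vector>

/* Hypothetical sketch: mark vertices from two threads over disjoint
 * index ranges. With one byte per element, the writes touch disjoint
 * memory locations and no synchronization is needed. A bitmap would
 * race wherever two threads' indices share the same machine word. */
static std::vector<uint8_t> mark_all_parallel(const int totvert)
{
  std::vector<uint8_t> affected_vert(totvert, 0);
  auto mark_range = [&](const int begin, const int end) {
    for (int i = begin; i < end; i++) {
      affected_vert[i] = 1; /* Disjoint byte writes: no race. */
    }
  };
  std::thread a(mark_range, 0, totvert / 2);
  std::thread b(mark_range, totvert / 2, totvert);
  a.join();
  b.join();
  return affected_vert;
}
```

The other option mentioned, `threading::parallel_for_aligned` with `bits::BitsPerInt`, keeps the bitmap but aligns task boundaries so no two tasks touch the same integer word.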
@ -147,0 +144,4 @@
}
}
std::vector<int> queue, queue_next;
Member


In Blender code `Vector` is preferred over `std::vector`. That's documented more in the headers and here: https://developer.blender.org/docs/features/core/blenlib/containers/
farsthary marked this conversation as resolved
@ -147,0 +148,4 @@
queue.reserve(totedge);
/* Add edges adjacent to an initial vertex to the queue.
* Since initial vertex are few only, iterating over its neighbour edges
Member

typo/grammar

Suggest "Since there are typically few initial vertices"

farsthary marked this conversation as resolved
@ -147,0 +149,4 @@
/* Add edges adjacent to an initial vertex to the queue.
* Since initial vertex are few only, iterating over its neighbour edges
* instead of over all edges scales better as mesh edge count increases */
Member

Missing period at the end

farsthary marked this conversation as resolved
@ -147,0 +153,4 @@
GSetIterator gs_iter;
GSET_ITER (gs_iter, initial_verts) {
const int seed_vert = POINTER_AS_INT(BLI_gsetIterator_getKey(&gs_iter));
for (const int e : ss->vert_to_edge_map[seed_vert]) {
Member


Typically in mesh code I'd prefer to see longer variable names like `edge` or `edge_index`
farsthary marked this conversation as resolved
@ -180,3 +191,3 @@
}
for (const int v_other : corner_verts.slice(faces[face])) {
if (ELEM(v_other, v1, v2)) {
if (ELEM(v_other, v1, v2) || !sculpt_geodesic_mesh_test_dist_add(
Member

Keep these as separate if statements, it's more readable that way. It's really much simpler to review if the changes are as small as necessary to achieve the speedup

farsthary marked this conversation as resolved
@ -202,2 +199,2 @@
BLI_LINKSTACK_PUSH(queue_next, POINTER_FROM_INT(e_other));
}
for (const int e_other : ss->vert_to_edge_map[v_other]) {
const int ev_other = (edges[e_other][0] == v_other) ? edges[e_other][1] :
Member


Use `bke::mesh::edge_other_vert`
farsthary marked this conversation as resolved
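For reference, the helper the review points to behaves like the sketch below (the real function lives in `bke::mesh` and its exact signature may differ; this just replaces the open-coded ternary from the diff):

```cpp
#include <array>

/* Sketch of edge_other_vert semantics: given an edge (a pair of vertex
 * indices) and one of its two vertices, return the other vertex.
 * Hypothetical standalone version; the real helper is in bke::mesh. */
static int edge_other_vert(const std::array<int, 2> &edge, const int vert)
{
  return (edge[0] == vert) ? edge[1] : edge[0];
}
```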
@ -204,0 +213,4 @@
* collission in hash tables can also be used to avoid
* parallel data storage collisions */
size_t idx = e_other % new_size;
while (queue_next[idx] != UNASSIGNED) {
Member

I'm skeptical that this will give deterministic behavior. That's typically required here, otherwise things get very hard to debug.

Author
Member

For the initial falloff field creation (one falloff value per vertex), the order in which the values are calculated does not matter; at the end of the algorithm, every vertex will have its falloff calculated. Whether it is vertex[0], vertex[1]...vertex[n] or vertex[2], vertex[0]...vertex[1], vertex[n] is irrelevant here. It is deterministic in the sense that the results are all the same regardless of the order of calculation. It is well worth the speed gain for the user. This code will be executed infinitely more often than it will be debugged.


If geometry nodes lets you view randomly selected vertices (by vertex index), shouldn't this cause vertex flickering?

Member

Just going to mark this as "not resolved" since there was no change here. Better that the conversation is visible for someone reading the diff.

Author
Member

> If geometry nodes lets you view randomly selected vertices (by vertex index), shouldn't this cause vertex flickering?

I don't think so; this is an initialization step that happens only when the falloff field is created, and it is used only in the expand operator.

farsthary marked this conversation as resolved
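The order-independence argument above can be illustrated with a toy fixpoint relaxation (a hypothetical graph with unit edge weights; not the patch code). Per-vertex distances computed by repeated min-relaxation reach the same fixpoint no matter the order in which edges are processed:

```cpp
#include <limits>
#include <utility>
#include <vector>

/* Toy illustration: relax an undirected edge list to a fixpoint.
 * Because every update takes a minimum, the converged distances are
 * independent of the edge processing order. Unit edge weights. */
static std::vector<float> relax_to_fixpoint(const std::vector<std::pair<int, int>> &edges,
                                            const int verts_num,
                                            const int seed)
{
  const float inf = std::numeric_limits<float>::max();
  std::vector<float> dist(verts_num, inf);
  dist[seed] = 0.0f;
  bool changed = true;
  while (changed) {
    changed = false;
    for (const auto &[a, b] : edges) {
      if (dist[a] != inf && dist[a] + 1.0f < dist[b]) {
        dist[b] = dist[a] + 1.0f; /* Min-update: order-independent fixpoint. */
        changed = true;
      }
      if (dist[b] != inf && dist[b] + 1.0f < dist[a]) {
        dist[a] = dist[b] + 1.0f;
        changed = true;
      }
    }
  }
  return dist;
}
```

Processing the same edge list forward and reversed yields identical distances, which is the sense of "deterministic" used in the reply above; it says nothing about reproducible intermediate states, which is what the reviewer's concern targets.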
@ -303,3 +321,3 @@
}
} // namespace blender::ed::sculpt_paint::geodesic
} /* namespace blender::ed::sculpt_paint::geodesic */
Member

Unrelated change

farsthary marked this conversation as resolved
tests/data Outdated
@ -1 +1 @@
Subproject commit 612e310bf7ee505a79dad7ea925916ec9a812995
Subproject commit ce850ef056c57d0e4127eda48a514b2c4a651435
Member

Unrelated change

Author
Member

I don't know how that file got changed, maybe a merge from main?

Member

Either way, it shouldn't be part of this PR.

Author
Member

I invoke the Git Gurus to help me get rid of this XD

Author
Member

fixed

farsthary marked this conversation as resolved
Hans Goudey changed title from Fix #102586 Sculpt Mode: Improve Expand performance to Sculpt: Improve Expand performance 2024-04-10 20:37:28 +02:00
Raul Fernandez Hernandez added 69 commits 2024-04-11 22:00:29 +02:00
3b0521976e cppcheck: suppress noisy output
Some of these could be enabled if there is an interest to consider
them warnings, until then they're noisy and don't often hint at errors
in the code.
ac88d0e5e7 Fix assert in the case of a null library_reference items
Missing return allowed a null return (which asserts) & ignored the
r_free assignment.
f9cdf830ea Cleanup: pass const arguments in freestyle functions
Also move implementation notes into the function body.
aeb6b3e136 Fix #120367: Attach to root panel if new parent is null
If pointer to new parent is null, use pointer to root panel of interface.

Pull Request: #120369
5ef9d0f51f Anim: add Animation data-block management functions
Add code (including RNA wrappers) for:

- Creating, removing, and accessing `Animation` data-blocks.
- Creating and removing layers, strips, and bindings on those `Animation`
  data-blocks.
- Accessing those via RNA.

Note that this does not include assignment to any animated data-block,
so it is of limited practical use.

For more info, see #113594.

Pull Request: #118677
65408d5c56 Anim: allow assigning Animation data-blocks
Expand the `AnimData` struct with an `Animation *` + a
`binding_stable_index` field, and properly handle those relations.

This also adds functionality for actually pointing animated IDs to
`Animation` data-blocks, and automatically hooking up the relevant
`Binding`.

The Depsgraph code is extended to take these new relations into account,
but doesn't trigger any animation evaluation yet.

For more info, see #113594.

Pull Request: #118677
e4b5cc9fb5 Anim: Add a simple GUI for assigning Animation data-blocks
Add a 'Baklava' panel to the 3D Viewport side-panel. It's a
developer-GUI, not meant for animators (or for inclusion beyond the
experimental feature, for that matter).

Note that this GUI shows all layer properties, even though the data
model is currently limited to a single layer. This means that things
like 'influence' and 'mix mode' are irrelevant, as there is no
underlying layer to mix with.

Also key insertion and animation evaluation are not implemented yet (but
will be in upcoming commits).

For more info, see #113594.

Pull Request: #118677
4709b491ce Anim: allow inserting keys in Animation data-block
Allow inserting keys into Keyframe strips (which is the only type of
strip that is currently implemented).

Note that the data model is currently limited to a single layer, with a
single infinite strip. Because of this, the strip will not be shown in
any UI, as there is no way to manipulate it anyway.

Note that the inserted keys are not yet evaluated, so the animation
isn't visible in the 3D viewport yet. That's for an upcoming commit.

For more info, see #113594.

Pull Request: #118677
2c9b0be956 Anim: add evaluation of Animation data-blocks
Include Animation data-block handling in Blender's animation evaluation
stack. If an `Animation` is assigned to an `ID`, it will take precedence
over the NLA and/or any `Action` that might be assigned as well.

For more info, see #113594.

Pull Request: #118677
250a92b5d8 Anim: show animated properties with the appropriate color in the GUI
Use the `Animation` data-block to find F-Curves, for drawing the
background color of properties in the GUI (green for 'animated', yellow
for 'keyed on this frame', etc.).

This assumes (correctly) that the `Animation` is limited to a single
layer with a single strip. As such, there is only one set of F-Curves
for every animated ID, which means that the correct F-Curve will be
found. This will need adjustment when the same property can have
multiple F-Curves (due to layers) or when an F-Curve may not be
applicable for each point in time (due to the use of finite strips).

Pull Request: #118677
5207259b96 Cleanup: move runtime data to Library.runtime
Having a better separation between runtime and non-runtime data makes
it easier to reason about the code.

Pull Request: #120271
bfdbede6a7 Fix #120374: Knife crashes when displaying measurements
Caused by 7ed02da2be.
`curr.ob_index` is -1 when the ray does not hit any face
(`knife_find_closest_face`). This will lead to crash when mouse
is not over any faces. To avoid crash, use `bvh.ob_index`.

Pull Request: #120392
963057f119 Fix #120071: quadratic complexity for dangling reroute check
To know if link is connected to dangling reroute and can be skipped
as value-less, we need to know if reroute is dangling. This requires
graph traversal. Currently this is done by non-recursive iteration.
But this can lead quadratic complexity for some of the cases.
Other way is to make this linear while cache building.

Pull Request: #120375
deba511d3e Compositor: Allow inter-operation canceling
This patch adds support for inter-operation canceling. Though it should
be noted that canceling will actually not take place except in certain
circumstances because operations are already submitted to the GPU by
this point and can't be canceled.

However, for operations that do CPU<->GPU transfers, like OIDN denoise,
which is one of the most expensive operations, this will actually cancel
the evaluation and greatly improve interactivity.

Pull Request: #119917
e91a110a5e Anim: change experimental flag label & linked issue on Gitea
Changed the experimental flag label from "Animation: Project Baklava" to
"New Animation System" to be more self-explanatory. Also I changed the
associated item on project.blender.org to #120406, which tracks the work
that is still to be done.
c06f269425 Anim: change label & description of experimental flag again
Change the experimental flag "New Animation System" to "New Animation
Data-block", as that's really what it is about. The tooltip is now also
more precise about what this experimental stage is about.
8649603b31 Cycles: Adjust Meteor Lake GPUs architecture version
Released devices are 12.70.4 or higher, there is no need to build for
12.70.0.
804a04e8fd EEVEE-Next: Fix performance regression
The SAFE_UNION_ACCESS is always defined.
The correct check is to test for its value.

Fix DATA_MEMBER declaration too.
2774f9c650 Collection Exporters: Enable file exporters to be specified on Collections
This implements the ability to have file exporters added and configured on Collections.

Exporting is reachable from several locations:
- Individually on each exporter configuration: The `Export` button in each panel header
- For all exporters on the collection: The `Export All` button in the main panel interface
- For all exporters on all collections in the scene: The `File`->`Export All Collections` button

Visibility of which collections currently have exporters configured is done by ways of an icon added to the Collection row in the Outliner.

Adding multiple exporters for the same file type is permitted. The user is free to setup several exports of the same format but with different file locations or settings etc.

Notes:
Only USD and Wavefront OBJ are enabled for the initial commit. Additional formats, including those implemented in Python will be added as separate commits after this.

Ref #115690
Pull Request: #116646
3151b83bbb Cleanup: remove redundant variables & assignments
Co-authored-by: Sean Kim <SeanCTKim@protonmail.com>
8338f9e0ce Cleanup: promote to size_t before calculating sizes
Avoids integer truncation although it's mostly theoretical in this case.
dc775e59df Fix potential null pointer dereference from the RenderView::imbuf
This member is checked for null elsewhere, account for a null imbuf
when accessing pixels too.
3869a8c852 Anim: Detailed report if no keyframes have been inserted
With this PR, when pressing `I` in the viewport and the code
is unable to insert **ANY** keyframes, the user will be presented
with a single message detailing exactly why it has failed.

This PR promotes the functionality introduced in
#117449 into the header file so it can be used elsewhere.

The `CombinedKeyingResult` class is returned
from `insert_key_action` and `insert_key_rna`, and
used to produce a single report from the operator if it
failed to insert any keyframes.

In order to easily create a report from a `CombinedKeyingResult`
the function `generate_keyframe_reports_from_result`
has been moved into the class as `generate_reports`.

In addition to that the `UNABLE_TO_INSERT_TO_NLA_STACK` result
has been added. This notifies the user if keyframe insertion is not
possible due to NLA stack settings.

Pull Request: #119201
3a9b7d076a Anim: thread remake_graph_transdata
On animations with high key counts, `remake_graph_transdata`
takes most of the compute time when moving keys.
This patch threads the loop over FCurves in that function to speed things up.

Test file with 10.000 keys per F-Curve
| - | Before | After |
| - | - | - |
| Moving 1 key of each FCurve |  ~2200ms | ~285ms |
| Moving a single key | ~0.70ms | ~0.72ms |

As demonstrated in the measurements, this speeds up the
case of modifying a lot of data, while not impacting the case
of modifying very little data.
The measurements were taken on an 8c/16t CPU.
The higher the thread count, the better the performance gain.

Measurements of `remake_graph_transdata` using the following test file.
https://download.blender.org/ftp/sybren/animation-rigging/heavy_mocap_test.blend

Pull Request: #119497
81e2b7d7d5 BLI: IndexMask: Add binary set operations
The `IndexMask` class already had a static function `from_union`.
This adds two new functions `from_difference` and `from_intersection`
as well as tests for each of them.

It also uses `from_intersection` in two grease pencil utility functions.

Pull Request: #120419
afa8fbfd8e Fix #120144: Invert Visible doesn't invert Face Sets properly
Caused by 53418fc3a1.
corner vertices of hidden face sets are not `1/true/hidden` (which
is expected). So inverting then hides these corner verts, hence the
small region of face sets is visible afterwards.
This could be avoided if `.hide_poly` attribute is used for changing the
visibility.

Pull Request: #120160
ad66f1caa0 Fix #119702: Set Position node behaves incorrectly on instances with selection
The selection was not taken into account properly and uninitialized data was used.

Pull Request: #120409
85a04ad05a Realtime Compositor: Support variable scaling
This patch adds support for variable scaling in the Scale node for the
Realtime Compositor. This is supported for the CPU compositor.

Pull Request: #120314
9dab4f319c Cycles: Properly default to Metal-RT off unless GPU is a M3 or newer
Ever since commit [1], `use_metalrt_by_default` will be True
if the GPU being used is not a M1 or M2 based system.
The intention of this was to enable MetalRT by default for
M3 and newer devices that have hardware for ray traversal.

However the side effect of this change was that all AMD GPUs would
have `use_metalrt_by_default` set to True. Which appears to be the
main culprit causing crashes on older AMD GPUs in #120126.
Since these GPUs don't support MetalRT.

This commit fixes this issue by only setting
`use_metalrt_by_default` to True if the GPU is not M1 or M2 based,
and the GPU is Apple Silicon based. Which equates to M3 or newer.
Which is the original intent of this code.

This resolves the issue where AMD GPUs were being told to use MetalRT
by default, when they shouldn't be.

[1] 322a2f7b12

Pull Request: #120299
ebb4ac7494 Anim: Avoid loading alpha channel from theme for keyframe drawing
The theme has a color for each keyframe type. This is exposed in RNA and
the preferences UI as an RGB color, so without alpha channel. The code
that loads the color from the theme, however, does include that
unsettable alpha channel, though.

In practice this doesn't cause issues, as the versioning code ensures
that the default colors are stored with `alpha=uint8_t(255)`. While
developing, however, this caused me considerable headscratching, as I
was missing that little bit of versioning code, which means that the
default color was set to all-zeroes, and thus had a zero alpha.

This is now resolved by altering the drawing code to only load the RGB
component from the theme. This way only the user-settable bytes are
loaded, while the alpha component is handled a different way (setting it
to an absolute value, instead of multiplying the theme value).

No user-perceivable changes, just some developer comfort.

Pull Request: #120311
bbb6b0e50f Fix #118509: GPv3: frames disappear during transform in the timeline
While selecting and transforming multiple grease pencil frames in the
timeline, frames would sometimes disappear. This happened when the
transformation overlapped, meaning when a frame replaced another moving
one in the timeline. The frames transformation was happening in place
and in series, and thus led to the initial position of the frame being
removed, even if it was occupied by a freshly transformed frame.

This commit fixes the issue by storing two separate maps in the
transform data structure instead of one, one map to store the static
data, meaning the frames that are not affected by the transform, and
another one for the moving data, meaning the frames that are affected by
the transform.

Some changes were made to account for the potential duplication of the
frames (move + duplicate):
* the map of the duplicated frames was renamed to a more explicit name,
* when a duplication occurs, the original frame is stored in the static
  frames map, and the duplicate is stored in the moving frames map,
* thus the check for the duplicate flag of the operator is now done at
  the init step of the transform, instead of the update step, so that
  the maps are initialized accordingly.

Co-authored-by: Lukas Tönne <lukas@blender.org>
Pull Request: #119337
db3ccf0bc3 Anim: make CombinedKeyingResult plain old data
The data in CombinedKeyingResult was using BLI's Array type, which
heap allocates.  However, we know the array size at compile time,
and therefore can use std::array instead, as suggested in Array's
documentation.  This makes CombinedKeyingResult a
plain-old-data class.

Pull Request: #120425
78abe7d651 Fix #120436: "Show Fill Color" setting not working
The flag was completely unused in the code.
Now this flag does what it says: While drawing,
the fill color is (almost) not visible.

Pull Request: #120438
49977bee1c EEVEE-Next: Add back volume reprojection
This is done almost the same as legacy EEVEE.

The main difference is that we now take into
account the change in volume Z bounds during
reprojection.

The viewport smoothing experience has also been
improved, with less blurring when moving
objects.

We also still do the reprojection during
final rendering between samples. This improves
the convergence as the input to the film TAA
is less noisy.

Pull Request: #120320
be8869949a UI: Splash Screen Import Changes
Removal of "What's New" button, addition of a section divider, and many
very small changes to the text, some reordering of items.

Pull Request: #118224
b05d1ab52c UI: Removal of GRAPH_OT_keys_to_samples Confirmation
Removal of the dialog that pops up to confirm this operator. The
Animation & Rigging module feels this isn't needed since this action
is quite visual and hard to miss. This operator is also not often
used and might get removed in future as unnecessary.

Pull Request: #120074
3582319c4d Fix #120443: Knife tool loses precision on edges with high clip distance range
The previous algorithm for interpolating the vertices of an edge using
a lambda in screen space was heavily dependent on the resolution of the
projection matrix, leading to precision loss when the clip distance
range was high.

The updated algorithm now adjusts the lambda according to the
perspective before performing the interpolation, which improves
precision regardless of the clip distance range.
0e1c6c8c57 Fix: Matrix socket is always visible
Fix of typo in weakly typed API.

Pull Request: #120337
5f47ff5333 Fix: Crash calculating edges of empty mesh
At some point (Fill Curve node in my case) it is possible to have
empty task of edges. Simplest fix - ignore such tasks. In future
this might be improved (see: #120224).

Pull Request: #120330
b37bc24285 Cleanup: Use utility function to sample triangle bary coord
Also move the declaration of the BVH tree closer to its use.
9e30209725 Fix #120400: Remove cyclic dependency of pbvh type when building bmesh pbvh
This PR fixes a crash when enabling Dyntopo.

When the PBVH generation code for BMesh was changed in b3aca5b28f, a
bug was introduced where dependent code to generate necessary layers
for BMesh could no longer access the PBVH type, as it was not generated
yet. As the type is known in the code at time of construction, I opted
to add parameters specifying the PBVHType to necessary methods when
accessing attributes.

Pull Request: #120445
ebb2ca8055 Fix crash using "--command" on Windows
Use after free error accessing command line arguments on windows.
Resolve by duplicating the arguments when a "--command" is passed in.
6e32f8836e Fix error accessing the add-on install "target" without a context
Introspecting the add-on install operator from operator cheat-sheet
raised a Python exception.
067c674818 Fix: GPv3: Clear copy strokes buffer if nothing is selected during copy
If `clipboard.curves` has some curve and the copy operation is called
again when nothing is selected, the buffer is expected to be cleared.
Otherwise the paste operations will paste the previous copied curves.

Pull Request: #120454
e9add526d1 Fix #120421: Animated "Burn metadata into image" ignored during render
Seems like scene `RenderData` only gets synched once from `Scene` to
`BaseRender` in `RE_InitState` .
Animation on it is only evaluated on the scene, so the `BaseRender`
`RenderData` is not properly updated here.

However `R_STAMP_DRAW` is part of that and it is the `BaseRender`
`RenderData` that is checked in `do_render_full_pipeline` (not the
`Scene` one) to determine if we want to stamp.

Later calls to `BKE_render_result_stamp_info` / `renderresult_stampinfo`
/ `stampdata` always get passed the scene, so individual animation stamp
details (such as Render Time) work properly.

So to resolve, use the `Scene` `RenderData` (rather than the
`BaseRender` one) for proper animation update.

NOTE: this (not updating animation of members of `RenderData`) might
actually be a problem elsewhere, too -- haven't checked on this in detail
though

Pull Request: #120429
a1e1a6c346 GPv3: Deselect everything from all visible drawings when pasting strokes
To match with gpv2 behavior, fill selection attribute of all the visible
drawings with "false" value to deselect them so only the pasted stroke
from the clipboard is selected.

Pull Request: #120363
8089f37be1 Fix #120459: GPv3: Issue in conversion for layer parents
The layer parenting did not account for storing an
initial parent inverse matrix (to "keep transform").

This adds this matrix, stores it in DNA, and uses it
when we compute the parent matrices on demand.

Note: This PR does not set the parent inverse matrix
outside of conversion from GPv2. Support for
"keep transform" parenting will have to be added
in another PR.

Pull Request: #120462
4bc86cfe83 Curves: add operator to set bezier handle types
This adds an operator to set the handle types of bezier curves. It also
adds the same shortcut that is available in the legacy curve edit mode.

Pull Request: #120426
6b23c321cd UI: Add Rename operator in bone collection context menu
An operator to rename tree view items already exists: `UI_OT_view_item_rename`
Add it to the context menu as discussed in !120003

Pull Request: #120086
5aa16f5670 Fix #120402: memory leak in volume grid handling
The openvdb tree was not freed correctly because the destructor
of the `shared_ptr` was not run.
9890e738a9 Fix #120417: VSE crash rendering a scene strip with missing Editing data
In case a scene strip's `Input` was set to `Sequencer`, but the scene did
not actually have `Editing` data (e.g. when just adding a `New` scene as
a scene strip), Blender would crash trying to render.

Now simply check if the required `Editing` data is present before
proceeding.

Pull Request: #120418
c1534f6f18 BLI: Fix inverted 0 determinant for infinite orthographic projection
This was creating issues with triangle winding order since
the resulting matrix was degenerate (0 determinant), which
caused the Draw manager to wrongly invert the winding.

This fixes a bug in EEVEE-Next mesh voxelization for volume
rendering (with accurate method).
ff496fcc11 Fix #113182: Crash when animating Viewport visibility of Curves / GPv3 object
This is actually a deeper issue, which stems from the fact that updating
relations of the dependency graph does not properly handle cases where an
operation was previously skipped from evaluation (i.e. as a visibility
optimization).

The fix is to preserve needs-update flag throughout relations update of
a dependency graph, similar to the entry tags.

Pull Request: #120477
8bda7c9036 UI: Indicate type of search in keymap search button
Shows either "Search by Name" or "Search by Key-Binding" as placeholder text in
the search button. This makes it more clear what's expected to be entered here.

Personally I often did/do the mistake of searching for the wrong thing in this
button, because I forgot that I changed it earlier. The placeholder text can
avoid this mistake.

Pull Request: #113681
Raul Fernandez Hernandez added 68 commits 2024-04-12 01:16:14 +02:00
Raul Fernandez Hernandez added 1 commit 2024-04-12 02:28:13 +02:00
Raul Fernandez Hernandez reviewed 2024-04-12 03:09:28 +02:00
tests/data Outdated
@ -1 +1 @@
Subproject commit 612e310bf7ee505a79dad7ea925916ec9a812995
Subproject commit 42979e6ec67f657c7efce9cb672fe07b3f8fff9d
Author
Member

Applied
git reset origin/main tests/data
but still shows a file changed.

farsthary marked this conversation as resolved
Iliya Katushenock reviewed 2024-04-12 11:36:01 +02:00
@ -610,3 +610,2 @@
const float *co = SCULPT_vertex_co_get(ss, symm_vertex);
for (int i = 0; i < totvert; i++) {
PBVHVertRef vertex = BKE_pbvh_index_to_vertex(ss->pbvh, i);
threading::parallel_for(IndexRange(0, totvert), 4096, [&](IndexRange range) {

IndexRange(0, totvert) -> IndexRange(totvert)


I think it would be a good idea to split this off into a separate PR, as it will make it easier to present and review the separate modes of the Expand operator.

For example, splitting this code off into a separate PR gives an opportunity to:

  • Present an exact speedup on a complex mesh for the spherical fall-off expansion
  • Keep the code much more isolated, so it can land in the main branch much sooner, without being blocked by the trickier parts of the PR.
farsthary marked this conversation as resolved
@ -79,14 +79,13 @@ static bool sculpt_geodesic_mesh_test_dist_add(Span<float3> vert_positions,
static float *geodesic_mesh_create(Object *ob, GSet *initial_verts, const float limit_radius)
{
const int UNASSIGNED = -1;

const int UNASSIGNED = -1; -> constexpr const int non_checked = -1;

farsthary marked this conversation as resolved
@ -120,0 +111,4 @@
dists[i] = 0.0f;
}
else {
dists[i] = FLT_MAX;

FLT_MAX -> std::numeric_limits<float>::max()

farsthary marked this conversation as resolved
@ -147,0 +143,4 @@
}
}
Vector<int> queue, queue_next;

Vector<int> queue, queue_next; -> Vector<int> queue;

farsthary marked this conversation as resolved
@ -161,1 +167,3 @@
}
/* queue and queue_next should not differ by a huge amount of size
* since they are an advancing front in the mesh hence no need
* to allocate much more memmory and keep the iteration range smaller */

Vector<int> queue_next;

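For context, the `queue` / `queue_next` pattern in the comment above is a level-synchronous BFS: each iteration consumes the current advancing front and fills the next one, and the two buffers are swapped per step. A minimal single-threaded sketch of that pattern (a hypothetical helper using hop counts in place of the real geodesic distance update, not the PR's actual code):

```cpp
#include <climits>
#include <utility>
#include <vector>

/* Sketch of the dual-frontier traversal: `queue` holds the current advancing
 * front, `queue_next` collects the next one, and the two are swapped per
 * iteration. Hop count stands in for the real geodesic distance update. */
static std::vector<int> bfs_hop_distances(const int vert_count,
                                          const std::vector<std::pair<int, int>> &edges,
                                          const int seed)
{
  std::vector<int> dist(vert_count, INT_MAX);
  dist[seed] = 0;
  std::vector<int> queue{seed};
  std::vector<int> queue_next;
  while (!queue.empty()) {
    for (const int v : queue) {
      for (const std::pair<int, int> &edge : edges) {
        /* Find the vertex on the other side of the edge, if `v` is on it. */
        const int other = (edge.first == v)  ? edge.second :
                          (edge.second == v) ? edge.first :
                                               -1;
        if (other != -1 && dist[other] == INT_MAX) {
          dist[other] = dist[v] + 1;
          queue_next.push_back(other);
        }
      }
    }
    std::swap(queue, queue_next); /* Advance the front. */
    queue_next.clear();
  }
  return dist;
}
```

In the PR, the loop over the current front is what runs in parallel, which is why filling `queue_next` concurrently becomes the crux of the discussion below.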
farsthary marked this conversation as resolved
@ -167,1 +173,3 @@
int v2 = edges[e][1];
BitVector<> edge_tag(totedge);
while (!queue.is_empty()) {
threading::parallel_for_each(IndexRange(0, queue.size()), [&](const int val) {

`const int val` -> `const int index_of_...` (`value`?)

Are you sure this loop should be `parallel_for_each`, not `parallel_for`?

farsthary marked this conversation as resolved
@ -168,0 +174,4 @@
while (!queue.is_empty()) {
threading::parallel_for_each(IndexRange(0, queue.size()), [&](const int val) {
const int edge_index = queue[val];
int v1 = edges[edge_index][0];

int v1 = edges[edge_index][0]; -> int edge_vert_a = edges[edge_index][0];

Author
Member

This will add unnecessary verbosity in a context where it is clear what v1 and v2 mean.

farsthary marked this conversation as resolved
@ -204,0 +216,4 @@
/* Open addressing with linear probing that minimizes
* collission in hash tables can also be used to avoid
* parallel data storage collisions */
size_t idx = edge_other % new_size;

Please, just use blender::Set instead of all this C-like (usually unreadable) math.
Or, at least, move this into a separate function. It is hard to dig into the logic of a hash table as part of checking a geometrical algorithm. Also, still size_t -> const int64_t (or just int..).
You can check this as an example: #120224

Author
Member

This will negate the performance gain; Blender's Set is really slow and this technique works really well. I've spent several days testing and refining it.

@JacquesLucke
Author
Member

For this technique I use a dual advancing front, which a Set/hash table does not quite map to correctly. It is the core of this PR.

Member

I haven't analysed this case in detail yet. Multi-threaded write access to a hash table like done here doesn't feel safe at first glance. But I will need to check that again.

I don't know where the speedup here comes from specifically. If it is the linear probing, then that can also be done with Set. The probing strategy is customizable as a template parameter; one can just pass in LinearProbingStrategy instead of PythonProbingStrategy. I also recommend reading the descriptions of the probing strategies.

There might be other reasons why a custom hash table is better here, but those need to be justified a bit more.

Member

Honestly, I was just reading the patch description now. Will have a closer look at how your implementation works soon.

Author
Member

The main reason for using a preallocated vector/array is to allow independent writes to each bucket in a performant way. Does Blender's Set allow preallocation and independent writes?

Author
Member

Just tried using VectorSet<int> and Set<int> for this and it didn't work. It crashes Blender.

Member

You cannot write to these data structures from multiple threads.

Author
Member

hence they are not a suitable recommendation. I've refactored that part into a separate function, so everyone should be happy now ;)


This is still not the best if Blender ends up having 2 separate hash tables, though. Glad to know if this is okay.

Author
Member

If we had a lock-free and wait-free performant hash table in Blender I could substitute it in.
I suggest reviewers actually compile and test the branch to clear any doubt about its correctness.
Then in another PR I can use it to implement a lock-free and wait-free Vector or hash table.

Raul Fernandez Hernandez added 1 commit 2024-04-12 17:01:11 +02:00
Raul Fernandez Hernandez added 1 commit 2024-04-12 19:56:55 +02:00
Raul Fernandez Hernandez requested review from Jacques Lucke 2024-04-12 21:05:56 +02:00
Raul Fernandez Hernandez requested review from Iliya Katushenock 2024-04-12 21:06:45 +02:00
Raul Fernandez Hernandez added 1 commit 2024-04-12 22:00:52 +02:00
Raul Fernandez Hernandez added 1 commit 2024-04-12 22:42:58 +02:00
Iliya Katushenock reviewed 2024-04-12 22:50:38 +02:00
@ -80,0 +80,4 @@
/* Open addressing with linear probing that minimizes
* collission in hash tables can also be used to avoid
* parallel data storage collisions */
static void lock_free_emplace(const int edge_other, const int non_checked, Vector<int> &queue_next)

Thanks for separating this into a small function.
If you don't want to push/drop elements, `Vector<int> &queue_next` -> `MutableSpan<int> queue_next`

Author
Member

Nice tip thanks!

farsthary marked this conversation as resolved
@ -147,0 +160,4 @@
}
Vector<int> queue;
Vector<int> queue_next;

If the next 10 lines of code are not related to queue_next, it's probably better to just move this to the bottom.

farsthary marked this conversation as resolved
@ -209,0 +245,4 @@
});
queue.clear();
for (int i = 0; i < new_size; ++i) {

const int edge_i : IndexRange(new_size)

farsthary marked this conversation as resolved
Raul Fernandez Hernandez added 1 commit 2024-04-13 02:09:43 +02:00
Iliya Katushenock requested changes 2024-04-13 12:16:00 +02:00
Iliya Katushenock left a comment
Member

Can you add a table of benchmarks of the different parts of the code to the description of the PR?
Put simply, I don't like the idea of using threads in a case where you can use a log(N) search. I'm talking about the N^2 search of the nearest element.
Or at least, I'd like to see this as a separate change, with separate benchmarks for different sizes of data.

Next one: lock-free. There is no lock-free here. Currently this PR contains only UB and data races rather than lock-free code.
I see how to fix the hashmap case, but I am not sure about sculpt_geodesic_mesh_test_dist_add.
Please check std::atomic<int> for the hashmap, and prove why sculpt_geodesic_mesh_test_dist_add is acceptable inside a parallel loop.
See as additional resources: https://stackoverflow.com/questions/39393850/is-incrementing-an-int-effectively-atomic-in-specific-cases/39396999#39396999
And: https://en.cppreference.com/w/cpp/atomic/atomic/compare_exchange

@ -80,0 +92,4 @@
idx = 0;
}
}
queue_next[idx] = edge_other;

This is a data race. Any other thread can change queue_next[idx] after the queue_next[idx] != non_checked check.

Author
Member

That's an assumption that there's a data race; in the actual code running, even with heavy meshes, if there were a data race it would show up as unmasked vertices surrounded by masked vertices. I noticed that with previous algorithms where I could for sure trigger a data race.
I couldn't find any error in the masking currently.
Again, I suggest reviewers apply this patch to Blender and test it.


If you add an assert(queue_next[idx] == non_checked) and run in debug mode you'll hit it. The reason it might not show up in the final mask is probably that vertices/edges are reachable from separate "paths", so the chance of "missing" a scheduled edge is lower because of that, but it is not zero.

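The fix the review suggests (std::atomic plus compare-exchange) merges the "is this slot empty?" check and the write into one atomic step, while keeping the PR's preallocated-storage, linear-probing idea. A hedged, illustrative rewrite (names like `non_checked` and `lock_free_emplace` follow the PR, but this is not the actual patch; it assumes the table is sized so it never fills up):

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

constexpr int non_checked = -1;

/* Illustrative sketch, not the actual patch: claim a slot atomically instead of
 * doing a plain read followed by a plain write. Assumes `queue_next` is
 * preallocated large enough that an empty slot always exists. */
static void lock_free_emplace(const int edge_other, std::vector<std::atomic<int>> &queue_next)
{
  const size_t size = queue_next.size();
  size_t idx = size_t(edge_other) % size;
  int expected = non_checked;
  /* Linear probing: advance until we atomically claim an empty slot, or find
   * that another thread already stored this same value. */
  while (!queue_next[idx].compare_exchange_strong(expected, edge_other)) {
    if (expected == edge_other) {
      return; /* Already inserted concurrently; nothing to do. */
    }
    expected = non_checked;
    idx = (idx + 1) % size;
  }
}
```

On failure, `compare_exchange_strong` loads the slot's current value into `expected`, which is what makes the duplicate check race-free.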
farsthary marked this conversation as resolved
@ -172,0 +204,4 @@
if (dists[edge_vert_a] > dists[edge_vert_b]) {
std::swap(edge_vert_a, edge_vert_b);
}
sculpt_geodesic_mesh_test_dist_add(vert_positions,

This is a data race if you call this function in a parallel loop. sculpt_geodesic_mesh_test_dist_add performs read/modify operations on dists in a random-access manner.

farsthary marked this conversation as resolved
Author
Member

Ok, let's drop this PR and leave the old slow code then, since that is what people want.


I don't want to say I don't want to improve the speed, but could you fix the issues I pointed out?

Author
Member

Can you add table of benchmarks of different parts of code to description of PR?
If simple, I don't like idea of using threads in case where you can use log(N) search. I talk about N^2 search of nearest.
Or at least, i'd like to see this as separate change. With separate benchmarks of different sizes of data.

Next one: lock free. Here is no lock free. Currently this PR contains only UB and data races rather than lock free.
I see how to fix hashmap case, but i am not sure about sculpt_geodesic_mesh_test_dist_add.
Please, check std::atomic<int> for hashmap and proof why sculpt_geodesic_mesh_test_dist_add is acceptable to be in parallel loop.
See as additional resources: https://stackoverflow.com/questions/39393850/is-incrementing-an-int-effectively-atomic-in-specific-cases/39396999#39396999
And: https://en.cppreference.com/w/cpp/atomic/atomic/compare_exchange

Thanks for the links! I did try atomics and a guarded threading approach (it's in my commit history) but the overhead made it not worth the change; it eats the performance away to the point of being almost as slow as before.

Author
Member

This was intended to be a stop-gap change to gain some performance while a deeper refactor that eliminates the need to pre-calculate a falloff field for the whole mesh lands, but now I'm even more concerned about whether refactors are even possible anymore for normal contributors.

Author
Member

Thanks everyone for their reviews and opinions.


First of all: a data race is the kind of bug that no one loves to fix. And more important is how the data race is going to be confirmed by bug triagers.
You are planning to be the cause of bug reports that cannot be reproduced intentionally in most cases, and each month some new user will go into such a report to leave their mark and say Blender is bugged..
If the error probability is low, the error can still happen. And if there is no rule that always defines the correct result, the error might occur.

I'm glad that you can't find the issue. A good way to force me to spend a day actually reproducing the bug (a data race! the last one I had was with CoW, and I spent 4 hours just simplifying the bug file! not to trigger it, just to simplify it! hm.. okay, actually the last one was in disjoint... still a very corner case).

But.. I did that.
Here is the diff of 2 calls of geodesic_mesh_create:
[video attachment: 2024-04-13 23-42-39.mp4]

As we can already see, this is not a deterministic algorithm.
And the fact that I can move the mouse and the selection will be different actually points to there being a gradient of the difference.
And as you can see, the texture of the difference looks like an attractor. The more distant it is, the larger it gets, I think.
You can see the diff in the attached file sculpt_topo_diff_of_diff.txt

And now... how this looks in the case I made:
[image attachment]
File sculpt_plane_bug.blend

And how to reproduce:
[video attachment: 2024-04-14 00-32-47.mp4]
Mesh is generated by the geometry nodes group.
And now I cannot reproduce the issue, but it is in main.

I see 2 fair ways to do things:

  1. You want to improve a tool that does not work correctly (has a bug), and you make the bug fix part of the speed improvement.
  2. You see the bug, and fix it separately.

But I don't see the propriety of changing the code to improve the speed while swapping the old bug for a new one.
Even more, I don't see how this can be accepted if you don't want to use real atomic types to protect at least the hash table, and just choose plain int. Yes, exactly, there is the cost of correctness.

And the last thing I noticed: you still have not fixed the BitVector data race that was pointed out by Hans. So, there are 3 data races in just one PR.

In the last half year I have made several PRs related to the Shortest Edge Path node:

  • https://projects.blender.org/blender/blender/pulls/114707
  • https://projects.blender.org/blender/blender/pulls/114683
  • https://projects.blender.org/blender/blender/pulls/114684
  • https://projects.blender.org/blender/blender/pulls/114733

Honestly, I am not really deep into the logic of geodesic path computation, so I can't really describe the details, but I can see some relation with A*.
I ended up at the point where memory bandwidth is the speed limitation; the next improvements can only come from multithreading or assembler. The second is not something realistic, so the first has the most potential.
But this is the wall for me: a multithreaded A* without a lot of memory usage, stable, and without overhead.
If the geodesic field can be solved in the same way, it will probably be the fastest implementation.
For a start, you can try to reduce memory bandwidth in the hot loop and move thread-unrelated operations out of the loop (for example, prepare some data in a parallel loop before the sequential one).
Author
Member

@mod_moder Let me start by thanking you. This is a really convincing explanation with facts I can use to improve upon, and not a battle of egos, which PR reviews tend to degenerate into.
I will reopen the PR and will address the open issues.

Sergey Sharybin requested changes 2024-04-16 10:55:09 +02:00
Sergey Sharybin left a comment
Owner

I was testing this change on a Mac Studio with M2 Ultra (24 cores). The time on the monoHD.blend before this patch is 5.8 seconds to execute geodesic_mesh_create(). With ideal threading code we should be able to lower it down to 240 ms (5.8 / 24).

With the current state of the PR, the timing went down to approximately 800 ms. It is a big change, although it still feels like there is room for improvement.

For the approach itself, I do not think we should be allowing algorithms which rely on race conditions. Ideally, all algorithms we implement in Blender are deterministic, otherwise troubleshooting will become a pure nightmare. There are some exceptions, like GPU rendering, which is not strictly deterministic because samples could be accumulated in different order, but that was a conscious decision.

I've tried a couple of different approaches here, with the general idea of letting threads operate purely on local data, and then somehow accumulating the results of their work. In this case the result of each thread was a Set of new edges it wants to append to the queue.

I've tried to use parallel_reduce which combines Sets. I forgot the exact speed, but it was way slower than the proposed PR. Possibly because of a lot of allocations or other memory access of temporary Sets. Maybe there is a different way of getting it to work, I didn't have much time to try harder.

What I found works best is making it so threads use a Set from TLS, and combining the TLS "manually" after the parallel loop. This seems to avoid most of the overhead, and it actually lowered the overall time of geodesic_mesh_create() to about 300 ms. This is way closer to the ideal scaling, but also a bit suspicious, so a second pair of eyes would be very much welcome.

I've attached the modified file. It is in a bit of a messy state with both the PR and my modifications side-by-side. It should be relatively easy to clean it up, and verify correctness.

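The TLS approach described above can be sketched with standard-library stand-ins (std::thread and std::unordered_set instead of Blender's threading utilities and blender::Set; all names here are hypothetical): each worker fills a private set with no synchronization, and the sets are merged sequentially after the join.

```cpp
#include <algorithm>
#include <thread>
#include <unordered_set>
#include <vector>

/* Hypothetical sketch of thread-local accumulation: each worker collects the
 * edges it wants to schedule into its own private set, and the sets are merged
 * into the next queue only after all workers have joined. */
static std::vector<int> collect_next_front(const std::vector<int> &candidates,
                                           const int num_threads)
{
  std::vector<std::unordered_set<int>> tls(num_threads);
  std::vector<std::thread> workers;
  const size_t chunk = (candidates.size() + num_threads - 1) / num_threads;
  for (int t = 0; t < num_threads; t++) {
    workers.emplace_back([&, t]() {
      const size_t begin = t * chunk;
      const size_t end = std::min(candidates.size(), begin + chunk);
      for (size_t i = begin; i < end; i++) {
        tls[t].insert(candidates[i]); /* Thread-private: no synchronization needed. */
      }
    });
  }
  for (std::thread &w : workers) {
    w.join();
  }
  /* Sequential merge after the parallel phase; duplicates across threads are removed. */
  std::unordered_set<int> merged;
  for (const std::unordered_set<int> &local : tls) {
    merged.insert(local.begin(), local.end());
  }
  return std::vector<int>(merged.begin(), merged.end());
}
```

The merge also deduplicates edges scheduled by several threads, which is what the shared hash table was doing concurrently in the PR.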
@ -2141,6 +2143,11 @@ static int sculpt_expand_invoke(bContext *C, wmOperator *op, const wmEvent *even
return OPERATOR_CANCELLED;
}
const int totvert = SCULPT_vertex_count_get(ss);

Think this could also be split off into a separate PR. Or maybe even landed directly, as it is a very minor and obvious cleanup.

farsthary marked this conversation as resolved
@ -50,3 +50,3 @@
}
BLI_assert(dists[v1] != FLT_MAX);
BLI_assert(dists[v1] != std::numeric_limits<float>::max());

I am not sure how I feel about it. The code becomes much longer, and when we know the type at compile time I am not sure we are benefiting from `numeric_limits`. If the type were templated then sure, we would have to use `numeric_limits`.

To me `FLT_MAX` reads much easier.
But I am also curious what others think of it.

@ -77,16 +77,33 @@ static bool sculpt_geodesic_mesh_test_dist_add(Span<float3> vert_positions,
return false;
}
/* Open addressing with linear probing that minimizes

We do have a line limit of 99. There is no need to wrap lines so early. In a way, I think using all of the available horizontal space makes the comment easier to follow.

collission -> collision

@ -161,1 +186,3 @@
}
/* queue and queue_next should not differ by a huge amount of size
* since they are an advancing front in the mesh hence no need
* to allocate much more memmory and keep the iteration range smaller */

memmory -> memory

Raul Fernandez Hernandez added 1 commit 2024-04-16 11:06:52 +02:00
Author
Member

Thanks @Sergey, I just pushed my changes before seeing your review and code. What I found with my thread-safe implementation is that it is actually slower than the single-threaded one :( Adding atomics for `sculpt_geodesic_mesh_test_dist_add()` and using `tbb::concurrent_vector` imply an overhead that erases the good speed the unsafe code had 😅.
Will check your implementation now.

For comparison on an M1 MBA
ST

Mono
ST Bottleneck: time: 0.225167 seconds
ST Bottleneck: time: 0.203988 seconds
MonoHD
ST Bottleneck: time: 6.95426 seconds <----
Plane
ST Bottleneck: time: 0.252023 seconds

SMT

Mono
SMT Bottleneck: time: 0.271475 seconds
SMT Bottleneck: time: 0.224629 seconds
MonoHD
SMT Bottleneck: time: 10.3282 seconds <------
Plane
SMT Bottleneck: time: 0.281746 seconds

Iliya Katushenock reviewed 2024-04-16 11:21:46 +02:00
@ -52,2 +65,2 @@
BLI_assert(dists[v1] != FLT_MAX);
if (dists[v0] <= dists[v1]) {
BLI_assert(dists[v1].load() != std::numeric_limits<float>::max());
if (dists[v0].load() <= dists[v1].load()) {

Before you start moving forward with this, I'd like to describe why this doesn't work:
`std::atomic::load` is not something magic. The only thing that is protected now is that the compiler and CPU will not reorder the loads of the two floats (for example).
But another thread can still write something into `dists[v1]` between the assert and the condition of the `if` statement.
This is why I linked `atomic/compare_exchange` in my second review.
So, you **have** to check whether the value is what you expect and try to rewrite it. And if the result is false, you **have** to try again.

And this still isn't the whole fix. Some other thread might find a local minimum and write it into `dists[v0]`. The current thread, holding a smaller local minimum, should then overwrite that value.
So now you have to rewrite them too. This is the additional overhead I talked about in the last review.

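The check-then-retry pattern described above, applied to atomically lowering a distance, looks roughly like this sketch (not the PR's code; `atomic_min` is a hypothetical helper). The key point is that `compare_exchange_weak` only stores when the value is still the one we observed; on failure it reloads the current value and the condition is re-checked.

```cpp
#include <atomic>

/* Lower `dist` to `new_dist` if `new_dist` is smaller, retrying if another
 * thread writes to `dist` between our load and our store. */
void atomic_min(std::atomic<float> &dist, const float new_dist)
{
  float current = dist.load();
  while (new_dist < current) {
    /* If another thread changed `dist` since we read `current`, the CAS
     * fails and updates `current` to the freshly observed value, and the
     * loop condition decides whether we still need to write. */
    if (dist.compare_exchange_weak(current, new_dist)) {
      break;
    }
  }
}
```

This removes the load/store race, but as noted in the thread, it does not by itself make the surrounding algorithm deterministic: the final distances still depend on which thread's minimum lands first during intermediate steps.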
Author
Member

@mod_moder You're right, if it is already slower than the single-threaded version now, adding more thread safety will only increase the overhead. When I closed the PR on Saturday it was mainly because I realized the performance gain was at the expense of thread safety, and I had been lucky so far not to encounter a test that exposed the error (even though it looked like a tantrum 😅).

I reopened it later because maybe it could be saved with other data containers Blender already has that I may not be aware of. But I'm getting more convinced that only an algorithmic refactor of the expand operator will bring a true and safe performance gain (and scalability), and that this route of just adding multithreading on top of the existing algorithm is a dead end.

The geodesic calculation is just too dependent on previous state, and not heavy enough to offset the overhead of thread safety.
If all agree I can close it as it is and attempt a better refactor, or try a few more rounds with @Sergey's patch.

Member

@Sergey your code has a bug, unfortunately: instead of `is_edge_scheduled_map.clear()` you need something like `is_edge_scheduled_map.fill(false)`.

I did a little test myself: https://projects.blender.org/blender/blender/compare/main...JacquesLucke:geodesic-speedup-test
It speeds up geodesic_mesh_create from 10.3 ms to 1.5 ms for me (12 cores, 24 threads).

Author
Member

Yup, will give @Sergey's code a shot; if it is fully thread safe then it will be very worth it to save this PR ;)

Member

> This is a data race if you call this function in a parallel loop. `sculpt_geodesic_mesh_test_dist_add` performs read/modify operations on `dists` in a random-access manner.

Hmm, true. Just noticed that too. I don't really have a good solution here. In my previous comment I was also only focused on speeding up the queue. `sculpt_geodesic_mesh_test_dist_add` seems to be a bigger issue. Would need to look into the entire algorithm more to better understand if and how this can be solved.


@JacquesLucke Ah, true, the `clear()` was doing something else. So, as far as the not-so-much-algorithmic changes go, my change indeed "failed".

However, why would we want to schedule an edge again? Intuitively, it leads to extra calculations of things that are already calculated, with some possibility of early output due to things like `dists[ev_other] != FLT_MAX`. So if we do not clear the bitmap, we avoid those extra calculations, lowering the overall time?

And yeah, I did not notice the `sculpt_geodesic_mesh_test_dist_add` issue before you mentioned it. Maybe there is some purely technical exercise of adding atomics that could work there, or maybe we'd need to find a different formulation of the entire algorithm. Not sure yet.

Member

> However, why would we want to schedule an edge again?

I don't know for sure. I assume that's part of the algorithm. The shortest path based on the topology may be longer than the shortest geodesic path. In this case it feels like the same edge may need to be processed more than once.

I think your previous solution actually implemented the behavior where the bits were not cleared. When I tried it, it worked well in some cases, but just didn't work in others.

Raul Fernandez Hernandez added 2 commits 2024-04-21 20:05:24 +02:00
Raul Fernandez Hernandez added 2 commits 2024-04-22 05:07:50 +02:00
Raul Fernandez Hernandez added 1 commit 2024-04-22 05:31:53 +02:00
Author
Member

Ok, based on @Sergey's modified code I've added a safeguard for updating the **dists** array, and it still comes
out ahead of the serial implementation in performance.
@mod_moder please take a look, and if we all agree I can start pushing for final reviews.

Raul Fernandez Hernandez added 1 commit 2024-04-22 13:11:27 +02:00
Jacques Lucke reviewed 2024-04-22 13:24:01 +02:00
@ -74,0 +80,4 @@
* the actual order of update is irrelevant here */
std::atomic<float> &atomic_dist_v0 = reinterpret_cast<std::atomic<float> &>(dists[v0]);
if (atomic_dist_v0.load() > dist0) {
atomic_dist_v0.store(dist0);
Member

Note that there is still a race condition here. Another thread may change `atomic_dist_v0` between the load and the store.
Also, there is undefined behavior in every access of the `dists` array (one thread may read a value from it that another thread is writing at the same time). You might be able to get away with `std::memory_order_relaxed` in a few places though; haven't checked in detail yet.


As I said in #120125 (comment), this is UB simply because of the cast from `float` to `std::atomic<float>`.

Member

It is, but that's not really the core of the problem here currently imo.


Well, that's true. I'm starting to think this is not the best direction to go. It might be better to first research scientific papers on multithreaded geodesic distance fields, and stop trying to just put another synchronization primitive in yet another place.

Author
Member

Closing this PR since these attempts are a dead end, as I previously mentioned. There is no point spending more time on research since the approach of calculating a whole-mesh field does not scale well anyway, even with multithreading.
The expand operator needs a deeper refactor than that.
Thanks everyone for the feedback.

Author
Member

If needed, we can start from scratch in a fresh PR if worthwhile research is found.


Pull request closed

Reference: blender/blender#120125