During geometry nodes evaluation some sockets can be determined
to be unused, for example based on the condition input in a switch node.
Once a socket is determined to be unused, that information has to be
propagated backwards through the tree to free any memory that may
have been reserved for those sockets already. This is happening before
this commit already, but in a less ideal way.
Determining that sockets are unused early is good because it helps with
memory reuse and avoids copy-on-write copies caused by shared data.
Now, nodes that are scheduled because an output became unused have
priority over nodes scheduled for other reasons.
This allows auto-vectorization to happen when the a multi-function is
evaluated in "materialized" mode, i.e. it is processed in chunks where
all input and outputs values are stored in contiguous arrays.
It also unifies the handling input, mutable and output parameters a bit.
Now they all can use tempory buffers in the same way.
This simplifies the code enough so that msvc is able to unroll and
vectorize some multi-functions like simple addition.
The performance improvements are almost as good as the GCC
improvements shown in D16942 (for add and multiply at least).
This mainly helps GCC catch up with Clang in terms of field evaluation
performance in some cases. In some cases this patch can speedup
field evaluation 2-3x (e.g. when there are many float math nodes).
See D16942 for a more detailed benchmark.
This moves all multi-function related code in the `functions` module
into a new `multi_function` namespace. This is similar to how there
is a `lazy_function` namespace.
The main benefit of this is that many types names that were prefixed
with `MF` (for "multi function") can be simplified.
There is also a common shorthand for the `multi_function` namespace: `mf`.
This is also similar to lazy-functions where the shortened namespace
is called `lf`.
* `depends_on_context` was not used for a long time already.
* `param_data_indices` is not used since rB42b88c008861b6.
* The remaining data is moved to a single `Vector` to avoid
having to do two allocations when the size signature becomes
larger than fits into the inline buffer.
This avoids a move of the signature after building it. Tthe value had
to be moved out of `MFSignatureBuilder` in the `build` method.
This also makes the naming a bit less confusing where sometimes
both the `MFSignature` and `MFSignatureBuilder` were referred
to as "signature".
* New `build_mf` namespace for the multi-function builders.
* The type name of the created multi-functions is now "private",
i.e. the caller has to use `auto`. This has the benefit that the
implementation can change more freely without affecting
the caller.
* `CustomMF` does not use `std::function` internally anymore.
This reduces some overhead during code generation and at
run-time.
* `CustomMF` now supports single-mutable parameters.
This refactors how devirtualization is done in general and how
multi-functions use it.
* The old `Devirtualizer` class has been removed in favor of a simpler
solution. It is also more general in the sense that it is not coupled
with `IndexMask` and `VArray`. Instead there is a function that has
inputs which control how different types are devirtualized. The
new implementation is currently less general with regard to the number
of parameters it supports. This can be changed in the future, but
does not seem necessary now and would make the code less obvious.
* Devirtualizers for different types are now defined in their respective
headers.
* The multi-function builder works with the `GVArray` stored in `MFParams`
directly now, instead of first converting it to a `VArray<T>`. This reduces
some constant overhead, which makes the multi-function slightly
faster. This is only noticable when very few elements are processed though.
No functional changes or performance regressions are expected.
The use of `std::variant` allows combining the four vectors
into one which more closely matches the intend and avoids
a workaround used before.
Note that this uses `std::get_if` instead of `std::get` because
`std::get` is only available since macOS 10.14.
Previously, the lifetimes of anonymous attributes were determined by
reference counts which were non-deterministic when multiple threads
are used. Now the lifetimes of anonymous attributes are handled
more explicitly and deterministically. This is a prerequisite for any kind
of caching, because caching the output of nodes that do things
non-deterministically and have "invisible inputs" (reference counts)
doesn't really work.
For more details for how deterministic lifetimes are achieved, see D16858.
No functional changes are expected. Small performance changes are expected
as well (within few percent, anything larger regressions should be reported as
bugs).
Differential Revision: https://developer.blender.org/D16858
Previously, this happened when the "node task" first runs, which might
not actually execute the node if there are missing inputs. Deferring the
allocation of storage and default inputs allows for better memory reuse
later (currently the memory is not reused).
This is part of D16858. Iterating over all field inputs allows us to extract
all anonymous attributes used by a field relatively easily which is necessary
for D16858.
This could potentially be used for better field tooltips for nested fields,
but that needs further investigation.
This was an oversight in rBdba2d828462ae22de5.
The evaluator uses multiple threads to initialize node states
but it is still in single threaded mode.
`get_main_or_local_allocator` did not return the right allocator
in this case.
The geometry nodes evaluator supports "lazy threading", i.e. it starts out
single-threaded. But when it determines that multi-threading can be
benefitial, it switches to multi-threaded mode.
Now it only creates an enumerable-thread-specific if it is actually using
multiple threads. This results in a 6% speedup in my test file with many
node groups and math nodes.
Lazy-function graphs are now evaluated properly even if they contain
cycles. Note that cycles are only ok if there is no data dependency cycle.
For example, a node might output something that is fed back into itself.
As long as the output can be computed without the input that it feeds into,
everything is ok.
The code that builds the graph is responsible for making sure that there
are no actual data dependencies.
a5e7657cee didn't account for slices of zero sizes, and the asserts
were slightly incorrect otherwise. Also, the change didn't apply to
`Span`, only `MutableSpan`, which was a mistake. This also adds "safe"
methods to `IndexMask`, and switches function calls where necessary.
Nodes that were not connected to any output could still impact performance.
While they were never executed, sometimes their inputs could keep references
to geometries that other nodes want to modify. That caused unnecessary geometry
copies, because a geometry can only be modified if it is not shared.
Now, inputs that will never be used are tagged accordingly and they will never
have references to geometries that others might want to modify.
* Support bidirectional type lookups. E.g. finding the base type of a
field was supported, but not the other way around. This also removes
the todo in `get_vector_type`. To achieve this, types have to be
registered up-front.
* Separate `CPPType` from other "type traits". For example, previously
`ValueOrFieldCPPType` adds additional behavior on top of `CPPType`.
Previously, it was a subclass, now it just contains a reference to the
`CPPType` it corresponds to. This follows the composition-over-inheritance
idea. This makes it easier to have self-contained "type traits" without
having to put everything into `CPPType`.
Differential Revision: https://developer.blender.org/D16479
This is the conventional way of dealing with unused arguments in C++,
since it works on all compilers.
Regex find and replace: `UNUSED\((\w+)\)` -> `/*$1*/`
In large node setup the threading overhead was sometimes very significant.
That's especially true when most nodes do very little work.
This commit improves the scheduling by not using multi-threading in many
cases unless it's likely that it will be worth it. For more details see the comments
in `BLI_lazy_threading.hh`.
Differential Revision: https://developer.blender.org/D15976
This refactors the geometry nodes evaluation system. No changes for the
user are expected. At a high level the goals are:
* Support using geometry nodes outside of the geometry nodes modifier.
* Support using the evaluator infrastructure for other purposes like field evaluation.
* Support more nodes, especially when many of them are disabled behind switch nodes.
* Support doing preprocessing on node groups.
For more details see T98492.
There are fairly detailed comments in the code, but here is a high level overview
for how it works now:
* There is a new "lazy-function" system. It is similar in spirit to the multi-function
system but with different goals. Instead of optimizing throughput for highly
parallelizable work, this system is designed to compute only the data that is actually
necessary. What data is necessary can be determined dynamically during evaluation.
Many lazy-functions can be composed in a graph to form a new lazy-function, which can
again be used in a graph etc.
* Each geometry node group is converted into a lazy-function graph prior to evaluation.
To evaluate geometry nodes, one then just has to evaluate that graph. Node groups are
no longer inlined into their parents.
Next steps for the evaluation system is to reduce the use of threads in some situations
to avoid overhead. Many small node groups don't benefit from multi-threading at all.
This is much easier to do now because not everything has to be inlined in one huge
node tree anymore.
Differential Revision: https://developer.blender.org/D15914
Use the newer more generic sampling and interpolation functions
developed recently (ab444a80a2) instead of the `CurveEval` type.
Functions are split up a bit more internally, to allow a separate mode
for supplying the curve index directly in the future (T92474).
In one basic test, the performance seems mostly unchanged from 3.1.
Differential Revision: https://developer.blender.org/D14621
This potentially overallocates buffers so that they are usable
for more data types, which allows buffers to be reused more
easily. That leads to fewer separate allocations and improved
cache usage (in one of my test files the number of separate
allocations went down from 1826 to 1555).
This makes calculation of selected indices slightly faster when the
input is a virtual array (the direct output of various nodes like
Face Area, etc). The utility can be helpful for other areas that
need to find selected indices besides field evaluation.
With the face area node used as a selection with 4 million faces,
the speedup is 3.51 ms to 3.39 ms, just a slight speedup.
Differential Revision: https://developer.blender.org/D15127