Cycles does not generate the exact same images when a scene is rendered twice #101726

Sebastian Herholz · 2022-10-10T15:55:03+02:00

Sebastian Herholz commented

2022-10-10 15:55:03 +02:00

**System Information**
Operating system: Ubuntu 20.04
Graphics card: Nvidia 2070 Super

**Blender Version**
Broken: master
Worked:

**Short description of error**

The results generated by Cycles are not 100% deterministic.
As a consequence, path guiding (#92571) cannot be implemented deterministically.

**Exact steps for others to reproduce the error**

1. Start Blender and open a scene like `monster`.
2. Render the scene with 64 spp and store the result as an EXR image (e.g., monster-run-0.exr).
3. Repeat steps 1 and 2 and save the result as an EXR image again (e.g., monster-run-1.exr).
4. Use an image comparison tool such as tev (https://github.com/Tom94/tev) and compute the difference.

You will see that, even though both renderings were performed on the same machine, the resulting images have minor differences.
Note: It might be necessary to scale the diff images to see the errors.

Run 0:
![Screenshot from 2022-10-10 15-50-37.png](https://archive.blender.org/developer/F13649368/Screenshot_from_2022-10-10_15-50-37.png)

Run 1:
![Screenshot from 2022-10-10 15-50-41.png](https://archive.blender.org/developer/F13649371/Screenshot_from_2022-10-10_15-50-41.png)

Diff:
![Screenshot from 2022-10-10 15-51-15.png](https://archive.blender.org/developer/F13649373/Screenshot_from_2022-10-10_15-51-15.png)

Based on the default startup or an attached .blend file (as simple as possible).
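For readers without an image comparison tool at hand, the comparison step can be sketched in a few lines. This is only an illustration: the helper name `scaled_abs_diff`, the scale factor, and the pixel values are hypothetical, and loading real EXR files is deliberately left out.

```python
def scaled_abs_diff(image_a, image_b, scale=1000.0):
    """Per-pixel absolute difference, multiplied by `scale` for visibility."""
    if len(image_a) != len(image_b):
        raise ValueError("images must have the same number of pixels")
    return [abs(a - b) * scale for a, b in zip(image_a, image_b)]


# Hypothetical pixel values standing in for two renders of the same scene.
run0 = [0.25, 0.5, 0.75, 1.0]
run1 = [0.25, 0.5, 0.75 + 1e-5, 1.0]

diff = scaled_abs_diff(run0, run1)

# Indices of pixels that are not bit-for-bit identical between the runs.
differing = [i for i, d in enumerate(diff) if d > 0.0]
print(differing)
```

A fully deterministic renderer would leave `differing` empty; the scale factor only matters for making tiny differences visible.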

Sebastian Herholz commented

2022-10-10 15:55:03 +02:00

Added subscriber: @sherholz

Brecht Van Lommel was assigned by Sebastian Herholz

2022-10-10 15:55:47 +02:00

Omar Emara commented

2022-10-10 16:28:44 +02:00

Added subscriber: @OmarEmaraDev

Omar Emara commented

2022-10-10 16:28:44 +02:00

Changed status from 'Needs Triage' to: 'Needs Developer To Reproduce'

Omar Emara commented

2022-10-10 16:28:44 +02:00

I can also reproduce on the BMW scene on CPU. Not sure if the module considers this a bug though. So tagging the module for more information.

Sebastian Herholz commented

2022-10-13 12:10:20 +02:00

The problem here is that when Cycles is not 100% deterministic, it will generate different training samples for path guiding at every run.
As a result, the guiding structure will always be different, as well as the sampling behavior (starting at the 2nd spp), and therefore
the results of two renderings of the same scene will have a completely different noise pattern.

In production, and according to @brecht, this is not acceptable.


Sebastian Herholz commented

2022-10-13 13:24:21 +02:00

I did a little bit more debugging.

By adding some code to print out each path vertex (e.g., position, normal, random number, outgoing direction after BSDF sampling), [P3250](https://archive.blender.org/developer/P3250.txt),
I was able to compare two runs (1 spp, single-threaded, at a small resolution, and with path guiding disabled) of a modified version of the monster scene.

https://1drv.ms/u/s!At4sZlTrZ-QKigYGyeU2_Jc5sbSF?e=B0h6tF

[monster_small-0.log](https://archive.blender.org/developer/F13672829/monster_small-0.log)
[monster_small-1.log](https://archive.blender.org/developer/F13672830/monster_small-1.log)

A diff of the output shows that 99.9% of the path segments are the same, and only a tiny fraction differs.

![Screenshot from 2022-10-13 13-10-52.png](https://archive.blender.org/developer/F13672815/Screenshot_from_2022-10-13_13-10-52.png)

It seems that in most cases, the divergence starts with a tiny difference in the normal, which leads to a slightly different outgoing direction, and so on.
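A comparison of two such logs can be sketched as follows. The log line format and the helper names `first_divergence` and `matching_fraction` are hypothetical and do not reflect the actual output of P3250; the point is only to locate where two runs stop agreeing.

```python
def first_divergence(log_a, log_b):
    """Return the index of the first differing line, or None if none differ."""
    for index, (a, b) in enumerate(zip(log_a, log_b)):
        if a != b:
            return index
    return None


def matching_fraction(log_a, log_b):
    """Fraction of line pairs that are identical."""
    pairs = list(zip(log_a, log_b))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


# Hypothetical per-vertex log lines for one path in two runs.
run0 = [
    "v0 N=(0.0000000, 0.0000000, 1.0000000)",
    "v1 N=(0.5773503, 0.5773503, 0.5773503)",
    "v2 N=(0.0000000, 1.0000000, 0.0000000)",
    "v3 N=(1.0000000, 0.0000000, 0.0000000)",
]
run1 = [
    "v0 N=(0.0000000, 0.0000000, 1.0000000)",
    "v1 N=(0.5773503, 0.5773502, 0.5773503)",  # tiny difference in the normal
    "v2 N=(0.0000000, 0.9999999, 0.0000000)",  # later vertices differ as well
    "v3 N=(0.9999999, 0.0000000, 0.0000000)",
]

print(first_divergence(run0, run1))
print(matching_fraction(run0, run1))
```

Reporting the first differing vertex per path, rather than every differing line, makes it easier to see which quantity changes first.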

Sebastian Herholz commented

2022-10-13 13:47:55 +02:00

I have now tested multiple versions of Blender (3.0.1 and 3.1.2), and it seems that
this happens in all versions but is way less prominent in 3.0.1:

3.0.1:
![Screenshot from 2022-10-13 13-45-35.png](https://archive.blender.org/developer/F13672927/Screenshot_from_2022-10-13_13-45-35.png)

3.1.2:
![Screenshot from 2022-10-13 13-45-43.png](https://archive.blender.org/developer/F13672929/Screenshot_from_2022-10-13_13-45-43.png)

Sebastian Herholz commented

2022-10-13 15:30:17 +02:00

I believe I have identified 3 problematic regions, though probably not all of them:

- the normal at the intersection
- random numbers
- BSDF sampling

In all of these parts, the output values can differ slightly between runs.
To test that, I did a dirty hack and quantized the outputs to 4 floating-point digits:
[P3251](https://archive.blender.org/developer/P3251.txt)

The behavior is not perfect but is now similar to 3.0.1:

![Screenshot from 2022-10-13 15-27-37.png](https://archive.blender.org/developer/F13673232/Screenshot_from_2022-10-13_15-27-37.png)

@brecht I hope that helps.
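The idea behind the quantization hack can be sketched like this. It is not the code from P3251: the helper name `quantize`, the use of decimal rounding, and the sample values are assumptions made for illustration.

```python
def quantize(value, digits=4):
    """Round a float to a fixed number of decimal digits (illustrative only)."""
    return round(value, digits)


# Hypothetical normals from two runs that differ only in the last digits.
normal_run0 = (0.57735026, 0.57735026, 0.57735030)
normal_run1 = (0.57735032, 0.57735020, 0.57735026)

quantized0 = tuple(quantize(c) for c in normal_run0)
quantized1 = tuple(quantize(c) for c in normal_run1)
print(quantized0 == quantized1)  # the small differences are rounded away

# Values on either side of a rounding boundary still end up different,
# which may be one reason the behavior is not perfect.
print(quantize(0.12344999), quantize(0.12345001))
```

Rounding hides most small differences but cannot remove them all, because two nearly equal inputs can always fall on opposite sides of a rounding boundary.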

Brecht Van Lommel commented

2022-10-13 20:32:38 +02:00

I'm seeing an exact match in the monster scene when running `./blender -t 1`. I suspect multi-threading in normal or tangent calculation, doing atomic float adds in undefined order. I suspect the differences in random numbers and BSDF sampling may be indirect consequences of different normals earlier in the path, though there may be other unexplained factors.

I think these kinds of differences are fairly acceptable by themselves since they are quite localized, though not ideal. For OpenPGL, does this lead to completely different noise patterns over the entire image, or is it more localized?

I've wanted to store normals and tangents in some compressed/quantized way to save memory, which may indirectly help with this, but it would be an unreliable workaround at best. In general, multi-threading in geometry nodes may not generate bit-for-bit matching results for positions or any attributes unless it was carefully implemented to avoid this. So I'm not sure if there is a practical and complete solution to this.
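Why the order of float additions matters can be shown with a small sketch. Floating-point addition is not associative, so accumulating the same contributions in a different order can give a different result. The values below are hypothetical, and the sketch uses Python doubles rather than the single-precision floats and atomics of the actual mesh code.

```python
import itertools


def accumulate(values):
    """Sum values left to right, as one particular thread ordering would."""
    total = 0.0
    for value in values:
        total += value
    return total


# Hypothetical per-face contributions to one component of a vertex normal.
contributions = [0.1, 0.2, 0.3]

# Every permutation stands for one possible order in which threads
# could perform their atomic adds.
distinct_sums = {accumulate(order) for order in itertools.permutations(contributions)}
print(sorted(distinct_sums))
```

More than one distinct sum appears even for three values, so a vertex shared by several faces can receive a slightly different normal depending on thread scheduling.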

Raimund Klink commented

2022-10-14 11:17:02 +02:00

Added subscriber: @Raimund58

Sebastian Herholz commented

2022-10-14 12:00:43 +02:00

I can verify that starting Blender with `blender -t 1`, instead of just setting the rendering to single-threaded via `Performance->Thread->Thread Mode = fixed` and `Performance->Thread->Threads = 1`,
generates the same result.
This strengthens @brecht's theory that it is related to some multi-threaded pre-processing step (e.g., normal or tangent calculations).

@brecht the effect on path guiding can be big. While in the first rendering iteration only a small set of samples will differ, they still lead to a different guiding structure.
In the second iteration, this slightly different structure leads to more variation in the samples for the second training iteration, and so on.
The difference gets larger with every training iteration.

Here is an example with 32 spp (Note: this time, I didn't even need to scale the error):

![Screenshot from 2022-10-14 11-55-51.png](https://archive.blender.org/developer/F13677089/Screenshot_from_2022-10-14_11-55-51.png)

At the moment, the determinism of path guiding is pretty unreliable.
Depending on the scene, it might work or it might not.

One interesting fact is that this behavior was way less prominent in 3.0.1. Was there a change in the way the normals and tangents are pre-processed?

Sebastian Herholz commented

2022-11-03 14:33:45 +01:00

Added subscriber: @LukasStockner

Sebastian Herholz commented

2022-11-03 14:34:45 +01:00

@brecht I had a chat with @LukasStockner at BCON; he might have some ideas about where this comes from.

Lukas Stockner commented

2022-11-13 17:33:11 +01:00

Looks like the two sources of non-determinism are `BKE_mesh_calc_normals_poly_and_vertex` and `Mikktspace::generateTSpaces`. If you disable parallelism in both of those, the data buffers being copied to the device end up being identical between renders, and so do the rendered outputs.

And yes, @brecht got it right, both of those functions are doing atomic floating-point accumulations.

In theory it would probably work to do the accumulation either in fixed-point precision (which would honestly be fine for normals/tangents since they're bound to the -1..1 range anyways, and would even let us avoid the atomic CAS tricks that are needed for floats) or in double floating-point precision. Not sure how practical either of those are.
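The fixed-point idea can be sketched as follows. Integer addition is associative and commutative, so the accumulated value does not depend on the order of the adds. The scale of 20 fractional bits, the helper names, and the contribution values are assumptions for illustration, not a proposal for the actual mesh code.

```python
import itertools

# Assumed fixed-point format: 20 fractional bits (hypothetical choice).
SCALE = 1 << 20


def to_fixed(value):
    """Convert a float in roughly the -1..1 range to a fixed-point integer."""
    return round(value * SCALE)


def accumulate_fixed(values):
    """Accumulate in integers, then convert back to float once at the end."""
    total = 0
    for value in values:
        total += to_fixed(value)
    return total / SCALE


# Hypothetical per-face contributions to one component of a vertex normal.
contributions = [0.1, 0.2, 0.3]

# Every permutation stands for one possible order of the atomic adds.
fixed_sums = {
    accumulate_fixed(order) for order in itertools.permutations(contributions)
}
print(fixed_sums)
```

Python integers do not overflow, so this sketch says nothing about how much headroom a 32-bit accumulator would need when many faces contribute to one vertex.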

Brecht Van Lommel commented

2022-11-14 14:12:52 +01:00

Fixed precision would be good to try, though I have not worked out whether there would be problems with high vertex valence or with angle weighting for small and large angles.

Brecht Van Lommel removed their assignment 2023-02-08 03:35:34 +01:00

Philipp Oeser removed the
