[regression] OpenCL performance becomes very random with big scenes. #53249

Closed
opened 2017-11-04 08:33:00 +01:00 by mathieu menuet · 27 comments

System Information
Win7 x64, Vega64, driver 17.10.2

Blender Version
Broken: e8daf2e
Worked: 2.79

Short description of error
When rendering the same frame of the same scene multiple times with OpenCL in the same instance of Blender, render times vary from 100% to >300%. Restarting Blender and re-rendering gives the same time as the first render.

Exact steps for others to reproduce the error

- Start the latest master build of Blender with --debug-cycles.
- Open the Victor scene from the official benchmark pack (https://download.blender.org/demo/test/cycles_benchmark_20160228.zip), render with the GPU (tile size 64x64 for example) many times and note the times without synchronization.

I did it on a small border render part at low resolution to get about 20 tiles of 64x64 with fur and grass. The first render after (re)starting was always 29 seconds; the following ones were random, between 45 and 102 seconds.
The same test with 2.79 gave 145.6 seconds +/-1% on 3 consecutive renders without restart, so the bug appeared after 2.79. Maybe something was changed in allocation or global size calculation?

Author

Changed status to: 'Open'

Author

Added subscriber: @bliblubli

Author

Note that in all the above cases, system memory is used because the scene doesn't fit in the dedicated 8 GB of memory. So it doesn't seem to come from how the driver allocates memory between dedicated and system memory.
Also, the power usage of the GPU was reduced to ensure no throttling happens; frequencies were stable during all the tests.

Mai Lavelle was assigned by mathieu menuet 2017-11-04 08:37:58 +01:00
Author

Added subscribers: @Sergey, @brecht

mathieu menuet changed title from OpenCL performance becomes very random with big scenes. to [regression] OpenCL performance becomes very random with big scenes. 2017-11-04 09:49:13 +01:00

I thought Victor only rendered on AMD after ec8ae4d, not yet in the 2.79 release?

In any case, we never added explicit support for using system memory, and leave it totally up to the driver to decide which memory to move where. If it decides to e.g. put image textures in VRAM and keep the BVH or tile render buffers in system memory, that could cause big performance differences.
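
For illustration only, here is a minimal sketch of how a host application can hint buffer placement at creation time (standard OpenCL flags; this is not the actual Cycles allocation code, and the final placement remains up to the driver either way):

```
#include <CL/cl.h>

/* Default allocation: the driver is free to place this wherever it likes,
 * typically VRAM while everything still fits. */
cl_mem make_device_preferred(cl_context ctx, size_t bytes, cl_int *err)
{
	return clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, err);
}

/* CL_MEM_ALLOC_HOST_PTR asks for host-accessible backing store; drivers
 * commonly keep such buffers in system RAM, leaving VRAM for other data. */
cl_mem make_host_preferred(cl_context ctx, size_t bytes, cl_int *err)
{
	return clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
	                      bytes, NULL, err);
}
```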

I guess the first step would be to git bisect to where the problem started. I don't have an AMD card with HBCC support though. Testing with the OpenCL context cache disabled could give some clues as to whether it's something in the context that leaks or has an unintended lasting effect, or whether it's something else.

Author

ec8ae4d5e9 only added support for more than 4GB of textures, iirc.
You don't need HBCC support. On Win7, even on Vega, there is no HBCC, and my RX480 has also been rendering the full Victor scene for a year on Windows and for some months on Linux.
@brecht Is there a simple command to disable the context caching?
I could try to bisect, but @MaiLavelle should have better guesses about what could have introduced this bug. The scene preparation for Victor takes more than 2 minutes on my computer. With the compile time on Windows on top, shooting in the dark to bisect would take a lot of time.

Author

Just to give an idea of the mess that bisecting is:

  • CUDA completely disables OpenCL in the majority of revisions, so you have to rebuild without CUDA.
  • Device selection changed, so the user preferences have to be modified depending on the revision you test, and bisecting requires going back and forth in time.
  • Kernel compilation takes 1min50 for Victor.
  • Scene preparation takes 2min04.

So each step takes about 5 minutes of VS compile, then manual tweaks to the user preferences, then a 2-minute kernel compile, then 2 renders at 2 (scene prep) + 2 (render) = 8 minutes of rendering. That's a quarter of an hour with 4 user interventions, between which you can't do much.

So here is my contribution after an hour of work: the bug was already there on 29.09.2017.

Author

The bug was already there on 24.08.2017, so my guess is that ec8ae4d5e9 is the commit we are looking for.

Author

This comment was removed by @bliblubli

Author

I got some explanations on IRC; sorry, I didn't know the whole story.

Author

commit b53e35c655 already has the bug, so it's not due to the buffer patch.

Author

Actually, 2.79 has the bug too; only the official build had the device selection bug and took the 1080Ti instead, which doesn't use system memory.
So it may be a driver bug, but then why is the first render always 30 seconds?
After some renders, I got up to 114 seconds to render, nearly 3x slower. At that point, however, the GPU was idling a lot, maybe waiting all the time for system memory access?
Here is a picture of the task manager with 2 consecutive renders in the same instance of Blender.
![bug victor OpenCL.png](https://archive.blender.org/developer/F1096521/bug_victor_OpenCL.png)
It may be a coincidence, but VS2013 builds had only +/-10% between the first and consecutive renders (I made 5 of them), while VS2015 builds go crazy with up to 3x the render time.
It would be great if someone could test on Linux with an RX480 to see if GCC or the Linux driver handles this differently. As said before, the RX480 can render this scene. On Linux, the Nvidia drivers destroy a part of the AMD driver and I couldn't find a solution to have both drivers side by side yet.

Member

Added subscriber: @LazyDodo

Member

It could be interesting to look at the output of GPU-Z to see what the card's memory is actually doing in between those two runs.


Thanks for the tests! To be clear I'm not expecting anyone to work on this bug, and if no one else does I'll probably do it at some point, but the work is certainly helpful.

The graph from the task manager is interesting. It only shows host memory so it doesn't give the whole picture, but it does look like there is no significant host memory leak after the first render. The profiles are similar for both renders; only at the start of the second render does there seem to be an extra bump. Perhaps we can spot a corresponding allocation in the output of running with `--debug-cycles`.

If not I guess it's something internal in the OpenCL driver, for which I don't think there is any debug output we can look at? Maybe it's the driver deciding to migrate some device memory back to host memory, possibly memory that we leaked from the previous render? I couldn't spot any memory leaks in the OpenCL device code, and it's not clear to me what exactly could be hanging around in the context.

Here's a patch to disable the context cache: [P555](https://archive.blender.org/developer/P555.txt).

Eventually we should probably use `clEnqueueMigrateMemObjects` to explicitly tell the drivers which buffers should go on the device and which on the host. But I expect there's some other issue going on here.
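
For reference, a minimal sketch of what such an explicit migration call might look like (assuming a valid OpenCL 1.2 command queue and buffer handle; this is not the actual Cycles code):

```
#include <CL/cl.h>

/* Ask the driver to move `buf` to the device that `queue` was created for.
 * Passing CL_MIGRATE_MEM_OBJECT_HOST as the flags argument would instead
 * request migration to host memory. Waiting for completion is left out. */
cl_int migrate_to_device(cl_command_queue queue, cl_mem buf)
{
	return clEnqueueMigrateMemObjects(queue,
	                                  1, &buf,        /* one memory object */
	                                  0,              /* flags = 0: migrate to the queue's device */
	                                  0, NULL, NULL); /* no wait list, no completion event */
}
```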

Author

@LazyDodo the GPU-Z log is wrong somehow; it ignores half of the memory. But it gives the impression that no memory leak happens on the GPU.
![victor.png](https://archive.blender.org/developer/F1098121/victor.png)

@brecht here is a log of 2 consecutive renders:
[victor.log](https://archive.blender.org/developer/F1098127/victor.log)

Contrary to GPU-Z, here Cycles reports that the free memory is different and calculates a very different global size, which is known to impact performance a lot. The strange thing is that the second render reports more free memory (about 4 GB against 1 GB). It results in a bigger global size, which should speed up the rendering, but as most of the data is then in system memory, it waits most of the time.

So it looks like the second time, for some reason, the driver decides to put some buffers in system memory. However, if we compare with the task manager graph, the memory usage difference between the first and second render is more in the +8 GB range, while Cycles reports only 3 GB more (from 1 to 4 GB) as free on the GPU...

If someone has a direct wire to the AMD driver team, it would be great to tell them about this bug.

Author

@brecht thanks for [P555](https://archive.blender.org/developer/P555.txt). I tried it, but the bug is still there.


Where is the "Free mem AMD" print coming from? I can't find that code in master or earlier revisions. In master, the split kernel global size is determined by `max_buffer_size` and `num_elements`, which from the logs don't appear to change. Yet the global size is reported as being different.

In any case, the split kernel global size should not be affected by the amount of free memory on the device I think, at least in the current code. If there is not enough space to fit both the scene and working memory, then there is a trade-off between using host memory for the scene and using more working memory. But it's difficult to predict which is better, and if we are going to predict it then we need to do much more careful memory usage accounting to get accurate numbers for scene and working memory (see [D2056](https://archive.blender.org/developer/D2056) for difficulties with that; for the split kernel it gets more complicated).
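
To illustrate the relationship described above, here is a rough sketch of how a global size could be derived purely from a buffer budget rather than from free device memory (the names such as `bytes_per_state` are hypothetical, not the actual Cycles helpers):

```
#include <algorithm>
#include <cstddef>

/* The number of in-flight path states (the global size) follows from the
 * largest single buffer allocation the device allows and the per-state size,
 * not from how much device memory happens to be free at that moment. */
size_t global_size_from_budget(size_t max_buffer_size,
                               size_t bytes_per_state,
                               size_t work_group_size)
{
	size_t num_elements = max_buffer_size / bytes_per_state;
	/* Round down to a whole number of work groups. */
	num_elements = (num_elements / work_group_size) * work_group_size;
	return std::max(num_elements, work_group_size);
}
```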

Author

> In #53249#469559, @brecht wrote:
> Where is the "Free mem AMD" print coming from? I can't find that code in master or earlier revisions. In master, the split kernel global size is determined by `max_buffer_size` and `num_elements`, which from the logs don't appear to change. Yet the global size is reported as being different.

Yes, I used another version to get the free memory reported and tried to see if limiting the global size to make it all fit in memory would solve the problem, but it didn't. I can redo the log with vanilla master if you want. Here is the code:

		VLOG(1) << "Maximum device allocation size: "
		        << string_human_readable_number(max_buffer_size) << " bytes. ("
		        << string_human_readable_size(max_buffer_size) << ").";

		/* Limit to 2gb, as we shouldn't need more than that and some devices may support much more. */
		*max_buffer_size = min(max_buffer_size / 2, (cl_ulong)2l*1024*1024*1024);* size_t num_elements = max_elements_for_max_buffer_size(kg, data, max_buffer_size);
		cl_ulong free_mem_amd = 0;
		if(clGetDeviceInfo(device->cdDevice, CL_DEVICE_GLOBAL_FREE_MEMORY_AMD, sizeof(cl_ulong), &free_mem_amd, NULL) == CL_SUCCESS) {
			free_mem_amd *= 1024;
			VLOG(1) << "Free mem AMD: "
			        << string_human_readable_number(free_mem_amd) << " bytes. ("
			        << string_human_readable_size(free_mem_amd) << ").";
			if(max_buffer_size > free_mem_amd) {
				max_buffer_size = free_mem_amd;
			}
		}

The code is from @nirved-1.

> In any case, the split kernel global size should not be affected by the amount of free memory on the device I think, at least in the current code. If there is not enough space to fit both the scene and working memory, then there is a trade-off between using host memory for the scene and using more working memory. But it's difficult to predict which is better, and if we are going to predict it then we need to do much more careful memory usage accounting to get accurate numbers for scene and working memory (see [D2056](https://archive.blender.org/developer/D2056) for difficulties with that; for the split kernel it gets more complicated).

Each scene will certainly have its own optimal memory layout. Some scenes have very simple materials/textures but a heavy BVH, some the opposite, etc. Wouldn't it be possible to do something a bit like PGO: have an "optimize" button that renders the scene in a special mode, putting the different buffers/textures in different places and saving the timings of the different layouts, then writing the fastest layout somewhere in the scene's custom data? That layout would then be used for all renders of that scene. Of course, it may have to be updated later if big changes are made, but most of the time it will be used once before sending the scene to the render farm.

Author

Added subscriber: @nirved-1


Ok, if the code was modified then indeed a new log would be useful.

PGO gives a poor user experience and is impractical; there are too many combinations to test. We can almost certainly find automatic algorithms that are good enough, just no one has tried yet. For example, something like [P556](https://archive.blender.org/developer/P556.txt) or some variation of it could help keep working memory on the device. Still, I don't think we understand the actual issue here, so it's difficult to know what the fix is.

It is not clear how to interpret `CL_DEVICE_GLOBAL_FREE_MEMORY_AMD` exactly. For example the OS or OpenGL might be using some device memory which the driver can migrate to the host (or discard) to make room for running the OpenCL kernel. So if the driver does that kind of thing, then the second run of the OpenCL kernel may report more free memory, after all memory from the first run was freed. But it doesn't necessarily mean that more memory is actually available.

Author

@brecht thanks for the patch. Latest master with it (I had to apply it manually as it seems it was made on a branch?) gives this log on 3 consecutive renders:
[victor_P556.log](https://archive.blender.org/developer/F1099347/victor_P556.log)

Author

And here is the log with the latest buildbot:
[victor_8a72be7.log](https://archive.blender.org/developer/F1099398/victor_8a72be7.log)

Author

[P556](https://archive.blender.org/developer/P556.txt) seems to limit the slowdown to about 68 seconds from 48, while the latest buildbot 8a72be7 goes up to 78 seconds from 45 seconds, and its slowdown grows with each new render.

Author

I rechecked with VS2013 builds. The system memory usage varies only a bit (max 500 MB, compared to many GB with 2015) and the performance is also more stable (max 35% variation during 10 renders).
Could someone confirm these behaviours on Windows and test on Linux?


Changed status from 'Open' to: 'Archived'


Archiving old report. It may be possible to improve performance for out of core OpenCL renders, but it was never an officially supported feature and I would consider it outside the scope of the bug tracker.
