Improve per-object overhead in the Draw module #113771

New Issue

Miguel Pozo · 2023-10-16T13:03:19+02:00

Miguel Pozo commented

2023-10-16 13:03:19 +02:00

Introduction

At the moment, large scenes in Blender are usually CPU-bound.

For testing purposes, I've made a simple scene that instances a triangle 100k times using geometry nodes.
I've profiled it in Workbench mode with all optional features disabled.
The triangles are not even shown in the screen so all the draw calls should be culled-out.
This is what the performance profile looks like:

I think these are the main potential areas for optimization:

Redundant operations for each DupliObject.
Per DupliObject memory allocations/deallocations.
CPU to GPU bandwidth.
Lack of caching between samples/frames.

Here are some possible options for improving per-object overhead in viewport rendering. They're mostly orthogonal to each other and go from the most simple local/isolated improvements to more structural/wider refactors.

Update: Related to-do with similar ideas: #92963

1. Optimize ObjectBounds::sync

We make one BKE_mesh_bounbox_get() call per DupliObject.
Since the runtime.bb reference is removed when creating the dupliobject, calling this function does way more work than it has to, taking a ~7% of the total frame time (~10% if you count the time it takes to deallocate the bounding boxes).

Fixing this can be as simple as storing the Object.data pointer from the last synced handle and restoring a cached ObjectBounds if it's the same.

Update: #96968 and specifically #113465 should heavily improve this issue.

2. Upload less data per object

Over 9% (and more on lower-end hardware) of the frame time is spent in uploading buffer data inside Manager::end_sync.
We could improve this by simply compressing the data we send to the GPU:

Change the object 4x4 matrix to a 3x4 matrix and compute the inverse object matrix in draw_resource_finalize.
Upload only the bounds min/max, and compute the corners and the sphere in draw_resource_finalize.
Move ObjectInfos to engine-specific storage buffers, since none of the properties are used across all engines (OBJECT_NEGATIVE_SCALE could be uploaded along the min/max values).

This would reduce the uploaded data for each object from 272 bytes to just 80.

3. Optimize `DupliObject` generation

Around a third of the frame time is spent iterating the depsgraph.
A good chunk of this time comes from memory allocation/deallocation when generating DupliObjects and computing data we may not need.

A simple solution may be to have an alternative function to DEG_OBJECT_ITER_BEGIN/END that takes a per-object callback instead. So each dupli generated Object could be allocated on the stack and sent to the callback function immediately, instead of having to add it to a linked list for later iteration.

A more time consuming, but potentially better alternative could be to update the engines sync_object function to simply receive the original object and a list of per-dupli data.
That would allow avoiding redundant work, like GPUBatch requests (7% of frame time) or material setups, in a more natural way.

4. Avoid unnecessary re-syncs

While a more granular way of updating objects would be more complex (more on that later), we could "easily" skip re-syncs between samples of the same frame, and between frames as long as there has not been any scene update.
This wont help with playback or editing performance, but it could provide a huge speed boost for viewport navigation and EEVEE super-sampling.

We can simply skip the Manager and engines init/sync functions when possible, and update the engines so they handle per-sample update logic directly in their draw function, just like they already do for image rendering.

Right now we create one handle for each engine instance. So one per viewport and 2x times that once Overlay Next is ready.
We could avoid that overhead by creating the handles on the draw manager side and passing them to the engines sync function.

6. Static/Dynamic object pools

Ideally we should completely skip re-sync for objects that have not been updated.
This is the most complex one, but the good news is that one of the hardest part (indirect drawing) is already done.

This could be achieved by making a distinction between dynamic and static objects.
New/updated objects would always be dynamic, and if they have not changed after N frames, they're are moved to the static pool.
The dynamic object pool is synced every frame, while the static one is only synced when it changes.

The pool logic could be built into the Manager, the MainPass and a new type of special per-object Storage Buffer, so engines don't have to manage this directly.

## Introduction At the moment, large scenes in Blender are usually CPU-bound. For testing purposes, I've made a simple scene that instances a triangle 100k times using geometry nodes. I've profiled it in Workbench mode with all optional features disabled. The triangles are not even shown in the screen so all the draw calls should be culled-out. This is what the performance profile looks like: ![275026427-53a76b84-015d-4f8b-8c5c-d8c58addaa3c](/attachments/2f6d8017-9dc9-47ee-bc78-910a70ead926) I think these are the main potential areas for optimization: * Redundant operations for each `DupliObject`. * Per `DupliObject` memory allocations/deallocations. * CPU to GPU bandwidth. * Lack of caching between samples/frames. Here are some possible options for improving per-object overhead in viewport rendering. They're mostly orthogonal to each other and go from the most simple local/isolated improvements to more structural/wider refactors. *Update:* Related to-do with similar ideas: #92963 ## 1. Optimize ObjectBounds::sync We make one `BKE_mesh_bounbox_get()` call per `DupliObject`. Since the `runtime.bb` reference is removed when creating the dupliobject, calling this function does way more work than it has to, taking a ~7% of the total frame time (~10% if you count the time it takes to deallocate the bounding boxes). Fixing this can be as simple as storing the `Object.data` pointer from the last synced handle and restoring a cached `ObjectBounds` if it's the same. *Update:* #96968 and specifically #113465 should heavily improve this issue. ## 2. Upload less data per object Over 9% (and more on lower-end hardware) of the frame time is spent in uploading buffer data inside `Manager::end_sync`. We could improve this by simply compressing the data we send to the GPU: - Change the object 4x4 matrix to a 3x4 matrix and compute the inverse object matrix in `draw_resource_finalize`. - Upload only the bounds min/max, and compute the corners and the sphere in `draw_resource_finalize`. - Move `ObjectInfos` to engine-specific storage buffers, since none of the properties are used across all engines (`OBJECT_NEGATIVE_SCALE` could be uploaded along the min/max values). This would reduce the uploaded data for each object from 272 bytes to just 80. ## 3. Optimize `DupliObject` generation Around a third of the frame time is spent iterating the depsgraph. A good chunk of this time comes from memory allocation/deallocation when generating `DupliObjects` and computing data we may not need. A simple solution may be to have an alternative function to `DEG_OBJECT_ITER_BEGIN/END` that takes a per-object callback instead. So each dupli generated `Object` could be allocated on the stack and sent to the callback function immediately, instead of having to add it to a linked list for later iteration. A more time consuming, but potentially better alternative could be to update the engines `sync_object` function to simply receive the original object and a list of per-dupli data. That would allow avoiding redundant work, like `GPUBatch` requests (7% of frame time) or material setups, in a more natural way. ## 4. Avoid unnecessary re-syncs While a more granular way of updating objects would be more complex (more on that later), we could "easily" skip re-syncs between samples of the same frame, and between frames as long as there has not been any scene update. This wont help with playback or editing performance, but it could provide a huge speed boost for viewport navigation and EEVEE super-sampling. We can simply skip the Manager and engines init/sync functions when possible, and update the engines so they handle per-sample update logic directly in their draw function, just like they already do for image rendering. ## 5. Share handles across engines/viewports Right now we create one handle for each engine instance. So one per viewport and 2x times that once Overlay Next is ready. We could avoid that overhead by creating the handles on the draw manager side and passing them to the engines sync function. ## 6. Static/Dynamic object pools Ideally we should completely skip re-sync for objects that have not been updated. This is the most complex one, but the good news is that one of the hardest part (indirect drawing) is already done. This could be achieved by making a distinction between dynamic and static objects. New/updated objects would always be dynamic, and if they have not changed after N frames, they're are moved to the static pool. The dynamic object pool is synced every frame, while the static one is only synced when it changes. The pool logic could be built into the `Manager`, the `MainPass` and a new type of special per-object Storage Buffer, so engines don't have to manage this directly.

275026427-53a76b84-015d-4f8b-8c5c-d8c58addaa3c.png

78 KiB

👍 3 ❤️ 3 🚀 4

Miguel Pozo added the

Download

What's New

Blender Studio

Manual

Developers Blog

Documentation

Benchmark

Blender Conference

Development Fund

One-time Donations

Improve per-object overhead in the Draw module #113771