BLF: optimizations and fixes to font shader #119653

Merged
Aras Pranckevicius merged 7 commits from aras_p/blender:text-shader-opt into main 2024-03-19 16:29:30 +01:00

A discussion on chat mentioned that, among the shaders that are always initialized on Blender startup, the text/font shader takes the longest to compile (usually only the first time a particular Blender version is run). So this PR tries to simplify/optimize that shader. It runs faster now too, although text rasterization is usually not an actual performance problem.

  • Instead of doing blur via individual bilinear samples (where each sample is 4 texel fetches), do raw texel fetches of the kernel footprint and compute the final result by shifting the kernel weights according to the bilinear fraction weight. For 5x5 blur, this reduces the number of texel fetches from 64 down to 36.
  • Instead of checking "is the texel fetch inside the glyph box? if so, then fetch it", first fetch it, and then set the result to zero if it was outside. This simplifies the branching code flow in the compiled GPU shader.
  • Avoid costly integer modulo/division for "unwrapping" the font texture. The texture width is always a power of two, so division/modulo can be replaced by a mask and a shift. Set up the glyph_tex_size uniform from the CPU side to contain the needed data.
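
As a rough illustration of the last point, here is a minimal Python sketch (not the shader code) of replacing division/modulo with a shift and a mask when the row width is a power of two. The function names and the choice to pass the mask and shift as two separate values are assumptions made for this example only.

```python
def mask_and_shift(width):
    """Return (col_mask, row_shift) for a power-of-two texture width."""
    if width <= 0 or width & (width - 1) != 0:
        raise ValueError("width must be a power of two")
    return width - 1, width.bit_length() - 1


def unwrap_divmod(index, width):
    """Reference version: 1D index -> (column, row) via modulo and division."""
    return index % width, index // width


def unwrap_mask_shift(index, col_mask, row_shift):
    """Same mapping via a bit mask and a shift (for non-negative indices)."""
    return index & col_mask, index >> row_shift
```

For a width of 256 the mask is 255 and the shift is 8, so both versions map index 1000 to column 232, row 3.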

Fixes while I'm at it

  • The 3x3 blur was not doing a 3x3 blur, due to a copy-pasta typo (one of the sample offsets was repeated twice, and thus another sample offset was missing). This has been the case since year 2018.
  • Blur towards left/top edges of the glyphs had artifacts, because float->int casting in GLSL rounds towards zero, but the code actually wanted to round towards floor.
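
The rounding difference behind the second fix can be sketched in Python, whose `int()` drops the fractional part the same way a GLSL float->int cast does. The function names are hypothetical, and the assumption that the affected coordinates are the negative ones just outside the glyph's left/top edge is my reading of the description, not something the shader source is quoted for here.

```python
import math


def texel_truncated(coord):
    """int() drops the fractional part, i.e. rounds toward zero."""
    return int(coord)


def texel_floored(coord):
    """floor() rounds toward negative infinity."""
    return math.floor(coord)
```

The two agree for non-negative coordinates and differ only for negative non-integer ones: a coordinate of -0.25 truncates to texel 0 but floors to texel -1.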

As a result of these fixes, the blur looks a tiny bit different. Not really noticeable for regular text, but here's a screenshot with really scaled up text (at small font size), created from a Python script. White text, plus 5 blur, plus 3 blur both in orange, current main branch:

![](/attachments/77658949-1207-4fef-81a4-39421a769791)

And here's the same in this PR:

![](/attachments/1ad0f1d7-ef05-40a8-a02d-566f72c74308)

First time initialization

  • Windows 10, NVIDIA RTX 3080Ti, OpenGL: 274.4ms -> 51.3ms
  • macOS, Apple M1 Max, Metal: 456ms -> 289ms (I'm including both the GLSL->Metal translation time, as well as time to initialize 3 text related Metal pipeline objects done during startup).

Shader performance/complexity

Performance I only measured on macOS (M1 Max), by making a BLF text that is scaled up to cover most of screen via Python. Using Xcode Metal profiler, drawing that text with 5x5 shadow blur: 1.5ms -> 0.3ms.

There aren't that many tools that can tell how "fast" a particular shader is, especially for OpenGL. Several that I found:

  • AMD Radeon GPU Analyzer, analyzing with "vk-offline" profile for "gfx1100" GPU:
    • ISA code size: 38368 -> 5492
  • ARM Mali Offline Compiler (Blender does not typically run on this GPU, but hey I'll use whatever tools I can find), analyzed for Mali-G76 GPU:
    • Work registers: 64 -> 32
    • Uniform registers: 8 -> 10
    • Stack spilling: 724 bytes -> none!
    • Total instruction cycles: ALU 119.5 -> 6.0, Load/Store 161.0 -> 4.0, Texture 88.0 -> 5.0 (not sure if these cycle reports are correct, feels too large of an improvement)
Aras Pranckevicius added 4 commits 2024-03-19 11:26:48 +01:00
046b692988 BLF: simplify text shader texel_fetch
Avoid very costly integer division/modulo as well as two texelFetch
calls separated by a branch. We know that font texture width
is power of two, so we can replace division/modulo with a shift
and a mask, that is set from the calling code via glyph_tex_size
uniform.

Analyzing the shader with Mali Offline Compiler for Mali-G76 arch:
- Stack spilling: 724 -> 692 bytes
- Total cycles, arithmetic: 119.5 -> 33.2
- Total cycles, load/store: 161.0 -> 119.0
- Total cycles, texture: 88.0 -> 44.0
0a56dcec6f BLF: arithmetic simplifications for text shader
Fold various divisions/multiplications etc.

Analyzing the shader with Mali Offline Compiler for Mali-G76 arch:
- Stack spilling: 692 -> 644 bytes
- Total cycles, arithmetic: 33.2 -> 31.8
- Total cycles, load/store: 119.0 -> 114.0
- Total cycles, texture: unchanged 44.0
06848c1f64 BLF: simplify fetching of 4 samples in texture_1D_custom_bilinear_filter
Always fetch the 4 corners for a bilinear sample, and then set their
values to zero if they are outside the glyph bounds.

Analyzing the shader with Mali Offline Compiler for Mali-G76 arch:
- Stack spilling: 644 bytes -> none!
- Work registers: unchanged 64
- Uniform registers: 10 -> 18
- Total cycles, arithmetic: 31.8 -> 26.2
- Total cycles, load/store: 114.0 -> 0.0
- Total cycles, texture: 44.0 -> 42.0
1e4c089a75 BLF: simplify font blurring shader
Instead of doing manual bilinear (4 samples) for each tap (total
16 texture fetches for 3x3, 64 fetches for 5x5), fetch the (N+1)x(N+1)
raw texels and interpolate with a bilinearly shifted filter kernel.
For 5x5 blur, this is 36 texture samples instead of 64.

Analyzing the shader with Mali Offline Compiler for Mali-G76 arch:
- Work registers: 64 -> 32
- Uniform registers: unchanged 10
- Total cycles, arithmetic: 26.2 -> 5.97
- Total cycles, load/store: 0.0 -> 4.0
- Total cycles, texture: 44.0 -> 5.0

1st time initialization of the shader (Win10, RTX 3080Ti): 51.3ms
(main branch: 274.4ms)
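
The kernel-shifting idea from this commit can be sketched in Python as follows. This is an illustration under assumptions, not the shader's code: it uses the intended 3x3 weights quoted later in the discussion, places one bilinear tap per kernel entry (the shader's actual tap layout differs), and all function names are hypothetical. It shows that summing bilinear taps that share one fractional offset equals a single pass over the (N+1)x(N+1) raw texels with a kernel whose weights are shifted by that fraction.

```python
KERNEL_3X3 = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]


def bilinear(tex, x, y, fx, fy):
    """Manual bilinear sample: 4 texel fetches blended by the fractions."""
    top = tex[y][x] * (1 - fx) + tex[y][x + 1] * fx
    bottom = tex[y + 1][x] * (1 - fx) + tex[y + 1][x + 1] * fx
    return top * (1 - fy) + bottom * fy


def blur_bilinear_taps(tex, x, y, fx, fy, kernel):
    """One bilinear sample per kernel entry: 4 * n * n texel fetches."""
    n = len(kernel)
    return sum(kernel[ky][kx] * bilinear(tex, x + kx, y + ky, fx, fy)
               for ky in range(n) for kx in range(n))


def shifted_kernel(kernel, fx, fy):
    """Spread each weight over its 2x2 bilinear footprint: (n+1) x (n+1) result."""
    n = len(kernel)
    out = [[0.0] * (n + 1) for _ in range(n + 1)]
    for ky in range(n):
        for kx in range(n):
            w = kernel[ky][kx]
            out[ky][kx] += w * (1 - fx) * (1 - fy)
            out[ky][kx + 1] += w * fx * (1 - fy)
            out[ky + 1][kx] += w * (1 - fx) * fy
            out[ky + 1][kx + 1] += w * fx * fy
    return out


def blur_raw_texels(tex, x, y, fx, fy, kernel):
    """Each raw texel of the footprint is fetched exactly once."""
    k = shifted_kernel(kernel, fx, fy)
    m = len(k)
    return sum(k[j][i] * tex[y + j][x + i] for j in range(m) for i in range(m))
```

In this sketch the shifted weights still sum to the same total as the original kernel, so the normalization factor is unchanged.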
Aras Pranckevicius added 2 commits 2024-03-19 13:41:12 +01:00
94c69c0589 BLF: fix sampling artifacts towards top/left edges with blur (exist on main too)
Casting float UV coordinate to int rounds towards zero, but we want rounding
towards negative infinity, i.e. a floor.
f55fb2b4c5 Merge branch 'main' into text-shader-opt
Aras Pranckevicius changed title from WIP: BLF: optimize text shader (mostly for compile time) to WIP: BLF: optimizations and fixes to font shader 2024-03-19 13:47:13 +01:00
Aras Pranckevicius changed title from WIP: BLF: optimizations and fixes to font shader to BLF: optimizations and fixes to font shader 2024-03-19 13:48:42 +01:00
Author
Member

@blender-bot build
Aras Pranckevicius added this to the User Interface project 2024-03-19 13:56:15 +01:00
Clément Foucault requested review from Clément Foucault 2024-03-19 14:28:53 +01:00
Clément Foucault requested changes 2024-03-19 15:08:52 +01:00
Dismissed
@ -15,3 +18,1 @@
return texelFetch(glyph, ivec2(index % size_x, index / size_x), 0).r;
}
return texelFetch(glyph, ivec2(index, 0), 0).r;
/* glyph_tex_size: upper 8 bits is log2 of texture width, lower 24 bits is width-1 */

Does it really help to have `glyph_tex_size` encoded as one uniform? I would rather see two uniforms for clarity and less code on the GLSL side.
Author
Member

Indeed, two uniforms is cleaner
aras_p marked this conversation as resolved
@ -18,0 +18,4 @@
/* glyph_tex_size: upper 8 bits is log2 of texture width, lower 24 bits is width-1 */
int col_mask = glyph_tex_size & 0xFFFFFF;
int row_shift = glyph_tex_size >> 24;
ivec2 uv = ivec2(index & col_mask, index >> row_shift);

We use `texel` for pixel coordinate, otherwise it's confusing. Should definitely be added to the style guide (done).
Author
Member

Ah, good to know! Without being aware of the style guide, I would have guessed that "texel" would refer to actual texel color/value, not "texel location". But if style guide says so, so be it.
aras_p marked this conversation as resolved
@ -27,3 +32,1 @@
vec2 texel_2d = uv * vec2(glyph_dim) + vec2(0.5);
ivec2 texel_2d_near = ivec2(texel_2d) - 1;
int frag_offset = glyph_offset + texel_2d_near.y * glyph_dim.x + texel_2d_near.x;
ivec2 iuv = ivec2(floor(uv)) - 1;

Rename as `texel`.
aras_p marked this conversation as resolved
@ -118,0 +117,4 @@
for (int ix = 0; ix < 4; ++ix) {
int ofsx = ix - 1;
float v = texel_fetch(frag_offset + ofsy * glyph_dim.x + ofsx);
if (!is_inside_box(iuv + ivec2(ofsx, ofsy)))
Always use brackets. See https://developer.blender.org/docs/handbook/guidelines/c_cpp/#braces We also have GLSL guidelines https://developer.blender.org/docs/handbook/guidelines/glsl/ Applies to all this file.
aras_p marked this conversation as resolved
Clément Foucault requested changes 2024-03-19 15:36:35 +01:00
Dismissed
Clément Foucault left a comment
Member

Looking back at your screenshot, the second orange line seems less bright. Why so? And which one is the correct one?
@ -161,0 +168,4 @@
++idx;
}
}
fragColor.a = sum * (1.0 / 80.0);

Why `80` and not `36`?
Author
Member

It is the sum of all the weights. Just like previous code was dividing by 20, not by 16.
fclem marked this conversation as resolved
Author
Member

> Looking back at your screenshot, the second orange line seems less bright. Why so? And which one is the correct one?

Because the 3x3 filter was incorrect due to a copy-paste error. The effective kernel weights were:

2 2 0
3 4 1
1 2 1

instead of what it was trying to do,

1 2 1
2 4 2
1 2 1

So effectively it was over-weighting one corner of the filter, and not taking one texel into account at all. So it is less "blurred" than expected, and hence looks a bit brighter.

Aras Pranckevicius added 1 commit 2024-03-19 16:15:49 +01:00
Aras Pranckevicius requested review from Clément Foucault 2024-03-19 16:16:20 +01:00
Clément Foucault approved these changes 2024-03-19 16:17:35 +01:00
Aras Pranckevicius merged commit a05adbef28 into main 2024-03-19 16:29:30 +01:00
Aras Pranckevicius deleted branch text-shader-opt 2024-03-19 16:29:33 +01:00
Reference: blender/blender#119653