Instead of doing a manual bilinear sample (4 fetches) for each tap
(16 texture fetches in total for 3x3, 64 for 5x5), fetch the
(N+1)x(N+1) raw texels and weight them with a bilinearly shifted
filter kernel. For the 5x5 blur, this is 36 texture fetches instead
of 64.
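As a rough illustration of why the two approaches give the same result,
here is a minimal Python sketch (not the shader code). It assumes a
uniform N x N box kernel with one bilinear tap per kernel cell, so its
per-tap fetch count (4*N*N) does not reflect the tap layout behind the
figures quoted above; only the (N+1)x(N+1) count carries over. Texture,
blur_per_tap, shifted_kernel and blur_shifted are hypothetical names.

```python
class Texture:
    """Stand-in for a texture: stores texel rows and counts raw fetches."""

    def __init__(self, rows):
        self.rows = rows
        self.fetches = 0

    def fetch(self, x, y):
        self.fetches += 1
        return self.rows[y][x]


def bilinear(tex, x, y, fx, fy):
    """Manual bilinear sample at texel (x, y) plus fraction (fx, fy): 4 fetches."""
    top = (1.0 - fx) * tex.fetch(x, y) + fx * tex.fetch(x + 1, y)
    bottom = (1.0 - fx) * tex.fetch(x, y + 1) + fx * tex.fetch(x + 1, y + 1)
    return (1.0 - fy) * top + fy * bottom


def blur_per_tap(tex, x, y, fx, fy, n):
    """Assumed reference: n*n bilinear taps with uniform weights."""
    total = 0.0
    for j in range(n):
        for i in range(n):
            total += bilinear(tex, x + i, y + j, fx, fy)
    return total / (n * n)


def shifted_kernel(f, n):
    """1D box kernel of width n convolved with the bilinear weights (1-f, f)."""
    return [1.0 - f] + [1.0] * (n - 1) + [f]


def blur_shifted(tex, x, y, fx, fy, n):
    """Fetch each of the (n+1)*(n+1) raw texels once and apply shifted weights."""
    wx = shifted_kernel(fx, n)
    wy = shifted_kernel(fy, n)
    total = 0.0
    for j in range(n + 1):
        for i in range(n + 1):
            total += wx[i] * wy[j] * tex.fetch(x + i, y + j)
    return total / (n * n)
```

With n = 5, blur_shifted touches 6x6 = 36 texels and returns the same
value as blur_per_tap up to floating-point rounding.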
Analyzing the shader with Mali Offline Compiler for Mali-G76 arch:
- Work registers: 64 -> 32
- Uniform registers: unchanged 10
- Total cycles, arithmetic: 26.2 -> 5.97
- Total cycles, load/store: 0.0 -> 4.0
- Total cycles, texture: 44.0 -> 5.0
First-time initialization of the shader (Win10, RTX 3080Ti): 51.3 ms
(main branch: 274.4 ms)
Always fetch all 4 corners of a bilinear sample, then set the value
of any corner that lies outside the glyph bounds to zero.
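The idea can be sketched in Python as follows. This is a stand-in for
the shader logic, not the actual code: fetch_masked and sample_bilinear
are hypothetical names, the glyph is modeled as a flat row-major list,
and the address clamp exists only so the Python list access stays valid.

```python
import math


def fetch_masked(glyph, width, height, x, y):
    """Fetch unconditionally, then zero the value if (x, y) is outside the glyph."""
    # Assumption of this sketch: clamp the address so the list access is always
    # valid; the fetched value is discarded by the mask when out of bounds.
    cx = min(max(x, 0), width - 1)
    cy = min(max(y, 0), height - 1)
    value = glyph[cy * width + cx]
    inside = 0 <= x < width and 0 <= y < height
    return value * (1.0 if inside else 0.0)


def sample_bilinear(glyph, width, height, u, v):
    """Bilinear sample at texel-space position (u, v) using all 4 corners."""
    x0 = math.floor(u)
    y0 = math.floor(v)
    fx = u - x0
    fy = v - y0
    # All four corners are fetched on every call; none is skipped by a branch.
    c00 = fetch_masked(glyph, width, height, x0, y0)
    c10 = fetch_masked(glyph, width, height, x0 + 1, y0)
    c01 = fetch_masked(glyph, width, height, x0, y0 + 1)
    c11 = fetch_masked(glyph, width, height, x0 + 1, y0 + 1)
    top = (1.0 - fx) * c00 + fx * c10
    bottom = (1.0 - fx) * c01 + fx * c11
    return (1.0 - fy) * top + fy * bottom
```

Sampling a fully opaque 2x2 glyph half a texel outside its left edge
blends two zeroed corners with two real ones, so coverage fades out
across the boundary.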
Analyzing the shader with Mali Offline Compiler for Mali-G76 arch:
- Stack spilling: 644 bytes -> none!
- Work registers: unchanged 64
- Uniform registers: 10 -> 18
- Total cycles, arithmetic: 31.8 -> 26.2
- Total cycles, load/store: 114.0 -> 0.0
- Total cycles, texture: 44.0 -> 42.0
Fold various divisions, multiplications and similar arithmetic.
Analyzing the shader with Mali Offline Compiler for Mali-G76 arch:
- Stack spilling: 692 -> 644 bytes
- Total cycles, arithmetic: 33.2 -> 31.8
- Total cycles, load/store: 119.0 -> 114.0
- Total cycles, texture: unchanged 44.0
Avoid the very costly integer division/modulo, as well as the two
texelFetch calls separated by a branch. The font texture width is
known to be a power of two, so the division/modulo can be replaced
with a shift and a mask, which the calling code sets via the
glyph_tex_size uniform.
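A minimal Python sketch of the equivalence, assuming a linear texel
index that is split into (x, y) for a power-of-two width. How the shift
and mask are actually packed into glyph_tex_size is not shown here;
shift_and_mask and the two texel_coord_* helpers are hypothetical names.

```python
def shift_and_mask(width):
    """Caller side: derive (shift, mask) from a power-of-two texture width."""
    if width <= 0 or width & (width - 1) != 0:
        raise ValueError("width must be a power of two")
    return width.bit_length() - 1, width - 1


def texel_coord_divmod(index, width):
    """Reference version: integer modulo and division."""
    return index % width, index // width


def texel_coord_shift_mask(index, shift, mask):
    """Same result for non-negative indices, using only a mask and a shift."""
    return index & mask, index >> shift
```

For example, a width of 1024 gives a shift of 10 and a mask of 1023,
and both helpers map any non-negative index to the same (x, y).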
Analyzing the shader with Mali Offline Compiler for Mali-G76 arch:
- Stack spilling: 724 -> 692 bytes
- Total cycles, arithmetic: 119.5 -> 33.2
- Total cycles, load/store: 161.0 -> 119.0
- Total cycles, texture: 88.0 -> 44.0