Linux: Use huge pages in jemalloc to speed up allocations #116663

Merged
Brecht Van Lommel merged 1 commit from Eugene-Kuznetsov/blender:linux_huge_pages into main 2024-01-16 16:37:50 +01:00
Contributor

This PR requests the Linux kernel to use huge (2 MB) pages for large allocations. This has the effect of speeding up first accesses to those allocations, and possibly also speeds up future accesses by reducing TLB faults. (By default, 4 KB pages are used unless the user enables huge pages through a kernel parameter or an obscure sysfs setting.)

Try the attached program to observe the effect.
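(For readers unfamiliar with the mechanism, here is a minimal sketch of how userspace can request transparent huge pages on Linux. It illustrates the general technique only; it is not the code in this PR, and the request is purely advisory.)

```c
/* Minimal sketch of requesting transparent huge pages (THP) on Linux.
 * Illustrative only -- not the code in this PR. The madvise() call is
 * honored when /sys/kernel/mm/transparent_hugepage/enabled is set to
 * "madvise" or "always". */
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE ((size_t)2 * 1024 * 1024) /* 2 MB */

static void *alloc_with_huge_pages(size_t size)
{
  /* Round the request up to a whole number of 2 MB pages. */
  size_t aligned_size = (size + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);

  void *ptr = NULL;
  if (posix_memalign(&ptr, HUGE_PAGE_SIZE, aligned_size) != 0) {
    return NULL;
  }
  /* Advisory: the kernel may still back the range with 4 KB pages. */
  madvise(ptr, aligned_size, MADV_HUGEPAGE);
  return ptr;
}
```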

Eugene-Kuznetsov force-pushed linux_huge_pages from a59ba4f078 to 257d5c43bf 2023-12-31 05:51:56 +01:00
Iliya Katushenock added this to the Core project 2023-12-31 08:21:06 +01:00
Iliya Katushenock added the Platform: Linux label 2023-12-31 08:21:13 +01:00

Can you say what the numbers are for running the attached test on your system?

Can you demonstrate the performance impact of this in real Blender operations, like rendering, geometry nodes, mesh editing, ...?

In general, different allocation strategies have trade-offs: additional memory usage, fragmentation, etc. If a certain strategy were better for every application it would likely be the default, so can you explain this more?

There may also be bad interactions or redundancy with jemalloc that we use by default for Linux release builds.

So I don't think synthetic test results alone are enough for us to enable this.

Author
Contributor

I originally spotted this while optimizing #116545, where this function call 4f91c770ce/source/blender/nodes/geometry/nodes/node_geo_deform_curves_on_surface.cc (L501) (the "Deform Curves on Surface" geonode) ultimately led to a malloc of 50 MB every frame, followed by an immediate memcpy into it. That memcpy was about 5x slower than it should have been, because it caused a page fault for every 4 KB of the newly allocated memory. The proposed change led to a noticeable reduction in the overall execution time of that geonode. I see an overall improvement in viewport rendering times as well; I'll try to put together some demo files.
The tradeoff is that memory has to be allocated in multiples of 2 MB, so there is some overhead, and it only makes sense for sufficiently large allocations.
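(To put a number on that overhead: a 50 MB buffer is exactly 25 huge pages, so nothing is wasted, while a 3 MB request would round up to 4 MB, a 33% overhead, which is why a size threshold is needed.)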
I see that jemalloc supports the same feature, but huge pages are clearly not on by default in my own release builds. I'll need to investigate this (in fact, I'm not sure jemalloc is even being used).


The main problem is not the speed of allocations, but the fact that they exist.

Author
Contributor

It would of course be better for geonodes to reuse all allocations instead of recreating them every time the node graph is executed, but that does not seem to be possible without major changes to the architecture.


> It would of course be better for geonodes to reuse all allocations instead of recreating them every time the node graph is executed, but that does not seem to be possible without major changes to the architecture.

I think it would be better to do that instead of just making the allocations bigger, especially in the function you are optimizing.


I don't think the geometry nodes execution should be caching allocations; that can significantly increase peak memory usage depending on the scene.

Author
Contributor

> It would of course be better for geonodes to reuse all allocations instead of recreating them every time the node graph is executed, but that does not seem to be possible without major changes to the architecture.

> I think it would be better to do that instead of just making the allocations bigger, especially in the function you are optimizing.

I guess I don't understand what you're suggesting.

This geonode receives an "original" curves object and produces a "new" curves object with modified positions.

The only way to get rid of the allocation is to reuse the previous frame's positions layer, and I don't see how that can be done. The geonode is, by design, basically stateless: it does not remember what the previous allocation was. I had a hard time just figuring out how to keep the results of ReverseUVSampler computations from frame to frame (and I'm not sure I did it entirely legally). Even if I extend the same hack to keep a reference to the previous allocation, that allocation might still be in use when the node is executed again.

P.S. But maybe it can reuse the _input_ layer, since it will almost certainly not be needed after execution of this node?


You can just allocate enough memory once each time, and not use a vector of vectors; I wrote this in my first comment on the original PR, and linked a task for that.

Author
Contributor

This is a completely different allocation.

Eugene-Kuznetsov force-pushed linux_huge_pages from 257d5c43bf to 95fffefefb 2024-01-13 18:52:42 +01:00
Author
Contributor

Looks like I can get the same or better effect just by tweaking the configuration of jemalloc.

`time blender -b monkey_test.blend -a`

Rendering engine set to Cycles, 4 samples per pixel (since we are not interested in benchmarking the GPU).

| Version | time (real), s, monkey_test.blend | time (real), s, rug0113.blend |
| --- | --- | --- |
| huge pages (new) | 12.172, 12.335, 11.831 | 21.633, 21.675, 21.934 |
| huge pages (orig) | 13.252, 12.937, 12.449 | 22.157, 22.371, 22.381 |
| main | 14.292, 14.352, 14.331 | 24.499, 24.572, 24.628 |
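(For context on what that jemalloc tweak can look like: jemalloc 5.x controls THP through its option string. The sketch below uses option names from jemalloc's documentation; whether this PR sets exactly these options is an assumption. The same string can also be supplied at run time through the MALLOC_CONF environment variable.)

```c
/* jemalloc reads this global at startup if the application defines it.
 * "thp:always" makes jemalloc madvise(MADV_HUGEPAGE) on the memory it
 * manages; "metadata_thp:auto" extends that to its own metadata.
 * Option names per jemalloc 5.x docs; the PR's exact configuration
 * may differ. */
const char *malloc_conf = "thp:always,metadata_thp:auto";
```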
Brecht Van Lommel requested changes 2024-01-16 15:03:20 +01:00
Brecht Van Lommel left a comment
Owner

I can confirm speedups in my own tests, will add those in a minute. The changes to `aligned_malloc` I think we should leave out though.
@@ -214,3 +214,3 @@
  len = SIZET_ALIGN_4(len);
- memh = (MemHead *)calloc(1, len + sizeof(MemHead));
+ memh = (MemHead *)aligned_malloc(len + sizeof(MemHead), 8);

I don't think this should be done. With `calloc` the OS can give you memory that it already knows is zero, avoiding the cost of zero-ing it again.
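(An aside for readers outside the review: the point is that for large requests glibc's calloc typically obtains fresh, already-zeroed pages from the kernel and can skip the explicit clear. A hedged illustration of the two patterns being compared, not Blender code:)

```c
/* Illustrative comparison, not Blender code: for large sizes, calloc can
 * hand back kernel-provided zero pages, while malloc+memset must touch
 * and zero every page in user space. */
#include <stdlib.h>
#include <string.h>

void *zeroed_via_calloc(size_t n)
{
  return calloc(1, n); /* large n: typically fresh zero pages, no memset */
}

void *zeroed_via_memset(size_t n)
{
  void *p = malloc(n);
  if (p) {
    memset(p, 0, n); /* faults in and zeroes every page explicitly */
  }
  return p;
}
```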
Eugene-Kuznetsov marked this conversation as resolved
@@ -256,3 +257,3 @@
  len = SIZET_ALIGN_4(len);
- memh = (MemHead *)malloc(len + sizeof(MemHead));
+ memh = (MemHead *)aligned_malloc(len + sizeof(MemHead), 8);

I don't think this helps, `malloc` is already aligning to at least 8 bytes by default.
Author
Contributor

Yeah, both of these changes were purely to make sure that allocations through these API calls got huge pages. They are not needed any more.
brecht marked this conversation as resolved

Here is the Cycles CPU rendering benchmark. There is some timing noise, but overall it's faster.

```
                                         PR116663             jemalloc_flags       aligned_malloc       reference
attic                                    0.5019s              0.4952s              0.5302s              0.5258s
barbershop_interior                      0.8386s              0.8270s              0.8359s              0.8331s
bmw27                                    0.0476s              0.0486s              0.0491s              0.0499s
classroom                                0.6007s              0.6015s              0.6242s              0.6250s
fishy_cat                                0.0764s              0.0760s              0.0777s              0.0784s
junkshop                                 0.3537s              0.3508s              0.4013s              0.4048s
koro                                     0.2106s              0.2092s              0.2165s              0.2202s
ladder                                   0.1541s              0.1567s              0.1593s              0.1624s
monster                                  0.2423s              0.2446s              0.2604s              0.2589s
pabellon                                 0.1864s              0.1869s              0.1908s              0.1945s
sponza                                   0.0842s              0.0844s              0.0877s              0.0876s
spring                                   0.3543s              0.3546s              0.3652s              0.3658s
victor                                   0.4922s              0.5066s              0.5254s              0.5154s
wdas_cloud                               0.2085s              0.2135s              0.2232s              0.2252s
```

What's interesting is that this actually does barely any memory allocations; I guess most of the benefit is coming from better memory layout somehow. Regardless, it's a very nice speedup.

CC @mont29 @ideasman42 @Sergey for visibility, in case making this change has some unexpected side effects later on.

Eugene-Kuznetsov force-pushed linux_huge_pages from 95fffefefb to 4eda6a5d5a 2024-01-16 16:15:15 +01:00
Brecht Van Lommel changed title from Using huge pages for large allocations in Linux to Linux: Use huge pages in jemalloc to speed up allocations 2024-01-16 16:28:01 +01:00
Brecht Van Lommel approved these changes 2024-01-16 16:31:19 +01:00
Brecht Van Lommel merged commit 10dfa07e36 into main 2024-01-16 16:37:50 +01:00
Brecht Van Lommel deleted branch linux_huge_pages 2024-01-16 16:37:52 +01:00

I created a follow-up task in #117175.