Linux: Use huge pages in jemalloc to speed up allocations #116663
This PR asks the Linux kernel to use huge (2 MB) pages for large allocations. This speeds up first accesses to those allocations, and may also speed up later accesses by reducing TLB misses. (By default, 4 KB pages are used unless the user enables huge pages through a kernel parameter or an obscure sysfs setting.)
Try the attached program to observe the effect.
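For reference, a minimal sketch of the kind of test described above (this is not the attached program; the 512 MB buffer size and the mmap + madvise(MADV_HUGEPAGE) approach are illustrative assumptions):

```c
/* Minimal sketch (not the attached test program): allocate a large anonymous
 * mapping, optionally hint the kernel to back it with 2 MB pages, and time the
 * first write to every page. The 512 MB size is an arbitrary choice. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define SIZE (512UL * 1024 * 1024)

static double now_sec(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
  for (int use_thp = 0; use_thp <= 1; use_thp++) {
    void *buf = mmap(
        NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
      return 1;
    }
    if (use_thp) {
      /* Ask the kernel to use transparent huge pages for this mapping. */
      madvise(buf, SIZE, MADV_HUGEPAGE);
    }
    const double t0 = now_sec();
    memset(buf, 1, SIZE); /* First touch: every page faults in here. */
    const double t1 = now_sec();
    printf("%s pages: first touch of %lu MB took %.3f s\n",
           use_thp ? "2 MB" : "4 KB",
           SIZE / (1024 * 1024),
           t1 - t0);
    munmap(buf, SIZE);
  }
  return 0;
}
```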
Force-pushed from a59ba4f078 to 257d5c43bf
Can you say what the numbers are for running the attached test on your system?
Can you demonstrate the performance impact of this in real Blender operations, like rendering, geometry nodes, mesh editing, ...?
In general, different allocation strategies have trade-offs, such as additional memory usage or fragmentation. If a certain strategy were better for every application, it would likely be the default, so can you explain this more?
There may also be bad interactions or redundancy with jemalloc, which we use by default for Linux release builds.
So I don't think synthetic test results alone are enough for us to enable this.
I originally spotted this while optimizing #116545, where this function call
4f91c770ce/source/blender/nodes/geometry/nodes/node_geo_deform_curves_on_surface.cc (L501)
("Deform on Surfaces" geonode) ultimately led to a malloc for 50 MB of memory every frame, followed by an immediate memcpy into it, and that memcpy was about 5x slower than it was supposed to be, because it was causing a page fault for every 4 KB of the newly allocated memory. The proposed change led to a noticeable reduction in overall execution time of that geonode. I see an overall improvement in viewport rendering times as well. I'll try to put together some demo files.The tradeoff is that memory has to be allocated in multiples of 2 MB, so there's some overhead and it only makes sense for sufficiently large allocations.
I see that there's support for the same feature in jemalloc, but huge pages are clearly not on by default in my own release builds; I'll need to investigate this (in fact, I'm not sure whether jemalloc is even being used).
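One way to check this at runtime (a minimal sketch, not part of this PR; it assumes an unprefixed jemalloc 5.1 or newer is linked or preloaded, which may not match Blender's bundled build; link with -ldl on older glibc):

```c
/* Minimal sketch: detect whether jemalloc is the active allocator and, if so,
 * report its transparent-huge-page setting. Assumes an unprefixed jemalloc
 * (symbol "mallctl"); a prefixed build exports "je_mallctl" instead. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

typedef int (*mallctl_fn)(const char *, void *, size_t *, void *, size_t);

int main(void)
{
  mallctl_fn mallctl_p = (mallctl_fn)dlsym(RTLD_DEFAULT, "mallctl");
  if (mallctl_p == NULL) {
    printf("jemalloc does not appear to be in use\n");
    return 0;
  }

  const char *version = NULL;
  size_t len = sizeof(version);
  mallctl_p("version", &version, &len, NULL, 0);
  printf("jemalloc version: %s\n", version ? version : "unknown");

  /* "opt.thp" exists in jemalloc >= 5.1; values are "default", "always", "never". */
  const char *thp = NULL;
  len = sizeof(thp);
  if (mallctl_p("opt.thp", &thp, &len, NULL, 0) == 0) {
    printf("opt.thp: %s\n", thp);
  }
  return 0;
}
```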
The main problem is not the speed of allocations, but the fact that they exist.
It would of course be better for geonodes to reuse all allocations, instead of recreating them every time the node graph is executed, but that does not seem to be possible without major changes to the architecture.
I think it is better to do that, instead of just making the allocations bigger, especially in the function you are optimizing.
I don't think the geometry nodes execution should be caching allocations, that can significantly increase peak memory usage depending on the scene.
I guess I don't understand what you're suggesting.
This geonode receives an "original" curves object and it produces a "new" curves object with modified positions.
The only way to get rid of the allocation is to reuse the previous frame's positions layer, and I don't see how that can be done. The geonode is, by design, basically stateless: it does not remember what the previous allocation was. I had a hard time just figuring out how to keep the results of ReverseUVSampler computations from frame to frame (and I'm not sure if I did it entirely legally). Even if I extend the same hack to keep a reference to the previous allocation, that allocation might still be in use when the node is executed again.
P.S. But maybe it can reuse the input layer, since it will almost certainly not be needed after this node executes?
You can just allocate all the memory needed in one allocation each time, instead of using a vector of vectors; I wrote this in my first comment on the original PR and linked a task for it.
This is a completely different allocation.
Force-pushed from 257d5c43bf to 95fffefefb
Looks like I can get the same or better effect just by tweaking the configuration of jemalloc.
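For illustration, a minimal sketch of how jemalloc's huge-page behaviour can be configured (the option names are jemalloc 5.1+ options and are assumptions here, not necessarily the exact tweak referred to; a prefixed jemalloc build exports je_malloc_conf instead):

```c
/* jemalloc reads its options from this global string (and from the MALLOC_CONF
 * environment variable). "thp:always" makes jemalloc madvise(MADV_HUGEPAGE)
 * the mappings it creates; "metadata_thp:auto" extends that to allocator
 * metadata. Both options require jemalloc >= 5.1. */
const char *malloc_conf = "thp:always,metadata_thp:auto";
```

For a quick experiment without rebuilding, the same options can also be passed through the environment, e.g. running Blender with MALLOC_CONF=thp:always set (this only has an effect when jemalloc is actually the allocator in use).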
time blender -b monkey_test.blend -a
Rendering engine set to Cycles, 4 samples per pixel (since we are not interested in benchmarking the GPU)
I can confirm speedups in my own tests, will add those in a minute. I think we should leave out the changes to aligned_malloc, though.

@@ -214,3 +214,3 @@
     len = SIZET_ALIGN_4(len);
-    memh = (MemHead *)calloc(1, len + sizeof(MemHead));
+    memh = (MemHead *)aligned_malloc(len + sizeof(MemHead), 8);
I don't think this should be done. With calloc the OS can give you memory that it already knows is zero, avoiding the cost of zeroing it again.

@@ -256,3 +257,3 @@
     len = SIZET_ALIGN_4(len);
-    memh = (MemHead *)malloc(len + sizeof(MemHead));
+    memh = (MemHead *)aligned_malloc(len + sizeof(MemHead), 8);
I don't think this helps; malloc already aligns to at least 8 bytes by default.
Yeah, both of these changes were purely to make sure that allocations through these API calls got huge pages. They are not needed anymore.
Here is the Cycles CPU rendering benchmark. There is some timing noise, but overall it's faster.
What's interesting is that this actually barely does any memory allocations; I guess most of the benefit is coming from better memory layout somehow. Regardless, it's a very nice speedup.
CC @mont29 @ideasman42 @Sergey for visibility, in case making this change has some unexpected side effects later on.
Force-pushed from 95fffefefb to 4eda6a5d5a
Changed title from "Using huge pages for large allocations in Linux" to "Linux: Use huge pages in jemalloc to speed up allocations".

I created a follow-up task in #117175.