Linux: Use huge pages in jemalloc to speed up allocations #116663

Merged
Brecht Van Lommel merged 1 commit from Eugene-Kuznetsov/blender:linux_huge_pages into main 2024-01-16 16:37:50 +01:00
Contributor

This PR requests that the Linux kernel use huge (2 MB) pages for large allocations. This speeds up first accesses to those allocations, and possibly also speeds up future accesses by reducing TLB misses. (By default, 4 KB pages are used unless the user enables huge pages through a kernel parameter or an obscure sysfs setting.)

Try the attached program to observe the effect.
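The attached program is not reproduced here; below is a minimal sketch of the same kind of first-touch experiment (an illustration under assumed conditions, not the author's actual program), for a Linux system where transparent huge pages are available:

```cpp
#include <cstddef>
#include <cstring>
#include <sys/mman.h>

int main()
{
  const size_t len = 64 * 1024 * 1024; /* 64 MB, far above the 2 MB huge-page size. */

  /* Anonymous mapping: pages are not materialized until first touch. */
  void *buf = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (buf == MAP_FAILED) {
    return 1;
  }

  /* Advisory hint to back this range with transparent huge pages. It only
   * takes effect when /sys/kernel/mm/transparent_hugepage/enabled is
   * "always" or "madvise" (the "obscure sysfs setting" mentioned above). */
  madvise(buf, len, MADV_HUGEPAGE);

  /* First touch: with 2 MB pages this incurs roughly 512x fewer page faults
   * than with the default 4 KB pages. Time this line with and without the
   * madvise() call to observe the effect. */
  std::memset(buf, 0, len);

  munmap(buf, len);
  return 0;
}
```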

Eugene-Kuznetsov force-pushed linux_huge_pages from a59ba4f078 to 257d5c43bf 2023-12-31 05:51:56 +01:00
Iliya Katushenock added this to the Core project 2023-12-31 08:21:06 +01:00
Iliya Katushenock added the Platform: Linux label 2023-12-31 08:21:13 +01:00

Can you say what the numbers are for running the attached test on your system?

Can you demonstrate the performance impact of this in real Blender operations, like rendering, geometry nodes, mesh editing, ...?

In general different allocation strategies have trade-offs, maybe additional memory usage or fragmentation, etc. If a certain strategy was better for every application it would likely be the default, so can you explain this more?

There may also be bad interactions or redundancy with jemalloc that we use by default for Linux release builds.

So I don't think synthetic test results alone are enough for us to enable this.

Author
Contributor

I originally spotted this as part of the optimization of #116545, where this function call, 4f91c770ce/source/blender/nodes/geometry/nodes/node_geo_deform_curves_on_surface.cc (L501) (the "Deform Curves on Surface" geonode), ultimately led to a malloc of 50 MB of memory every frame, followed by an immediate memcpy into it. That memcpy was about 5x slower than it should have been, because it was causing a page fault for every 4 KB of the newly allocated memory. The proposed change led to a noticeable reduction in the overall execution time of that geonode. I see an overall improvement in viewport rendering times as well. I'll try to put together some demo files.
The tradeoff is that memory has to be allocated in multiples of 2 MB, so there's some overhead (a 3 MB request rounds up to 4 MB, for example) and it only makes sense for sufficiently large allocations.
I see that there's support for the same feature in jemalloc, but huge pages are clearly not on by default in my own release builds; I'll need to investigate this (in fact, I'm not sure if jemalloc is even being used).


The main problem is not the speed of allocations, but the fact that they exist.

Author
Contributor

It would of course be better for geonodes to reuse all allocations, instead of recreating them every time the node graph is executed, but that does not seem to be possible without major changes to the architecture.


> It would of course be better for geonodes to reuse all allocations, instead of recreating them every time the node graph is executed, but that does not seem to be possible without major changes to the architecture.

I think it is better to do that, instead of just making allocations bigger? Especially in the function you are optimizing.


I don't think the geometry nodes execution should be caching allocations; that can significantly increase peak memory usage depending on the scene.

Author
Contributor

> > It would of course be better for geonodes to reuse all allocations, instead of recreating them every time the node graph is executed, but that does not seem to be possible without major changes to the architecture.

> I think it is better to do that, instead of just making allocations bigger? Especially in the function you are optimizing.

I guess I don't understand what you're suggesting.

This geonode receives an "original" curves object and produces a "new" curves object with modified positions.

The only way to get rid of the allocation is to reuse the previous frame's positions layer, and I don't see how that can be done. The geonode is, by design, basically stateless. It does not remember what the previous allocation was. I had a hard time just figuring out how to keep results of ReverseUVSampler computations from frame to frame (and I'm not sure if I did it entirely legally). Even if I extend the same hack to keep a reference to the previous allocation, that allocation might still be in use when the node is executed again.

P.S. But maybe it can reuse the *input* layer, since it will almost certainly not be needed after execution of this node?


You can just allocate enough memory once each time, and not use a vector of vectors; I wrote this in my first comment on the original PR, and linked a task for that.

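For illustration, a minimal sketch of the single-allocation layout suggested above (hypothetical names, not actual Blender code): instead of a `std::vector<std::vector<float>>` with one heap allocation per row, compute the total size up front and back all rows with one flat buffer.

```cpp
#include <cstddef>
#include <vector>

struct FlatRows {
  std::vector<float> data;     /* one allocation backing all rows */
  std::vector<size_t> offsets; /* offsets.size() == row count + 1 */

  explicit FlatRows(const std::vector<size_t> &row_sizes)
  {
    offsets.reserve(row_sizes.size() + 1);
    offsets.push_back(0);
    for (const size_t n : row_sizes) {
      offsets.push_back(offsets.back() + n);
    }
    data.resize(offsets.back()); /* single allocation instead of one per row */
  }

  float *row(const size_t i)
  {
    return data.data() + offsets[i];
  }
};
```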
Author
Contributor

This is a completely different allocation.

Eugene-Kuznetsov force-pushed linux_huge_pages from 257d5c43bf to 95fffefefb 2024-01-13 18:52:42 +01:00
Author
Contributor

Looks like I can get the same or better effect just by tweaking the configuration of jemalloc.

`time blender -b monkey_test.blend -a`
Rendering engine set to Cycles, 4 samples per pixel (since we are not interested in benchmarking the GPU).

| Version | time (real), s, monkey_test.blend | time (real), s, rug0113.blend |
| --- | --- | --- |
| huge pages (new) | 12.172, 12.335, 11.831 | 21.633, 21.675, 21.934 |
| huge pages (orig) | 13.252, 12.937, 12.449 | 22.157, 22.371, 22.381 |
| main | 14.292, 14.352, 14.331 | 24.499, 24.572, 24.628 |
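For reference, a hedged sketch of what "tweaking the configuration of jemalloc" can look like. The `thp` and `metadata_thp` options are real jemalloc (>= 5.1) settings, but the exact option string this patch ended up using may differ:

```cpp
/* An application that links jemalloc can set default options by defining
 * this global (a documented jemalloc mechanism). The same string also works
 * at runtime via the environment for a dynamically linked jemalloc, e.g.
 *   MALLOC_CONF="thp:always,metadata_thp:auto" blender -b monkey_test.blend -a
 * thp:always makes jemalloc madvise(MADV_HUGEPAGE) its large extents. */
extern "C" const char *malloc_conf = "thp:always,metadata_thp:auto";
```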
Brecht Van Lommel requested changes 2024-01-16 15:03:20 +01:00
Brecht Van Lommel left a comment
Owner

I can confirm speedups in my own tests, will add those in a minute. I think we should leave out the changes to `aligned_malloc`, though.

```diff
@@ -214,3 +214,3 @@
 len = SIZET_ALIGN_4(len);
-memh = (MemHead *)calloc(1, len + sizeof(MemHead));
+memh = (MemHead *)aligned_malloc(len + sizeof(MemHead), 8);
```

I don't think this should be done. With `calloc` the OS can give you memory that it already knows is zero, avoiding the cost of zeroing it again.

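A small sketch of the distinction raised above (illustrative helpers, not Blender API): for large blocks, `calloc` can hand out fresh anonymous pages the kernel already guarantees are zero-filled, while `malloc` plus explicit clearing forces every page to be touched and zeroed in user space.

```cpp
#include <cstdlib>
#include <cstring>

static void *alloc_zeroed_cheap(size_t n)
{
  return std::calloc(1, n); /* may reuse kernel zero pages for free */
}

static void *alloc_zeroed_expensive(size_t n)
{
  void *p = std::malloc(n);
  if (p) {
    std::memset(p, 0, n); /* touches and clears every page explicitly */
  }
  return p;
}
```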
Eugene-Kuznetsov marked this conversation as resolved
```diff
@@ -256,3 +257,3 @@
 len = SIZET_ALIGN_4(len);
-memh = (MemHead *)malloc(len + sizeof(MemHead));
+memh = (MemHead *)aligned_malloc(len + sizeof(MemHead), 8);
```

I don't think this helps; `malloc` is already aligning to at least 8 bytes by default.

Author
Contributor

Yeah, both of these changes were purely to make sure that allocations through these API calls got huge pages. They are no longer needed.

brecht marked this conversation as resolved

Here is the Cycles CPU rendering benchmark. There is some timing noise, but overall it's faster.

```
                                         PR116663             jemalloc_flags       aligned_malloc       reference
attic                                    0.5019s              0.4952s              0.5302s              0.5258s
barbershop_interior                      0.8386s              0.8270s              0.8359s              0.8331s
bmw27                                    0.0476s              0.0486s              0.0491s              0.0499s
classroom                                0.6007s              0.6015s              0.6242s              0.6250s
fishy_cat                                0.0764s              0.0760s              0.0777s              0.0784s
junkshop                                 0.3537s              0.3508s              0.4013s              0.4048s
koro                                     0.2106s              0.2092s              0.2165s              0.2202s
ladder                                   0.1541s              0.1567s              0.1593s              0.1624s
monster                                  0.2423s              0.2446s              0.2604s              0.2589s
pabellon                                 0.1864s              0.1869s              0.1908s              0.1945s
sponza                                   0.0842s              0.0844s              0.0877s              0.0876s
spring                                   0.3543s              0.3546s              0.3652s              0.3658s
victor                                   0.4922s              0.5066s              0.5254s              0.5154s
wdas_cloud                               0.2085s              0.2135s              0.2232s              0.2252s
```

What's interesting is that this actually barely does any memory allocations; I guess most of the benefit is coming from better memory layout somehow. Regardless, it's a very nice speedup.

CC @mont29 @ideasman42 @Sergey for visibility, in case making this change has some unexpected side effects later on.

Eugene-Kuznetsov force-pushed linux_huge_pages from 95fffefefb to 4eda6a5d5a 2024-01-16 16:15:15 +01:00
Brecht Van Lommel changed title from Using huge pages for large allocations in Linux to Linux: Use huge pages in jemalloc to speed up allocations 2024-01-16 16:28:01 +01:00
Brecht Van Lommel approved these changes 2024-01-16 16:31:19 +01:00
Brecht Van Lommel merged commit 10dfa07e36 into main 2024-01-16 16:37:50 +01:00
Brecht Van Lommel deleted branch linux_huge_pages 2024-01-16 16:37:52 +01:00

I created a follow-up task in #117175.
