July 3rd, 2024

Meta Sees ~5% Performance Gains from Optimizing the Linux Kernel with BOLT

Meta (Facebook) uses BOLT to optimize the Linux kernel's binary layout, yielding roughly a 5% performance boost. The benefit varies with how much time a workload spends in the kernel, with database servers and network-heavy services gaining the most. Meta engineer Maksim Panchenko has published an optimization guide.

Meta/Facebook has been experimenting with using BOLT to optimize the Linux kernel's binary layout, resulting in around a 5% performance improvement over the default build. The benefit depends on how much time an application spends in kernel space, so workloads such as database servers and network-intensive services gain the most. Meta engineer Maksim Panchenko recently published a guide for optimizing the Linux kernel with BOLT, explaining that the speedup comes from reducing instruction cache misses and branch mispredictions. The guide notes that the overall gain scales with the share of time a given application spends in the kernel. This optimization work has been ongoing since BOLT was merged into LLVM, showing tangible kernel-performance gains for specific workloads.
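
The article points to Panchenko's guide rather than reproducing its steps. As a rough, hypothetical illustration, the general BOLT workflow documented in the llvm-project repository looks like the sketch below; the binary name ./workload and the profile file names perf.data / perf.fdata are placeholders, and the kernel guide adapts the same idea to the vmlinux image rather than a user-space binary.

    # 1. Collect a sampled profile with last-branch-record (LBR) data
    #    while a representative workload is running.
    perf record -e cycles:u -j any,u -o perf.data -- ./workload

    # 2. Convert the perf profile into BOLT's input format.
    perf2bolt -p perf.data -o perf.fdata ./workload

    # 3. Rewrite the binary with a profile-driven code layout to reduce
    #    instruction-cache misses and branch mispredictions.
    llvm-bolt ./workload -o workload.bolt --data=perf.fdata \
        --reorder-blocks=ext-tsp --reorder-functions=hfsort \
        --split-functions --split-all-cold --dyno-stats

The flags above come from BOLT's user-space examples and may differ from what the kernel guide recommends; in both cases the sequence is the same: collect a profile, convert it, and rewrite the binary.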

Related

NUMA Emulation Yields "Significant Performance Uplift" to Raspberry Pi 5

Engineers at Igalia developed NUMA emulation for ARM64, enhancing Raspberry Pi 5 performance. Linux kernel patches showed 18% multi-core and 6% single-core improvement in Geekbench tests. The concise code may be merged into the mainline kernel for broader benefits.

How eBPF is shaping the future of Linux and platform engineering

eBPF, developed by Daniel Borkmann, revolutionizes Linux by enabling custom programs in the kernel. It enhances networking, security, and observability, bridging monolithic and microkernel architectures for improved performance and flexibility.

Four lines of code it was four lines of code

The programmer resolved a CPU utilization issue by removing unnecessary Unix domain socket code from a TCP and TLS service handler. The debugging process emphasized meticulous code review and a solid understanding of system interactions.

Beating NumPy's matrix multiplication in 150 lines of C code

Aman Salykov's blog delves into high-performance matrix multiplication in C, outperforming NumPy (backed by OpenBLAS) on an AMD Ryzen 7700 CPU. The code is scalable and portable, parallelized with OpenMP, and targets Intel Core and AMD Zen CPUs. It discusses BLAS, CPU performance limits, and hints at GPU optimization.

Do not taunt happy fun branch predictor

The author shares insights on optimizing AArch64 assembly code by reducing jumps in loops. Replacing ret with br x30 improved performance, leading to an 8.8x speed increase. Considerations on branch prediction and SIMD instructions are discussed.

10 comments
By @fooker - 5 months
Interestingly Google went ahead and made their own version of Bolt after it was already open source.

It's called Propeller and it had some purported advantages afaik.

Anyone know if such large scale experiments have been conducted with this for the sake of comparison?

By @nickcw - 5 months
Here is an article on the Meta engineering blog about BOLT

https://engineering.fb.com/2018/06/19/data-infrastructure/ac...

By @snehasish - 5 months
The gains are dependent on how much time the workload spends in the kernel. The Propeller team showed 2% performance improvement over PGO+LTO in this LLVM discussion post: https://discourse.llvm.org/t/optimizing-the-linux-kernel-wit...

More details about Propeller are available in a recently published paper: https://research.google/pubs/propeller-a-profile-guided-reli...

By @lawlessone - 5 months
Is this specific to big programs running on data centers or could it be applied to programs people run locally on their computers or phones?

I'd love to know if the kernel could be improved in this way for PC / Android users too.

By @Yoric - 5 months
I seem to remember that glandium had done something like this on the Firefox binary to optimize startup, with an impressive speedup. Can't find the details, though.

By @djmips - 5 months
If some of the optimization gain comes from bad alignment of potentially fused operations that can't fuse because they straddle a 64-byte cache-line boundary, it feels like this is something the compiler should be aware of and mitigating.

By @nsguy - 5 months
Aren't profile-guided optimizers capable of doing similar optimizations?

By @bjourne - 5 months
If it increases performance by 5%, then 5% of execution time was spent on branch mispredictions and instruction cache misses... which sounds utterly implausible. Conventional wisdom has it that instruction caching is not a problem because, whatever the size of the binary, it is dwarfed by the size of the data. And hot loops are generally no more than a few KBs at most anyway. I'm skeptical.

By @pvg - 5 months
This looks like iffy blogspam of https://github.com/llvm/llvm-project/blob/main/bolt/docs/Opt...

The claim in it is 'up to 5% improvement' so the title seems overstated as well.