Meta Sees ~5% Performance Gains from Optimizing the Linux Kernel with BOLT
Meta uses BOLT to optimize the Linux kernel's binary layout, yielding around a 5% performance boost. Benefits vary with how much time a workload spends in the kernel, with database servers and network-intensive tasks benefiting most. Meta engineer Maksim Panchenko has published an optimization guide.
Meta/Facebook has been experimenting with using BOLT to optimize the Linux kernel's binary layout, measuring roughly a 5% performance improvement over the default build. The benefit varies with how much time an application spends in kernel space: database servers and network-intensive workloads gain the most. Meta engineer Maksim Panchenko recently published a guide to optimizing the Linux kernel with BOLT, explaining that the speedup comes from reducing instruction cache misses and branch mispredictions through profile-guided code layout. Users following the guide can estimate their own gains by scaling the improvement by their application's share of kernel time. This optimization work has been ongoing since BOLT was integrated into LLVM and shows tangible kernel-performance gains for specific workloads.
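For readers unfamiliar with BOLT, the sketch below shows the generic post-link optimization workflow for an ordinary user-space binary. It is an illustration only, not the kernel-specific procedure from Panchenko's guide; the binary name my_app and the workload invocation are placeholders, and the exact llvm-bolt options vary between BOLT releases.

  # Illustration only: generic BOLT workflow for a user-space binary.
  # 1. Collect a sampled profile with branch (LBR) data while running a
  #    representative workload against the unoptimized binary.
  perf record -e cycles:u -j any,u -o perf.data -- ./my_app --run-workload

  # 2. Convert the perf profile into BOLT's input format.
  perf2bolt -p perf.data -o perf.fdata ./my_app

  # 3. Rewrite the binary with a profile-guided code layout; this is where
  #    the reduced i-cache misses and branch mispredictions come from.
  llvm-bolt ./my_app -o ./my_app.bolt -data=perf.fdata \
      -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions

The same idea applies to the kernel case, except that the profile is taken from kernel execution and the relaid-out image is what gets booted.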
Related
NUMA Emulation Yields "Significant Performance Uplift" to Raspberry Pi 5
Engineers at Igalia developed NUMA emulation for ARM64, improving Raspberry Pi 5 performance. Their Linux kernel patches showed an 18% multi-core and 6% single-core improvement in Geekbench tests, and the small patch set may be merged into the mainline kernel for broader benefit.
How eBPF is shaping the future of Linux and platform engineering
eBPF, co-developed by Daniel Borkmann, revolutionizes Linux by enabling custom programs to run safely inside the kernel. It enhances networking, security, and observability, bridging monolithic and microkernel architectures for improved performance and flexibility (a minimal example appears after this list).
Four lines of code... it was four lines of code
The programmer resolved a CPU utilization issue by removing unnecessary Unix domain socket code from a TCP and TLS service handler. The debugging process underscored the value of meticulous code review and an understanding of how system components interact.
Beating NumPy's matrix multiplication in 150 lines of C code
Aman Salykov's blog post delves into high-performance matrix multiplication in C, surpassing NumPy backed by OpenBLAS on an AMD Ryzen 7700 CPU. The scalable, portable code uses OpenMP and targets Intel Core and AMD Zen CPUs, with discussion of BLAS, CPU performance limits, and hints at GPU optimization.
Do not taunt happy fun branch predictor
The author shares insights on optimizing AArch64 assembly code by reducing jumps in loops. Replacing ret with br x30 improved performance, leading to an 8.8x speed increase. Considerations on branch prediction and SIMD instructions are discussed.
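To make the eBPF item above concrete, here is a minimal, hypothetical observability one-liner using bpftrace, a common eBPF front end (it assumes bpftrace is installed and is run with root privileges); it attaches a small eBPF program to a kernel tracepoint and counts system-call entries per process name:

  # Count syscall entries per process name until Ctrl-C, then print the map.
  sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

This is only one of the use cases the article covers; networking and security programs follow the same attach-and-run model.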
It's called Propeller and it had some purported advantages afaik.
Anyone know if such large-scale experiments have been conducted with this for the sake of comparison?
https://engineering.fb.com/2018/06/19/data-infrastructure/ac...
More details about Propeller are available in a recently published paper: https://research.google/pubs/propeller-a-profile-guided-reli...
I'd love to know if the kernel could be improved in this way for PC / Android users too.
The claim in the article is 'up to 5% improvement', so the title seems overstated as well.