September 13th, 2024

GNU C Library Tuning for AArch64 Helps Memset Performance by ~24%

A recent glibc patch improves memset() performance by about 24% on Arm Neoverse-N1 cores by reworking the hand-tuned assembly code and using the DC ZVA instruction. It is expected to ship in glibc 2.41, due in February 2025.


A recent patch to the GNU C Library (glibc) improves the performance of memset() by approximately 24% on the Arm Neoverse-N1 core. The optimization, implemented by Wilco Dijkstra of Arm, refines the hand-tuned assembly code to speed up small memsets by avoiding branches and using overlapping stores. The patch also uses the DC ZVA (Data Cache Zero by Virtual Address) instruction for memsets larger than 128 bytes and removes unnecessary code for ZVA block sizes other than 64 and 128 bytes. Because DC ZVA can only write zeros, it applies to zero fills. The gain is particularly relevant for systems built on Neoverse-N1, such as Ampere Altra and Altra Max servers. The optimization is expected to be included in the upcoming glibc 2.41 release, scheduled for February 2025. Its impact on other Arm cores remains to be evaluated.
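The overlapping-store idea can be illustrated in portable C. The following is only a sketch under stated assumptions, not the glibc code, which is hand-written AArch64 assembly: `small_memset` is a hypothetical helper that handles sizes from 0 to 16 bytes, and the size classes are chosen for illustration. Instead of looping over the exact byte count, each size class issues a fixed number of stores, with the last store anchored to the end of the buffer so that it may overlap the first.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not glibc's implementation): fill n bytes, 0 <= n <= 16.
 * Each size class uses a fixed number of stores; the final store is anchored
 * at the end of the buffer and may overlap the first one. */
static void small_memset(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    unsigned char b = (unsigned char)c;
    /* Replicate the fill byte into all 8 bytes of a 64-bit word. */
    uint64_t v = 0x0101010101010101ULL * b;

    if (n >= 8) {
        /* 8..16 bytes: one 8-byte store at the start, one ending at p + n. */
        memcpy(p, &v, 8);
        memcpy(p + n - 8, &v, 8);
    } else if (n >= 4) {
        /* 4..7 bytes: same trick with two 4-byte stores. */
        uint32_t w = (uint32_t)v;
        memcpy(p, &w, 4);
        memcpy(p + n - 4, &w, 4);
    } else if (n > 0) {
        /* 1..3 bytes: first, middle, and last byte; indices may coincide. */
        p[0] = b;
        p[n / 2] = b;
        p[n - 1] = b;
    }
}
```

The fixed-size `memcpy` calls are typically compiled to single unaligned stores. The sketch still branches on the size class, but within a class no per-byte loop or exact-length branch is needed.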

- A 24% performance improvement for memset() on Arm Neoverse-N1 cores.

- Optimization involves avoiding branches and using overlapping stores.

- DC ZVA instruction used for memsets over 128 bytes.

- Patch expected to be included in the glibc 2.41 release in February 2025.

- Potential for performance improvements on other Arm architectures is still to be assessed.
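As a rough illustration of how DC ZVA can be used for large zero fills, here is a minimal C sketch with inline assembly. It is not the glibc implementation: `zva_block_size` and `zero_fill` are hypothetical names, the threshold of two blocks is an assumption made for the example, and on non-AArch64 targets the code simply falls back to memset(). The block size is read from the DCZID_EL0 register, whose low four bits give log2 of the block size in 4-byte words and whose bit 4 indicates that DC ZVA is prohibited.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: DC ZVA block size in bytes, or 0 if unusable. */
static size_t zva_block_size(void)
{
#if defined(__aarch64__)
    uint64_t dczid;
    __asm__ ("mrs %0, dczid_el0" : "=r"(dczid));
    if (dczid & 16)                     /* DZP bit: DC ZVA is prohibited */
        return 0;
    return (size_t)4 << (dczid & 15);   /* 4 bytes * 2^BS */
#else
    return 0;                           /* not AArch64: no DC ZVA */
#endif
}

/* Hypothetical helper: zero n bytes, using DC ZVA for whole aligned blocks.
 * The 2 * block threshold is an assumption chosen for this sketch. */
static void zero_fill(void *dst, size_t n)
{
    unsigned char *p = dst;
    size_t block = zva_block_size();

    if (block == 0 || n < 2 * block) {
        memset(p, 0, n);
        return;
    }

    /* Zero the unaligned head up to the next block boundary. */
    size_t head = (size_t)(-(uintptr_t)p & (block - 1));
    memset(p, 0, head);
    p += head;
    n -= head;

#if defined(__aarch64__)
    /* Each DC ZVA zeroes one naturally aligned block. */
    for (; n >= block; p += block, n -= block)
        __asm__ volatile ("dc zva, %0" : : "r"(p) : "memory");
#endif

    /* Zero whatever is left after the last whole block. */
    memset(p, 0, n);
}
```

A production routine would cache the block size instead of reading the register on every call.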

Related

Memory sealing for the GNU C Library

The GNU C Library is adding support for the Linux kernel's mseal() system call, which improves security by preventing later changes to sealed regions of the address space. Adhemerval Zanella's patch series adds this support for upcoming releases.

NUMA Emulation Yields "Significant Performance Uplift" to Raspberry Pi 5

Engineers at Igalia developed NUMA emulation for ARM64, enhancing Raspberry Pi 5 performance. Linux kernel patches showed 18% multi-core and 6% single-core improvement in Geekbench tests. The concise code may be merged into the mainline kernel for broader benefits.

Integrated assembler improvements in LLVM 19

LLVM 19 brings significant enhancements to the integrated assembler, focusing on the MC library for assembly, disassembly, and object file formats. Performance improvements include optimized fragment sizes, streamlined symbol handling, and simplified expression evaluation. These changes aim to boost performance, reduce memory usage, and lay the groundwork for future enhancements.

Do not taunt happy fun branch predictor

The author shares insights on optimizing AArch64 assembly code by reducing jumps in loops. Replacing ret with br x30 improved performance, leading to an 8.8x speed increase. Considerations on branch prediction and SIMD instructions are discussed.

Intel Further Speeds Up Strnlen() in the GNU C Library for Recent Intel/AMD CPUs

Intel has optimized the strnlen() function in glibc for better performance on modern CPUs, unifying implementations and showing significant improvements in benchmark tests. The update will be in glibc 2.41.
