October 1st, 2024

tolower() small string performance

Tony Finch's blog post analyzes the performance of the `tolower()` function for small strings, revealing that scalar code is faster for strings under 5 bytes, while AVX-512 excels for longer strings.


Tony Finch's recent blog post examines the performance of the `tolower()` function on small strings, up to 100 bytes long. It builds on his earlier work on larger strings and aims to find the crossover point between scalar code and AVX-512 with masked loads and stores. Finch tested several implementations of `memcpy` and `tolower`, measuring performance with Linux's `perf_event_open(2)`. The results put the crossover at 5 bytes: scalar code is faster for shorter strings, and AVX-512 is faster for longer ones. The study also highlights that Clang's AVX-256 implementation is more efficient for strings between 32 and 256 bytes. Benchmarking proved difficult: timing measurements for small strings came out implausibly low, which Finch resolved by adding memory fences. He concludes that benchmarking is complex and invites suggestions for improving measurement accuracy. The findings point to a need for further optimization of small-string handling in performance-critical applications.
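To make the scalar versus masked AVX-512 trade-off concrete, here is a minimal sketch in C. It is not Finch's code: the names `copy_tolower`, `tolower_scalar`, `tolower_avx512_block`, and `SCALAR_CUTOFF` are hypothetical, the conversion is ASCII-only, and the 5-byte cutoff is simply the crossover reported above. The AVX-512 path is compiled only when the compiler defines `__AVX512BW__` (for example with `-mavx512bw`); otherwise the sketch falls back to the scalar loop.

```c
#include <stddef.h>
#include <stdint.h>
#ifdef __AVX512BW__
#include <immintrin.h>
#endif

/* Assumption: the 5-byte crossover reported in the summary above. */
#define SCALAR_CUTOFF 5

/* Scalar path: ASCII-only lowercase, one byte at a time (no locale). */
static void tolower_scalar(unsigned char *dst, const unsigned char *src,
                           size_t len) {
    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];
        dst[i] = (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
    }
}

#ifdef __AVX512BW__
/* AVX-512BW path for up to 64 bytes. The mask `live` selects the first
 * `len` byte lanes, so the load and store never touch memory past the
 * end of the string. */
static void tolower_avx512_block(unsigned char *dst, const unsigned char *src,
                                 size_t len) {
    __mmask64 live = (len >= 64) ? ~0ULL : ((1ULL << len) - 1);
    __m512i v = _mm512_maskz_loadu_epi8(live, src);
    /* Signed compares: bytes >= 0x80 are negative and never match. */
    __mmask64 upper = _mm512_cmpge_epi8_mask(v, _mm512_set1_epi8('A')) &
                      _mm512_cmple_epi8_mask(v, _mm512_set1_epi8('Z'));
    /* Add 0x20 only in lanes holding 'A'..'Z'. */
    v = _mm512_mask_add_epi8(v, upper, v, _mm512_set1_epi8('a' - 'A'));
    _mm512_mask_storeu_epi8(dst, live, v);
}
#endif

/* Hypothetical dispatcher: scalar below the cutoff, vector otherwise. */
void copy_tolower(unsigned char *dst, const unsigned char *src, size_t len) {
#ifdef __AVX512BW__
    if (len >= SCALAR_CUTOFF) {
        while (len > 64) {
            tolower_avx512_block(dst, src, 64);
            dst += 64;
            src += 64;
            len -= 64;
        }
        tolower_avx512_block(dst, src, len);
        return;
    }
#endif
    tolower_scalar(dst, src, len);
}
```

The masked load and store are what let a single vector instruction handle any length from 0 to 64 bytes without a scalar tail loop, which is why the crossover can sit at only a few bytes. The right cutoff on a given machine would have to be measured rather than assumed.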

- The crossover between scalar code and AVX-512 is at 5 bytes: scalar code is faster for shorter strings, and AVX-512 is faster for longer ones.

- Clang's AVX-256 implementation is more efficient for strings between 32 and 256 bytes.

- Benchmarking small string performance presents significant challenges.

- The results may not accurately reflect performance in real-world applications due to inlining and optimization.
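The benchmarking difficulty noted above can be illustrated with a small timing harness. This is a sketch under stated assumptions, not Finch's setup: the summary does not say which fence instruction he used or where he placed it, so this version uses C11 `atomic_thread_fence`, and it substitutes the portable C11 `timespec_get` wall clock for `perf_event_open(2)`. The names `time_calls_ns`, `copy_fn`, and `copy_plain` are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical signature shared by the functions under test. */
typedef void (*copy_fn)(unsigned char *dst, const unsigned char *src,
                        size_t len);

/* Trivial stand-in workload: a plain byte copy. */
void copy_plain(unsigned char *dst, const unsigned char *src, size_t len) {
    for (size_t i = 0; i < len; i++) dst[i] = src[i];
}

/* Wall-clock time in nanoseconds (C11 timespec_get; not monotonic). */
static uint64_t now_ns(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Time `reps` calls of `fn`. The fence after each call asks the compiler
 * and CPU to complete the call's memory writes before continuing, so
 * successive short calls are less likely to overlap or be reordered
 * around the timer reads. */
uint64_t time_calls_ns(copy_fn fn, unsigned char *dst,
                       const unsigned char *src, size_t len, unsigned reps) {
    atomic_thread_fence(memory_order_seq_cst);
    uint64_t start = now_ns();
    for (unsigned i = 0; i < reps; i++) {
        fn(dst, src, len);
        atomic_thread_fence(memory_order_seq_cst);
    }
    uint64_t end = now_ns();
    return end - start;
}
```

A call such as `time_calls_ns(copy_plain, dst, src, 8, 1000000)` returns the total elapsed nanoseconds; dividing by the repetition count gives a rough per-call figure. The fences themselves add cost, so numbers from a harness like this are best read as relative comparisons between implementations.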

Do not taunt happy fun branch predictor

The author shares insights on optimizing AArch64 assembly code by reducing jumps in loops. Replacing ret with br x30 improved performance, leading to an 8.8x speed increase. Considerations on branch prediction and SIMD instructions are discussed.

Counting Bytes Faster Than You'd Think Possible

Matt Stuchlik's high-performance computing method counts bytes with a value of 127 in a 250MB stream, achieving 550 times faster performance using SIMD instructions and an innovative memory read pattern.

tolower() with AVX-512

Tony Finch's blog post details the implementation of the tolower() function using AVX-512-BW SIMD instructions, optimizing string processing and outperforming standard methods, particularly for short strings.

Strlcpy and how CPUs can defy common sense

The article compares the performance of `strlcpy` in OpenBSD and glibc, revealing glibc's faster execution despite double traversal, emphasizing instruction-level parallelism and advocating for sized strings for efficiency.

Intel Further Speeds Up Strnlen() in the GNU C Library for Recent Intel/AMD CPUs

Intel has optimized the strnlen() function in glibc for better performance on modern CPUs, unifying implementations and showing significant improvements in benchmark tests. The update will be in glibc 2.41.


