tolower() with AVX-512
Tony Finch's blog post details the implementation of the tolower() function using AVX-512-BW SIMD instructions, optimizing string processing and outperforming standard methods, particularly for short strings.
Tony Finch's blog post discusses the implementation of the tolower() function using AVX-512-BW SIMD instructions, focusing on optimizing string processing. Finch references previous work on using SWAR techniques for bulk string conversion and highlights Olivier Giniaux's article on handling small strings with SIMD for fast hash functions. He notes the challenges of transferring short strings between memory and vector registers and explores the potential of masked loads and stores in AVX-512-BW, which is available on recent AMD Zen processors.
The post details the creation of a tolower64() function that processes 64 bytes at once, utilizing Intel intrinsics for efficient byte-wise operations. Finch explains the use of masked operations to handle both long and short strings effectively, emphasizing the performance benefits of AVX-512-BW. Benchmarking results show that his implementation consistently outperforms other methods, including standard tolower() functions and various memcpy versions.
Finch concludes that AVX-512-BW is advantageous for string manipulation, particularly for short strings, and expresses a desire for wider availability of such instruction set extensions to enhance string handling performance. He also mentions the potential of ARM SVE for similar applications, although he has not tested it personally. The complete code for the discussed implementations is available on his website, inviting feedback from readers.
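The SWAR bulk-conversion work the post builds on can be sketched in portable C. The following is a rough illustration of the idea (names and constants are mine, not code from the post): process 8 bytes per 64-bit word, compute a per-byte "is uppercase ASCII" mask with carry-free range checks, and OR in the 0x20 case bit wherever the mask is set.

```c
#include <stdint.h>
#include <string.h>

/* One 64-bit SWAR step of ASCII tolower: bytes in 'A'..'Z' get bit 0x20
   set; everything else, including non-ASCII bytes (top bit set), passes
   through unchanged. */
static uint64_t tolower8(uint64_t word) {
    const uint64_t ones = 0x0101010101010101;
    uint64_t heptets = word & (0x7F * ones);   /* clear top bit of each byte  */
    uint64_t ge_A = heptets + (0x3F * ones);   /* top bit set iff byte >= 'A' */
    uint64_t gt_Z = heptets + (0x25 * ones);   /* top bit set iff byte >  'Z' */
    uint64_t upper = (ge_A & ~gt_Z & ~word) & (0x80 * ones);
    return word | (upper >> 2);                /* 0x80 >> 2 == 0x20           */
}

/* Apply to a buffer; memcpy expresses the unaligned loads and stores,
   and a scalar loop handles the sub-8-byte tail. */
static void tolower_swar(char *dst, const char *src, size_t len) {
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, src + i, 8);
        w = tolower8(w);
        memcpy(dst + i, &w, 8);
    }
    for (; i < len; i++) {
        char c = src[i];
        dst[i] = (c >= 'A' && c <= 'Z') ? c + 0x20 : c;
    }
}
```

The AVX-512-BW version replaces this word-at-a-time masking with 64-byte vectors and hardware lane masks, which is where the short-string wins come from.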
Related
Do not taunt happy fun branch predictor
The author shares insights on optimizing AArch64 assembly code by reducing jumps in loops. Replacing ret with br x30 improved performance, leading to an 8.8x speed increase. Considerations on branch prediction and SIMD instructions are discussed.
Scan HTML faster with SIMD instructions: .NET/C# Edition
WebKit and Chromium enhance HTML content scanning with fast SIMD routines, boosting performance significantly. .NET8 now supports speedy SIMD instructions for C#, achieving impressive speeds comparable to C/C++ implementations.
Summing ASCII encoded integers on Haswell at almost the speed of memcpy
Matt Stuchlik presents a high-performance algorithm for summing ASCII-encoded integers on Haswell systems. It utilizes SIMD instructions, lookup tables, and efficient operations to achieve speed enhancements, showcasing innovative approaches in integer sum calculations.
Scan HTML even faster with SIMD instructions (C++ and C#)
Major web engines like WebKit/Safari and Chromium/Chrome/Edge/Brave have implemented accelerated HTML parsing using SIMD instructions for faster performance. Results show significant speed improvements, especially with 64-bit vectorized classification methods, promising enhanced HTML parsing efficiency.
Counting Bytes Faster Than You'd Think Possible
Matt Stuchlik's high-performance computing method counts bytes with a value of 127 in a 250MB stream, achieving 550 times faster performance using SIMD instructions and an innovative memory read pattern.
- Concerns about undefined behavior in Rust and LLVM memory models, particularly regarding unsafe memory access.
- Speculation on the competition between AMD's AVX512 and Intel's upcoming AVX10, with a focus on performance and adoption issues.
- Discussion of SWAR optimizations and their limitations, especially regarding string alignment and performance implications.
- Interest in the handling of Unicode characters and the broader implications of string processing beyond ASCII.
- Technical insights into assembly code and performance comparisons between different compilers and architectures.
The big gains delivered in the article are all on a double-pumped Zen 4 core! AVX512 brings a lot to the table, so it's quite frustrating that Intel market-segmented support for it so heavily as to completely inhibit its adoption in broad-based client code.
Is that an error in the post? Shouldn't it do the addition when is_upper is false and copy the byte unchanged when it is true?
If your SWAR algorithm is applied on a non-aligned string, it is often slower than the original algorithm.
And splitting the algorithm in three parts (handling the beginning up to an aligned address, then the aligned part, and then the less-than-8-bytes tail) takes even more instructions.
Here is a similar case on a false claim of a faster utf8.IsValid in Go, with benchmarks: https://github.com/sugawarayuuta/charcoal/pull/1
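For concreteness, the three-part structure that comment describes looks roughly like this sketch (mine, not from the thread). The per-byte kernel is kept scalar so the shape stays visible; a real SWAR version would replace the body of the middle loop with 8-bytes-at-a-time bit tricks, and the point of the comment is that parts 1 and 3 add instructions that short strings never amortize.

```c
#include <stddef.h>
#include <stdint.h>

static unsigned char lower1(unsigned char c) {
    return (c >= 'A' && c <= 'Z') ? c + 0x20 : c;
}

static void tolower_split(char *s, size_t len) {
    char *p = s, *end = s + len;
    /* part 1: scalar bytes up to the first 8-byte boundary */
    while (p < end && ((uintptr_t)p & 7) != 0) {
        *p = (char)lower1((unsigned char)*p);
        p++;
    }
    /* part 2: whole aligned 8-byte words (stand-in for a SWAR step) */
    while (end - p >= 8) {
        for (int i = 0; i < 8; i++)
            p[i] = (char)lower1((unsigned char)p[i]);
        p += 8;
    }
    /* part 3: the less-than-8-byte tail */
    while (p < end) {
        *p = (char)lower1((unsigned char)*p);
        p++;
    }
}
```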
For anyone interested, here is the author's core loop in ASM, as compiled by GCC:
.L3:
vmovdqu8 zmm0, ZMMWORD PTR [rcx+rax]
vmovdqa64 zmm1, zmm0
vpcmpb k1, zmm0, zmm4, 5
vpcmpb k0, zmm0, zmm3, 2
kandq k1, k1, k0
vpaddb zmm1{k1}, zmm0, zmm2
vmovdqu8 ZMMWORD PTR [rdi+rax], zmm1
add rax, 64
cmp rax, r8
jne .L3
uiCA (CQA/MAQAO) (https://uica.uops.info/, make sure to pick CQA + Ice Lake) says it achieves a nice 32 B/cycle on Ice Lake. If you multiply by, say, 3 to match 3 GHz, this gives us almost 96 GiB/s, assuming memory access is not a bottleneck (it always is in such algorithms). But this seems not as close to optimal utilization as it could be. Using Clang instead yields a much better, nicely unrolled result with better instruction selection.
.LBB0_9:
vmovdqu64 zmm3, zmmword ptr [rsi]
vmovdqu64 zmm5, zmmword ptr [rsi + 64]
vmovdqu64 zmm6, zmmword ptr [rsi + 128]
add rdx, -512
vpaddb zmm4, zmm3, zmm0
vpcmpltub k1, zmm4, zmm1
vpaddb zmm4, zmm5, zmm0
vpaddb zmm3 {k1}, zmm3, zmm2
vpcmpltub k1, zmm4, zmm1
vpaddb zmm4, zmm6, zmm0
vpaddb zmm5 {k1}, zmm5, zmm2
vmovdqu64 zmmword ptr [rcx], zmm3
vpcmpltub k1, zmm4, zmm1
vmovdqu64 zmmword ptr [rcx + 64], zmm5
vmovdqu64 zmm5, zmmword ptr [rsi + 192]
vpaddb zmm6 {k1}, zmm6, zmm2
vmovdqu64 zmmword ptr [rcx + 128], zmm6
vmovdqu64 zmm6, zmmword ptr [rsi + 256]
vpaddb zmm4, zmm5, zmm0
vpcmpltub k1, zmm4, zmm1
vpaddb zmm4, zmm6, zmm0
vpaddb zmm5 {k1}, zmm5, zmm2
vpcmpltub k1, zmm4, zmm1
vmovdqu64 zmmword ptr [rcx + 192], zmm5
vmovdqu64 zmm5, zmmword ptr [rsi + 320]
vpaddb zmm6 {k1}, zmm6, zmm2
vmovdqu64 zmmword ptr [rcx + 256], zmm6
vmovdqu64 zmm6, zmmword ptr [rsi + 384]
vpaddb zmm4, zmm5, zmm0
vpcmpltub k1, zmm4, zmm1
vpaddb zmm4, zmm6, zmm0
vpaddb zmm5 {k1}, zmm5, zmm2
vpcmpltub k1, zmm4, zmm1
vmovdqu64 zmmword ptr [rcx + 320], zmm5
vmovdqu64 zmm5, zmmword ptr [rsi + 448]
vpaddb zmm6 {k1}, zmm6, zmm2
add rsi, 512
vmovdqu64 zmmword ptr [rcx + 384], zmm6
vpaddb zmm4, zmm5, zmm0
vpcmpltub k1, zmm4, zmm1
vpaddb zmm5 {k1}, zmm5, zmm2
vmovdqu64 zmmword ptr [rcx + 448], zmm5
add rcx, 512
cmp rdx, 63
ja .LBB0_9
This extracts a more impressive 42.67 B/c. I don't think even L2 cache can sustain such throughput, but it's nice to know that medium-length strings get up/downcased in about the same time it takes light from your screen to reach your cornea. The core loop for short lengths is one instruction shorter:
.LBB0_5:
vmovdqu64 zmm3, zmmword ptr [rsi + rcx]
vpaddb zmm4, zmm3, zmm0
vpcmpltub k1, zmm4, zmm1
vpaddb zmm3 {k1}, zmm3, zmm2
vmovdqu64 zmmword ptr [rax + rcx], zmm3
add rcx, 64
cmp r8, rcx
jne .LBB0_5
Some months ago I wrote a similar ASCII-in-UTF-8 upcase/downcase implementation in C#: https://github.com/U8String/U8String/blob/main/Sources/U8Str... (the unrolled conversion for below-vectorization lengths is required, as short strings dominate most codebases, so handling them fast is important; the switch compiles to a jump table and then a branchless fall-through to return).
For now it goes only as wide as 256b, as that already saturates e.g. Zen 3 or 4, which have only 256x4 SIMD units (even though Zen 4 can do fancy 512b shuffles natively and has a very good 512b implementation). The core loop compiles to compact code:
G_M48884_IG05:
vmovups ymm3, ymmword ptr [rdi+rax]
vpaddb ymm4, ymm3, ymm1
vpcmpgtb ymm4, ymm2, ymm4
vpand ymm4, ymm4, ymm0
vpor ymm3, ymm4, ymm3
vmovups ymmword ptr [rsi+rax], ymm3
add rax, 32
cmp rax, rdx
jbe SHORT G_M48884_IG05
Side by side with the C ones: https://godbolt.org/z/eTGYhTPan I believe you can also achieve a similar 3-instruction conversion with AVX512 using vpternlogd; when I had access to AVX512 hardware, this is what .NET optimized it to for 256b width + AVX512VL, but strangely enough I can't make it do so for 512b width right now.
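The 256-bit loop above ORs in the case bit instead of adding it: bias each byte so 'A'..'Z' lands at the very bottom of the signed range, do a single signed compare to get a lane-wide mask, AND it with 0x20, and OR into the original. A scalar C rendition of those four steps (my own sketch; it relies on two's-complement signed char, which every mainstream compiler provides):

```c
/* Scalar mirror of the vpaddb / vpcmpgtb / vpand / vpor sequence. */
static unsigned char lower_orbit(unsigned char c) {
    /* vpaddb: bias so 'A' (0x41) wraps to -128 and 'Z' (0x5A) to -103 */
    signed char biased = (signed char)(unsigned char)(c + 0x3F);
    /* vpcmpgtb: one signed compare now isolates 'A'..'Z' */
    unsigned char lanemask = (biased < -102) ? 0xFF : 0x00;
    /* vpand + vpor: turn the lane mask into the 0x20 case bit */
    return c | (lanemask & 0x20);
}
```

Because the bias pushes 'A' exactly to the signed minimum, one comparison covers both ends of the range, which is why the vector loop needs no second compare.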
You may notice a failed SWAR attempt for the switch dispatch case, and I was wondering what kind of license your posts are distributed under? (I gave up on it back then because the per-element fall-through was already fast enough, but if yours passes the test suite, I'd love to use it, haha)
If your work involves some process whose timely completion depends on how fast ASCII tolower executes, you'd better shake things up somehow and change the ground rules.
Using SIMD for making text into … uwu.
Shame most programmers only care about ASCII. There is a whole world that exists outside of the standard [a-z,A-Z,0-9] character set.
Well, supposedly RISC-V implementations will have none of this malarkey while still rivaling x64/ARM64 in processing speed at comparable technology/clock rates/prices, just with plain old loads-and-xors-and-stores?