July 28th, 2024

tolower() with AVX-512

Tony Finch's blog post details the implementation of the tolower() function using AVX-512-BW SIMD instructions, optimizing string processing and outperforming standard methods, particularly for short strings.

Tony Finch's blog post discusses the implementation of the tolower() function using AVX-512-BW SIMD instructions, focusing on optimizing string processing. Finch references previous work on using SWAR techniques for bulk string conversion and highlights Olivier Giniaux's article on handling small strings with SIMD for fast hash functions. He notes the challenges of transferring short strings between memory and vector registers and explores the potential of masked loads and stores in AVX-512-BW, which is available on recent AMD Zen processors.

The post details the creation of a tolower64() function that processes 64 bytes at once, utilizing Intel intrinsics for efficient byte-wise operations. Finch explains the use of masked operations to handle both long and short strings effectively, emphasizing the performance benefits of AVX-512-BW. Benchmarking results show that his implementation consistently outperforms other methods, including standard tolower() functions and various memcpy versions.
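The conversion kernel can be sketched with AVX-512-BW intrinsics roughly as follows (a minimal illustration of the approach described above, not the author's exact code; the name tolower64 follows the post's description):

    #include <immintrin.h>

    // Lowercase 64 ASCII bytes at once: compare each byte against 'A'..'Z'
    // to build a 64-bit mask, then do a masked add of ('a' - 'A').
    static inline __m512i tolower64(__m512i c) {
        __m512i upper_a  = _mm512_set1_epi8('A');
        __m512i upper_z  = _mm512_set1_epi8('Z');
        __m512i to_lower = _mm512_set1_epi8('a' - 'A');
        __mmask64 is_upper = _mm512_cmpge_epu8_mask(c, upper_a)
                           & _mm512_cmple_epu8_mask(c, upper_z);
        // Where is_upper is set, take c + to_lower; elsewhere copy c through.
        return _mm512_mask_add_epi8(c, is_upper, c, to_lower);
    }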

Finch concludes that AVX-512-BW is advantageous for string manipulation, particularly for short strings, and expresses a desire for wider availability of such instruction set extensions to enhance string handling performance. He also mentions the potential of ARM SVE for similar applications, although he has not tested it personally. The complete code for the discussed implementations is available on his website, inviting feedback from readers.

Related

Do not taunt happy fun branch predictor

The author shares insights on optimizing AArch64 assembly code by reducing jumps in loops. Replacing ret with br x30 improved performance, leading to an 8.8x speed increase. Considerations on branch prediction and SIMD instructions are discussed.

Scan HTML faster with SIMD instructions: .NET/C# Edition

WebKit and Chromium enhance HTML content scanning with fast SIMD routines, boosting performance significantly. .NET8 now supports speedy SIMD instructions for C#, achieving impressive speeds comparable to C/C++ implementations.

Summing ASCII encoded integers on Haswell at almost the speed of memcpy

Matt Stuchlik presents a high-performance algorithm for summing ASCII-encoded integers on Haswell systems. It utilizes SIMD instructions, lookup tables, and efficient operations to achieve speed enhancements, showcasing innovative approaches in integer sum calculations.

Scan HTML even faster with SIMD instructions (C++ and C#)

Major web engines like WebKit/Safari and Chromium/Chrome/Edge/Brave have implemented accelerated HTML parsing using SIMD instructions for faster performance. Results show significant speed improvements, especially with 64-bit vectorized classification methods, promising enhanced HTML parsing efficiency.

Counting Bytes Faster Than You'd Think Possible

Matt Stuchlik's high-performance computing method counts bytes with a value of 127 in a 250MB stream, achieving 550 times faster performance using SIMD instructions and an innovative memory read pattern.

AI: What people are saying
The comments on Tony Finch's blog post about the tolower() function implementation using AVX-512-BW SIMD instructions reveal several key themes and discussions among readers.
  • Concerns about undefined behavior in Rust and LLVM memory models, particularly regarding unsafe memory access.
  • Speculation on the competition between AMD's AVX512 and Intel's upcoming AVX10, with a focus on performance and adoption issues.
  • Discussion of SWAR optimizations and their limitations, especially regarding string alignment and performance implications.
  • Interest in the handling of Unicode characters and the broader implications of string processing beyond ASCII.
  • Technical insights into assembly code and performance comparisons between different compilers and architectures.
13 comments
By @anderskaseorg - 4 months
Note that the “unsafe read beyond of death” trick is considered undefined behavior in the Rust and LLVM memory model, even if it’s allowed by the underlying hardware. Like any undefined behavior, compilers are allowed to assume it doesn’t happen for the purpose of optimization, leading to results you don’t expect. The only way around this is to use inline assembly.

https://github.com/ogxd/gxhash/issues/82
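(For comparison, the masked loads the article relies on avoid the problem at the ISA level: masked-off bytes of an AVX-512 load are not accessed, so no out-of-bounds read occurs. A minimal sketch, assuming AVX-512-BW plus BMI2; load_at_most_64 is a hypothetical helper name, not from the article:)

    #include <immintrin.h>
    #include <stddef.h>

    // Load up to 64 bytes without touching memory past src + len.
    // _bzhi_u64 builds a mask with the low `len` bits set; masked-off lanes
    // of the load are not accessed, so nothing past the buffer is read.
    static inline __m512i load_at_most_64(const char *src, size_t len) {
        __mmask64 mask = _bzhi_u64(~0ull, (unsigned)(len < 64 ? len : 64));
        return _mm512_maskz_loadu_epi8(mask, src);
    }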

By @Remnant44 - 4 months
Neat, performant code like that in the article makes me very curious how the competition will shake out between AMD's AVX-512 implementation and Intel's upcoming AVX10. The entire point of AVX10 seems to be to resolve Intel's P-core vs. E-core situation, while AMD seems to have taken a better approach of using either full width (Zen 5) or double-pumped 256-bit (Zen 4, Zen 5 mobile) as appropriate to the situation, while making the API seamless.

The big gains delivered in the article are all on a double-pumped Zen 4 core! AVX-512 brings a lot to the table, so it's quite frustrating that Intel market-segmented support for it so heavily as to completely inhibit its adoption in broad-based client code.

By @h2odragon - 4 months
By @LelouBil - 4 months
> Finally, we do a masked add. We pass c twice: bytes from the first c are copied to the result when is_upper is false, and when is_upper is true the result is c + to_upper.

Is that an error in the post? Shouldn't it do the addition when is_upper is false and copy it unchanged when it is true?
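(For reference, a model of the masked add's documented per-byte behavior is sketched below: src is copied where the mask bit is clear, and a + b is taken where it is set.)

    #include <stdint.h>

    // Reference model of _mm512_mask_add_epi8(src, k, a, b), one byte per lane.
    static void mask_add_epi8_model(uint8_t dst[64], const uint8_t src[64],
                                    uint64_t k, const uint8_t a[64],
                                    const uint8_t b[64]) {
        for (int i = 0; i < 64; i++)
            dst[i] = ((k >> i) & 1) ? (uint8_t)(a[i] + b[i]) : src[i];
    }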

By @dolmen - 4 months
Unfortunately those SWAR optimizations are only useful for strings that are aligned on an 8-byte address.

If your SWAR algorithm is applied to a non-aligned string, it is often slower than the original algorithm.

And splitting the algorithm into 3 parts (handling the beginning up to an aligned address, then the aligned part, and then the less-than-8-bytes tail) takes even more instructions.

Here is a similar case on a false claim of a faster utf8.IsValid in Go, with benchmarks: https://github.com/sugawarayuuta/charcoal/pull/1
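(For context, SWAR, "SIMD within a register", does lane-wise byte operations with ordinary 64-bit integer arithmetic. A sketch of the general branch-free tolower() over one 8-byte word, the generic technique rather than code from either linked article:)

    #include <stdint.h>

    // SWAR tolower() over 8 bytes at a time (general technique). The 8-byte
    // load itself must still be done safely, e.g. via memcpy on an aligned
    // or length-checked buffer, which is exactly the alignment issue above.
    static inline uint64_t swar_tolower8(uint64_t octets) {
        uint64_t all   = 0x0101010101010101ull;
        uint64_t hept  = octets & (0x7F * all);        // clear each byte's top bit
        uint64_t ge_A  = hept + (0x80 - 'A') * all;    // bit 7 set if byte >= 'A'
        uint64_t gt_Z  = hept + (0x7F - 'Z') * all;    // bit 7 set if byte >  'Z'
        uint64_t upper = (ge_A ^ gt_Z) & ~octets & (0x80 * all);
        return octets | (upper >> 2);                  // set the 0x20 bit on 'A'..'Z'
    }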

By @neonsunset - 4 months
Mask add looks neat! I wish there were a way to directly manipulate AVX-512's mask registers in .NET intrinsics, but for now we have to live with "recognized idioms".

For anyone interested, here is the author's core loop in ASM, as compiled by GCC:

    .L3:
        vmovdqu8        zmm0, ZMMWORD PTR [rcx+rax]
        vmovdqa64       zmm1, zmm0
        vpcmpb  k1, zmm0, zmm4, 5
        vpcmpb  k0, zmm0, zmm3, 2
        kandq   k1, k1, k0
        vpaddb  zmm1{k1}, zmm0, zmm2
        vmovdqu8        ZMMWORD PTR [rdi+rax], zmm1
        add     rax, 64
        cmp     rax, r8
        jne     .L3
uiCA (CQA/MAQAO) (https://uica.uops.info/, make sure to pick CQA + Ice Lake) says it achieves a nice 32 B/cycle on Ice Lake. If you multiply by, say, 3 to match 3 GHz, this gives us about 96 GB/s, assuming memory access is not a bottleneck (it always is in such algorithms).

But this seems not as close to optimal utilization as it could be. Using Clang instead yields a much better, nicely unrolled result with better instruction selection.

    .LBB0_9:
        vmovdqu64       zmm3, zmmword ptr [rsi]
        vmovdqu64       zmm5, zmmword ptr [rsi + 64]
        vmovdqu64       zmm6, zmmword ptr [rsi + 128]
        add     rdx, -512
        vpaddb  zmm4, zmm3, zmm0
        vpcmpltub       k1, zmm4, zmm1
        vpaddb  zmm4, zmm5, zmm0
        vpaddb  zmm3 {k1}, zmm3, zmm2
        vpcmpltub       k1, zmm4, zmm1
        vpaddb  zmm4, zmm6, zmm0
        vpaddb  zmm5 {k1}, zmm5, zmm2
        vmovdqu64       zmmword ptr [rcx], zmm3
        vpcmpltub       k1, zmm4, zmm1
        vmovdqu64       zmmword ptr [rcx + 64], zmm5
        vmovdqu64       zmm5, zmmword ptr [rsi + 192]
        vpaddb  zmm6 {k1}, zmm6, zmm2
        vmovdqu64       zmmword ptr [rcx + 128], zmm6
        vmovdqu64       zmm6, zmmword ptr [rsi + 256]
        vpaddb  zmm4, zmm5, zmm0
        vpcmpltub       k1, zmm4, zmm1
        vpaddb  zmm4, zmm6, zmm0
        vpaddb  zmm5 {k1}, zmm5, zmm2
        vpcmpltub       k1, zmm4, zmm1
        vmovdqu64       zmmword ptr [rcx + 192], zmm5
        vmovdqu64       zmm5, zmmword ptr [rsi + 320]
        vpaddb  zmm6 {k1}, zmm6, zmm2
        vmovdqu64       zmmword ptr [rcx + 256], zmm6
        vmovdqu64       zmm6, zmmword ptr [rsi + 384]
        vpaddb  zmm4, zmm5, zmm0
        vpcmpltub       k1, zmm4, zmm1
        vpaddb  zmm4, zmm6, zmm0
        vpaddb  zmm5 {k1}, zmm5, zmm2
        vpcmpltub       k1, zmm4, zmm1
        vmovdqu64       zmmword ptr [rcx + 320], zmm5
        vmovdqu64       zmm5, zmmword ptr [rsi + 448]
        vpaddb  zmm6 {k1}, zmm6, zmm2
        add     rsi, 512
        vmovdqu64       zmmword ptr [rcx + 384], zmm6
        vpaddb  zmm4, zmm5, zmm0
        vpcmpltub       k1, zmm4, zmm1
        vpaddb  zmm5 {k1}, zmm5, zmm2
        vmovdqu64       zmmword ptr [rcx + 448], zmm5
        add     rcx, 512
        cmp     rdx, 63
        ja      .LBB0_9
This extracts a more impressive 42.67 B/cycle. I don't think even the L2 cache can sustain such throughput, but it's nice to know that medium-length strings get up/downcased in about the same time it takes light from your screen to reach your cornea.

The core loop for short lengths is one instruction shorter:

    .LBB0_5:
        vmovdqu64       zmm3, zmmword ptr [rsi + rcx]
        vpaddb  zmm4, zmm3, zmm0
        vpcmpltub       k1, zmm4, zmm1
        vpaddb  zmm3 {k1}, zmm3, zmm2
        vmovdqu64       zmmword ptr [rax + rcx], zmm3
        add     rcx, 64
        cmp     r8, rcx
        jne     .LBB0_5
Some months ago I wrote a similar ASCII in UTF-8 upcase/downcase implementation in C#: https://github.com/U8String/U8String/blob/main/Sources/U8Str...

(The unrolled conversion for lengths below the vectorization threshold is required, as short strings dominate most codebases, so handling them fast is important - the switch compiles to a jump table and then a branchless fall-through to return.)

For now it goes only as wide as 256 bits, since that already saturates e.g. Zen 3 or Zen 4, which have only four 256-bit SIMD units (even though Zen 4 can do fancy 512-bit shuffles natively and has a very good 512-bit implementation). The core loop compiles to a compact:

       G_M48884_IG05:
              vmovups  ymm3, ymmword ptr [rdi+rax]
              vpaddb   ymm4, ymm3, ymm1
              vpcmpgtb ymm4, ymm2, ymm4
              vpand    ymm4, ymm4, ymm0
              vpor     ymm3, ymm4, ymm3
              vmovups  ymmword ptr [rsi+rax], ymm3
              add      rax, 32
              cmp      rax, rdx
              jbe      SHORT G_M48884_IG05
Side by side with C ones: https://godbolt.org/z/eTGYhTPan

I believe you can also achieve a similar 3-instruction conversion with AVX-512's vpternlogd; when I had access to AVX-512 hardware, this is what .NET optimized it to for 256-bit width + AVX512VL, but strangely enough I can't make it do so for 512-bit width right now.

You may notice a failed SWAR attempt for the switch-dispatch case. I was wondering what kind of license your posts are distributed under? (I gave up on it back then because the per-element fall-through was already fast enough, but if yours passes the test suite, I'd love to use it, haha.)

By @ipunchghosts - 4 months
I'm lost, what is SWAR?
By @kazinator - 4 months
With the advent of Unicode, the concept of upper and lowercase is a quagmire. You need a whole bunch of data to do it right.

If your work involves some process whose timely completion depends on how fast ASCII tolower() executes, you'd better shake things up somehow and change the ground rules.

By @rowanG077 - 4 months
In the past I have added a black border around an image to completely avoid the SIMD read-beyond-the-buffer problem. It worked very well and allowed us to beat some OpenCV implementations in terms of speed. But you don't always have full control over the input like that.
By @kolbe - 4 months
Did you try something like this? The autovectorizer looks pretty clean to me.

https://godbolt.org/z/1c5joKK5n

By @bonyt - 4 months
See also, https://github.com/Daniel-Liu-c0deb0t/uwu

Using SIMD for making text into … uwu.

By @xyst - 4 months
Now account for Unicode characters (i.e., Â -> â) and I'll be impressed.

It's a shame most programmers only care about ASCII. There is a whole world that exists outside the standard [a-z, A-Z, 0-9] character set.

By @Joker_vD - 4 months
...you know, while I personally think that the RISC approach was an honest mistake, stuff like this makes me see why some people wanted to get rid of complex instructions.

Well, supposedly RISC-V implementations will have none of this malarkey while still rivaling x64/ARM64 in processing speed at comparable technology/clock rates/prices, just with plain old loads-and-xors-and-stores?