July 1st, 2024

Using SIMD for Parallel Processing in Rust

SIMD is vital for performance in Rust. Options include auto-vectorization, platform-specific intrinsics, and the experimental std::simd module. Balancing performance, portability, and ease of use is key. Leveraging auto-vectorization and intrinsics helps optimize Rust projects for high-performance computing, multimedia, systems programming, and cryptography.


SIMD (Single Instruction, Multiple Data) is a crucial tool for enhancing performance in data-intensive operations, with applications across high-performance computing, multimedia processing, systems programming, embedded systems, and cryptography. Rust offers several avenues for SIMD development: auto-vectorization by the compiler, platform-specific intrinsics through std::arch, and the experimental portable SIMD module std::simd. These approaches trade off performance, portability, and ease of use.

On stable Rust, practical SIMD comes down to two techniques. Auto-vectorization lets the compiler transform ordinary loops into SIMD instructions; it is the simplest route, and developers should favor clear, vectorization-friendly code and validate gains through benchmarking. Platform-specific intrinsics, such as ARM NEON on ARM architectures, give direct access to SIMD instructions for maximum performance on targeted CPUs. Whichever approach is chosen, developers should weigh data alignment, portability, code complexity, and testing when implementing SIMD in Rust projects.
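As a concrete illustration of the auto-vectorization approach, here is a minimal sketch (the function name and shapes are illustrative, not from the article): a tight loop over zipped slices with no data-dependent branches, a shape rustc/LLVM can typically turn into SIMD instructions.

```rust
// A loop written so the compiler can auto-vectorize it: fixed stride,
// no bounds checks inside the body, no data-dependent control flow.
fn saxpy(a: f32, x: &[f32], y: &mut [f32]) {
    // Zipping the slices lets the compiler hoist the bounds checks
    // and emit packed SIMD adds/multiplies for the loop body.
    for (yi, xi) in y.iter_mut().zip(x.iter()) {
        *yi = a * *xi + *yi;
    }
}

fn main() {
    let x = vec![1.0_f32; 8];
    let mut y = vec![2.0_f32; 8];
    saxpy(3.0, &x, &mut y); // each element becomes 3.0 * 1.0 + 2.0 = 5.0
    assert!(y.iter().all(|&v| (v - 5.0).abs() < 1e-6));
}
```

Whether vectorization actually happens depends on optimization level and target features, which is why benchmarking (or inspecting the generated assembly) is the only reliable validation.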

9 comments
By @oconnor663 - 5 months
There are a lot of factors that go into how fast a hash function is, but the case we're showing in the big red chart at https://github.com/BLAKE3-team/BLAKE3 is almost entirely driven by SIMD. It's a huge deal.
By @ww520 - 5 months
Zig actually has a very nice abstraction for SIMD in the form of vector programming. The size of the vector is agnostic to the underlying CPU architecture. The compiler or LLVM will generate code using 128-, 256-, or 512-bit SIMD registers. And you are just programming straight vectors.
By @thomashabets2 - 5 months
The portable SIMD is quite nice. We can't really trust a "sufficiently smart compiler" to make the best SIMD decisions, since it may not see through what you're actually doing.

https://blog.habets.se/2024/04/Rust-is-faster-than-C.html and code at https://github.com/ThomasHabets/zipbrute/blob/master/rust/sr... showed me getting 3x faster using portable SIMD, on my first attempt.
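Since std::simd (portable SIMD) still requires nightly, the stable-Rust route to explicit SIMD that the article describes is std::arch intrinsics with runtime feature detection. A hedged sketch (the helper names are illustrative, not from the linked code): an SSE2 horizontal sum with a scalar fallback for other architectures.

```rust
// Explicit SIMD on stable Rust via std::arch, dispatched at runtime.
#[cfg(target_arch = "x86_64")]
fn sum_f32(v: &[f32]) -> f32 {
    if is_x86_feature_detected!("sse2") {
        // Safe to call: we just verified SSE2 is available.
        unsafe { sum_sse2(v) }
    } else {
        v.iter().sum()
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn sum_sse2(v: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    let chunks = v.chunks_exact(4);
    let rem = chunks.remainder();
    let mut acc = _mm_setzero_ps();
    for c in chunks {
        // Unaligned load of four lanes, then a vertical packed add.
        acc = _mm_add_ps(acc, _mm_loadu_ps(c.as_ptr()));
    }
    // Spill the accumulator and finish horizontally in scalar code,
    // including the tail that didn't fill a full 4-lane chunk.
    let mut lanes = [0.0_f32; 4];
    _mm_storeu_ps(lanes.as_mut_ptr(), acc);
    lanes.iter().sum::<f32>() + rem.iter().sum::<f32>()
}

#[cfg(not(target_arch = "x86_64"))]
fn sum_f32(v: &[f32]) -> f32 {
    v.iter().sum()
}

fn main() {
    let data: Vec<f32> = (1..=10).map(|i| i as f32).collect();
    assert_eq!(sum_f32(&data), 55.0);
}
```

The cost of this control is exactly the portability trade-off discussed above: each target ISA needs its own intrinsic path, where portable SIMD would compile one description to all of them.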

By @nbrempel - 5 months
Thanks for reading everyone. I’ve gotten some feedback over on Reddit as well that the example is not effectively showing the benefits of SIMD. I plan on revising this.

One of my goals of writing these articles is to learn so feedback is more than welcome!

By @eachro - 5 months
It's cool that SIMD primitives exist in the std lib of Rust. I've wanted to mess around a bit more with SIMD in Python but I don't think that native support exists. Or you have to go down to C/C++ bindings to actually mess around with it (last I checked at least, please correct me if I'm wrong).
By @anonymousDan - 5 months
The interesting question for me is whether Rust makes it easier for the compiler to extract SIMD parallelism automatically given the restrictions imposed by its type system.
By @IshKebab - 5 months
Minor nit: RISC-V Vector isn't SIMD. It's actually like ARM's Scalable Vector Extension. Unlike traditional SIMD the code is agnostic to the register width and different hardware can run the same code with different widths.

There is also a traditional SIMD extension (P I think?) but it isn't finished. Most focus has been on the vector extension.

I am wondering how and if Rust will support these vector processing extensions.

By @brundolf - 5 months
std::simd is a delight. I'd never done SIMD before in any language, and it was very easy and natural (and safe!) to introduce to my code, and just automatically works cross-platform. Can't recommend it enough
By @neonsunset - 5 months
If you like SIMD and would like to dabble in it, I can strongly recommend trying it out in C# via its platform-agnostic SIMD abstraction. It is very accessible especially if you already know a little bit of C or C++, and compiles to very competent codegen for AdvSimd, SSE2/4.2/AVX1/2/AVX512, WASM's Packed SIMD and, in .NET 9, SVE1/2:

https://github.com/dotnet/runtime/blob/main/docs/coding-guid...

Here's an example of "checked" sum over a span of integers that uses platform-specific vector width:

https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...

Other examples:

CRC64 https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...

Hamming distance https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...

Default syntax is a bit ugly in my opinion, but it can be significantly improved with helper methods like here where the code is a port of simdutf's UTF-8 code point counting: https://github.com/U8String/U8String/blob/main/Sources/U8Str...

There are more advanced scenarios. Bepuphysics2 engine heavily leverages SIMD to perform as fast as PhysX's CPU back-end: https://github.com/bepu/bepuphysics2/blob/master/BepuPhysics...

Note that practically none of these need to reach out to platform-specific intrinsics (except for replacing movemask emulation with efficient ARM64 alternative) and use the same path for all platforms, varied by vector width rather than specific ISA.