July 1st, 2024

Using SIMD for Parallel Processing in Rust

SIMD can deliver large performance gains in Rust. The options are compiler auto-vectorization, platform-specific intrinsics, and the std::simd module, each with a different balance of performance, portability, and ease of use. Auto-vectorization and intrinsics are the practical choices for optimizing Rust projects in high-performance computing, multimedia, systems programming, and cryptography.

SIMD (Single Instruction, Multiple Data) applies one instruction to several data elements at once, which makes it a key tool for speeding up data-intensive operations. Rust offers three routes to SIMD: auto-vectorization by the compiler, platform-specific intrinsics through std::arch, and the experimental std::simd module, which is currently available only on nightly. These approaches trade off performance, portability, and ease of use. On stable Rust, the practical techniques are auto-vectorization and platform-specific intrinsics.

Auto-vectorization lets the compiler transform loops into SIMD instructions without any change to the source. It is the simplest option, but the result depends on the compiler recognizing the pattern, so developers should write clear, simple loops and benchmark to confirm the gain. Platform-specific intrinsics, such as ARM NEON on ARM architectures, give direct access to a CPU's SIMD instructions and offer the most control and the highest performance, at the cost of portability.

SIMD benefits high-performance computing, multimedia processing, systems programming, embedded systems, and cryptography. When adding SIMD to a Rust project, developers should weigh data alignment, portability, added complexity, and testing.
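To make the two stable-Rust approaches concrete, here is a minimal sketch that is not taken from the original article. The function names `add_scalar` and `add_simd` are hypothetical. The first is a plain loop of the kind the compiler can often auto-vectorize in optimized builds; the second uses SSE intrinsics from std::arch, assuming an x86_64 target where SSE is part of the baseline, and falls back to the scalar loop elsewhere.

```rust
/// Element-wise addition as a plain loop over zipped slices.
/// Iterating with `zip` avoids per-index bounds checks, which tends to
/// help the compiler auto-vectorize this loop in optimized builds.
pub fn add_scalar(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert!(a.len() == b.len() && a.len() == out.len());
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}

/// The same operation with explicit SSE intrinsics (4 x f32 per step).
/// Assumption: x86_64 target, where SSE is always available.
#[cfg(target_arch = "x86_64")]
pub fn add_simd(a: &[f32], b: &[f32], out: &mut [f32]) {
    use std::arch::x86_64::{_mm_add_ps, _mm_loadu_ps, _mm_storeu_ps};

    assert!(a.len() == b.len() && a.len() == out.len());
    const LANES: usize = 4;
    let main = a.len() - a.len() % LANES;

    let mut i = 0;
    while i < main {
        // SAFETY: i + LANES <= main <= len for all three slices, and the
        // unaligned load/store variants place no alignment requirement
        // on the pointers.
        unsafe {
            let va = _mm_loadu_ps(a.as_ptr().add(i));
            let vb = _mm_loadu_ps(b.as_ptr().add(i));
            _mm_storeu_ps(out.as_mut_ptr().add(i), _mm_add_ps(va, vb));
        }
        i += LANES;
    }

    // Handle the remaining 0..=3 elements with the scalar loop.
    add_scalar(&a[main..], &b[main..], &mut out[main..]);
}

/// Portable fallback for targets other than x86_64.
#[cfg(not(target_arch = "x86_64"))]
pub fn add_simd(a: &[f32], b: &[f32], out: &mut [f32]) {
    add_scalar(a, b, out);
}

fn main() {
    let a: Vec<f32> = (0..10).map(|i| i as f32).collect();
    let b = vec![1.0f32; 10];
    let mut scalar = vec![0.0f32; 10];
    let mut simd = vec![0.0f32; 10];

    add_scalar(&a, &b, &mut scalar);
    add_simd(&a, &b, &mut simd);

    // Both paths must agree, including the two-element tail.
    assert_eq!(scalar, simd);
    println!("{:?}", simd);
}
```

Whether the intrinsic version actually beats the plain loop depends on the compiler, flags, and data size, so benchmarking both is the only reliable check. Wider instruction sets such as AVX2 are not part of the x86_64 baseline and would need runtime detection (for example with `is_x86_feature_detected!`) before use.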

Binrw

The tool binrw simplifies binary parsing and serialization with a declarative approach, offering readability and maintainability. It supports common tasks, generics, custom parsers, and predefined types, and is safe for various environments.

My experience crafting an interpreter with Rust (2021)

Manuel Cerón details creating an interpreter with Rust, transitioning from Clojure. Leveraging Rust's safety features, he faced challenges with closures and classes, optimizing code for performance while balancing safety.

Own Constant Folder in C/C++

Neil Henning discusses precision issues in clang when using the sqrtps intrinsic with -ffast-math, suggesting inline assembly for instruction selection. He introduces a workaround using __builtin_constant_p for constant folding optimization, enhancing code efficiency.

Download Accelerator – Async Rust Edition

This post explores creating a download accelerator with async Rust, emphasizing its advantages over traditional methods. It demonstrates improved file uploads to Amazon S3 and provides code for parallel downloads.

The Inconceivable Types of Rust: How to Make Self-Borrows Safe

The article addresses Rust's limitations on self-borrows, proposing solutions like named lifetimes and inconceivable types to improve support for async functions. Enhancing Rust's type system is crucial for advanced features.

9 comments

By @oconnor663 - 10 months

There are a lot of factors that go into how fast a hash function is, but the case we're showing in the big red chart at https://github.com/BLAKE3-team/BLAKE3 is almost entirely driven by SIMD. It's a huge deal.

By @ww520 - 10 months

Zig actually has a very nice abstraction for SIMD in the form of vector programming. The size of the vector is agnostic to the underlying cpu architecture. The compiler or LLVM will generate code for using SIMD128, 256, or 512 registers. And you are just programming straight vectors.

By @thomashabets2 - 10 months

The portable SIMD is quite nice. We can't really trust a "sufficiently smart compiler" to make the best SIMD decisions, since it may not see through what you're actually doing.

https://blog.habets.se/2024/04/Rust-is-faster-than-C.html and code at https://github.com/ThomasHabets/zipbrute/blob/master/rust/sr... showed me getting 3x faster using portable SIMD, on my first attempt.

By @nbrempel - 10 months

Thanks for reading everyone. I’ve gotten some feedback over on Reddit as well that the example is not effectively showing the benefits of SIMD. I plan on revising this.

One of my goals of writing these articles is to learn so feedback is more than welcome!

By @eachro - 10 months

It's cool that simd primitives exist in the std lib of rust. I've wanted to mess around a bit more with simd in python but I don't think that native support exists. Or you have to go down to C/C++ bindings to actually mess around with it (last I checked at least, please correct me if I'm wrong).

By @anonymousDan - 10 months

The interesting question for me is whether Rust makes it easier for the compiler to extract SIMD parallelism automatically given the restrictions imposed by its type system.

By @IshKebab - 10 months

Minor nit: RISC-V Vector isn't SIMD. It's actually like ARM's Scalable Vector Extension. Unlike traditional SIMD the code is agnostic to the register width and different hardware can run the same code with different widths.

There is also a traditional SIMD extension (P I think?) but it isn't finished. Most focus has been on the vector extension.

I am wondering how and if Rust will support these vector processing extensions.

By @brundolf - 10 months

std::simd is a delight. I'd never done SIMD before in any language, and it was very easy and natural (and safe!) to introduce to my code, and just automatically works cross-platform. Can't recommend it enough

By @neonsunset - 10 months

If you like SIMD and would like to dabble in it, I can strongly recommend trying it out in C# via its platform-agnostic SIMD abstraction. It is very accessible especially if you already know a little bit of C or C++, and compiles to very competent codegen for AdvSimd, SSE2/4.2/AVX1/2/AVX512, WASM's Packed SIMD and, in .NET 9, SVE1/2:

https://github.com/dotnet/runtime/blob/main/docs/coding-guid...

Here's an example of "checked" sum over a span of integers that uses platform-specific vector width:

https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...

Other examples:

CRC64 https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...

Hamming distance https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...

Default syntax is a bit ugly in my opinion, but it can be significantly improved with helper methods like here where the code is a port of simdutf's UTF-8 code point counting: https://github.com/U8String/U8String/blob/main/Sources/U8Str...

There are more advanced scenarios. Bepuphysics2 engine heavily leverages SIMD to perform as fast as PhysX's CPU back-end: https://github.com/bepu/bepuphysics2/blob/master/BepuPhysics...

Note that practically none of these need to reach out to platform-specific intrinsics (except for replacing movemask emulation with efficient ARM64 alternative) and use the same path for all platforms, varied by vector width rather than specific ISA.
