July 12th, 2024

Summing ASCII encoded integers on Haswell at almost the speed of memcpy

Matt Stuchlik presents a high-performance algorithm for summing ASCII-encoded integers on Haswell systems. It utilizes SIMD instructions, lookup tables, and efficient operations to achieve speed enhancements, showcasing innovative approaches in integer sum calculations.

Read original articleLink Icon
Summing ASCII encoded integers on Haswell at almost the speed of memcpy

This post by Matt Stuchlik discusses a high-performance solution for summing ASCII-encoded integers on a Haswell system. The algorithm processes 32-byte chunks of input using SIMD instructions, tracking the sum of digits in each decimal place. By leveraging lookup tables and specific input distributions, the algorithm achieves significant speed improvements over traditional methods. The implementation details involve careful handling of input bytes, newline detection, and efficient shuffling operations. The post hints at novel techniques for initializing lookup tables and optimizing memory access. While the algorithm is tailored to specific system specifications and input formats, it showcases innovative approaches to accelerating integer sum calculations. The post provides a glimpse into the complexities of high-speed computation and hints at further optimizations and insights to be explored in future discussions.

Related

Own Constant Folder in C/C++

Own Constant Folder in C/C++

Neil Henning discusses precision issues in clang when using the sqrtps intrinsic with -ffast-math, suggesting inline assembly for instruction selection. He introduces a workaround using __builtin_constant_p for constant folding optimization, enhancing code efficiency.

The Byte Order Fiasco

The Byte Order Fiasco

Handling endianness in C/C++ programming poses challenges, emphasizing correct integer deserialization to prevent undefined behavior. Adherence to the C standard is crucial to avoid unexpected compiler optimizations. Code examples demonstrate proper deserialization techniques using masking and shifting for system compatibility. Mastery of these concepts is vital for robust C code, despite available APIs for byte swapping.

Do not taunt happy fun branch predictor

Do not taunt happy fun branch predictor

The author shares insights on optimizing AArch64 assembly code by reducing jumps in loops. Replacing ret with br x30 improved performance, leading to an 8.8x speed increase. Considerations on branch prediction and SIMD instructions are discussed.

Scan HTML faster with SIMD instructions: .NET/C# Edition

Scan HTML faster with SIMD instructions: .NET/C# Edition

WebKit and Chromium enhance HTML content scanning with fast SIMD routines, boosting performance significantly. .NET8 now supports speedy SIMD instructions for C#, achieving impressive speeds comparable to C/C++ implementations.

Beating the Compiler

Beating the Compiler

The blog post discusses optimizing interpreters in assembly to outperform compilers. By enhancing the Uxn CPU interpreter, a 10-20% speedup was achieved through efficient assembly implementations and techniques inspired by LuaJIT.

Link Icon 6 comments
By @ashleyn - 6 months
Knew it'd be SIMD. Such an underrated feature of modern CPUs. Hopefully with cross-platform SIMD in Rust and Golang, it'll be more commonly used.

Thinking parallel gets you enormous speed benefits for any number of arbitrary algorithms: https://mcyoung.xyz/2023/11/27/simd-base64/

By @dist1ll - 6 months
First time I hear about HighLoad. Seems really interesting to me on the first glance. I personally find SIMD and ISA/μarch-specific optimizations more rewarding than pure algorithmic challenges (codeforces and such).

Though Haswell seems like a pretty obsolete platform to optimize for at this point. Even Skylake will be a decade old next year.

By @wolf550e - 6 months
I think the trick with dereferencing unmapped memory is cool, but I only really care about techniques that work reliably and I can use in production.
By @raldi - 6 months
Is there an explanation of why it sometimes gives the wrong answer?