Counting Bytes Faster Than You'd Think Possible
Matt Stuchlik's high-performance solution counts the bytes with value 127 in a 250MB stream roughly 550 times faster than a naive implementation, using SIMD instructions and an unusual memory read pattern.
Matt Stuchlik's recent exploration in high-performance computing focuses on counting the number of bytes with a value of 127 in a 250MB byte stream. His solution, approximately 550 times faster than a naive implementation, runs on an Intel Xeon E3-1271 v3 processor. The challenge is to read the byte stream and count occurrences of the target byte as efficiently as possible. Stuchlik's method employs SIMD (Single Instruction, Multiple Data) instructions to process data in chunks, leveraging AVX2 for performance gains.
A key innovation in his approach is the use of a memory read pattern that interleaves the processing of multiple 4K pages, which enhances data transfer rates by up to 30% in memory-bound scenarios. This technique takes advantage of the processor's hardware prefetchers, particularly the "Streamer," which can maintain multiple streams of data access. By interleaving access and unrolling the processing kernel, Stuchlik achieves significant performance improvements.
The implementation includes a series of assembly instructions that efficiently compare and count the target byte while managing potential overflow in accumulators. The final count is derived from a combination of narrow and wide accumulators, ensuring accuracy. Stuchlik concludes by noting the under-discussed nature of the page-interleaved read pattern and invites feedback on further memory optimization techniques.
Related
Do not taunt happy fun branch predictor
The author shares insights on optimizing AArch64 assembly code by reducing jumps in loops. Replacing ret with br x30 improved performance, leading to an 8.8x speed increase. Considerations on branch prediction and SIMD instructions are discussed.
Scan HTML faster with SIMD instructions: .NET/C# Edition
WebKit and Chromium enhance HTML content scanning with fast SIMD routines, boosting performance significantly. .NET8 now supports speedy SIMD instructions for C#, achieving impressive speeds comparable to C/C++ implementations.
Beating the L1 cache with value speculation (2021)
Value speculation leverages branch predictor to guess values, enhancing instruction parallelism and L1 cache efficiency. Demonstrated on Xeon E5-1650 v3, it boosts throughput from 14GB/s to 30GB/s by predicting linked list nodes.
Summing ASCII encoded integers on Haswell at almost the speed of memcpy
Matt Stuchlik presents a high-performance algorithm for summing ASCII-encoded integers on Haswell systems. It utilizes SIMD instructions, lookup tables, and efficient operations to achieve speed enhancements, showcasing innovative approaches in integer sum calculations.
Beating the Compiler
The blog post discusses optimizing interpreters in assembly to outperform compilers. By enhancing the Uxn CPU interpreter, a 10-20% speedup was achieved through efficient assembly implementations and techniques inspired by LuaJIT.
Alexander Monakov has called the attention of the highload Telegram chat to this paper[0], saying:
Haswell is tricky for memory bw tuning, as even at fixed core frequency, uncore frequency is not fixed, and depends on factors such as hardware-measured stall cycles:
> According to the respective patent [15], the uncore frequency depends on the stall cycles of the cores, the EPB of the cores, and c-states
> ... uncore frequencies–in addition to EPB and stall cycles–depend on the core frequency of the fastest active core on the system. Moreover, the uncore frequency is also a target of power limitations.
So one wonders if it's not really a matter of reading the RAM in the right pattern to appease the prefetchers, but of using values in the right pattern to create the right pattern of stalls to get the highest frequency.
[0]: https://tu-dresden.de/zih/forschung/ressourcen/dateien/proje...
550x gains with some C++ mixed with gnarly low-level assembly vs standard C++ is pretty shocking to me.
With that in mind, I propose the following solution.
`print(976563)`
... std::cin >> v; ...
Oh come on! That's I/O for every item, I'm surprised it's not even slower.