Counting Bytes Faster Than You'd Think Possible
Matt Stuchlik's high-performance solution counts the bytes with value 127 in a 250MB stream roughly 550 times faster than a naive implementation, using SIMD instructions and an unusual memory read pattern.
Matt Stuchlik's recent exploration in high-performance computing focuses on counting the number of bytes with a value of 127 in a 250MB byte stream. His solution, approximately 550 times faster than a naive implementation, runs on an Intel Xeon E3-1271 v3 processor. The challenge is to read the byte stream and count occurrences of the target byte as efficiently as possible. Stuchlik's method employs SIMD (Single Instruction, Multiple Data) instructions to process data in chunks, leveraging AVX2 for performance gains.
A key innovation in his approach is the use of a memory read pattern that interleaves the processing of multiple 4K pages, which enhances data transfer rates by up to 30% in memory-bound scenarios. This technique takes advantage of the processor's hardware prefetchers, particularly the "Streamer," which can maintain multiple streams of data access. By interleaving access and unrolling the processing kernel, Stuchlik achieves significant performance improvements.
The implementation includes a series of assembly instructions that efficiently compare and count the target byte while managing potential overflow in accumulators. The final count is derived from a combination of narrow and wide accumulators, ensuring accuracy. Stuchlik concludes by noting the under-discussed nature of the page-interleaved read pattern and invites feedback on further memory optimization techniques.
Related
Do not taunt happy fun branch predictor
The author shares insights on optimizing AArch64 assembly code by reducing jumps in loops. Replacing ret with br x30 improved performance, leading to an 8.8x speed increase. Considerations on branch prediction and SIMD instructions are discussed.
Scan HTML faster with SIMD instructions: .NET/C# Edition
WebKit and Chromium enhance HTML content scanning with fast SIMD routines, boosting performance significantly. .NET8 now supports speedy SIMD instructions for C#, achieving impressive speeds comparable to C/C++ implementations.
Beating the L1 cache with value speculation (2021)
Value speculation leverages branch predictor to guess values, enhancing instruction parallelism and L1 cache efficiency. Demonstrated on Xeon E5-1650 v3, it boosts throughput from 14GB/s to 30GB/s by predicting linked list nodes.
Summing ASCII encoded integers on Haswell at almost the speed of memcpy
Matt Stuchlik presents a high-performance algorithm for summing ASCII-encoded integers on Haswell systems. It utilizes SIMD instructions, lookup tables, and efficient operations to achieve speed enhancements, showcasing innovative approaches in integer sum calculations.
Beating the Compiler
The blog post discusses optimizing interpreters in assembly to outperform compilers. By enhancing the Uxn CPU interpreter, a 10-20% speedup was achieved through efficient assembly implementations and techniques inspired by LuaJIT.
Alexander Monakov has called the attention of the highload Telegram chat to this paper[0], saying:
Haswell is tricky for memory bw tuning, as even at fixed core frequency, uncore frequency is not fixed, and depends on factors such as hardware-measured stall cycles:
> According to the respective patent [15], the uncore frequency depends on the stall cycles of the cores, the EPB of the cores, and c-states
> ... uncore frequencies–in addition to EPB and stall cycles–depend on the core frequency of the fastest active core on the system. Moreover, the uncore frequency is also a target of power limitations.
So one wonders if it's not really a matter of reading the RAM in the right pattern to appease the prefetchers, but of using values in the right pattern to create the right pattern of stalls to get the highest frequency.
[0]: https://tu-dresden.de/zih/forschung/ressourcen/dateien/proje...
550x gains with some C++ mixed with gnarly low-level assembly vs standard C++ is pretty shocking to me.
With that in mind, I propose the following solution.
`print(976563)`
... std::cin >> v; ...
Oh come on! That's I/O for every item, I'm surprised it's not even slower.