July 23rd, 2024

Counting Bytes Faster Than You'd Think Possible

Matt Stuchlik describes a way to count the bytes equal to 127 in a 250MB stream roughly 550 times faster than a naive implementation, using SIMD instructions and a page-interleaved memory read pattern.

Matt Stuchlik's recent exploration in high-performance computing focuses on counting the number of bytes with a value of 127 in a 250MB byte stream. His solution runs on an Intel Xeon E3-1271 v3 (Haswell) processor and is approximately 550 times faster than a naive implementation. The challenge is to read the byte stream and count occurrences of the target byte as quickly as possible. Stuchlik's method uses AVX2 SIMD (Single Instruction, Multiple Data) instructions to compare many bytes against the target at once instead of one at a time.
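To make the SIMD idea concrete, here is a minimal C++ sketch of counting a target byte 32 bytes at a time. It is not Stuchlik's kernel: the function name `count_byte` is hypothetical, the movemask-plus-popcount reduction is a simpler stand-in for his accumulator scheme, and `__builtin_popcount` assumes GCC or Clang. The AVX2 path is only compiled when the compiler defines `__AVX2__` (for example with `-mavx2`); otherwise the scalar loop does all the work.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper: counts bytes equal to `target` in data[0..n).
std::uint64_t count_byte(const std::uint8_t* data, std::size_t n,
                         std::uint8_t target) {
    assert(data != nullptr || n == 0);
    std::uint64_t total = 0;
    std::size_t i = 0;
#if defined(__AVX2__)
    // Broadcast the target byte into all 32 lanes of a 256-bit register.
    const __m256i needle = _mm256_set1_epi8(static_cast<char>(target));
    for (; i + 32 <= n; i += 32) {
        const __m256i chunk = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(data + i));
        // Each lane becomes 0xFF where the byte matches, 0x00 otherwise.
        const __m256i eq = _mm256_cmpeq_epi8(chunk, needle);
        // Collapse the 32 lanes into one bit each, then count the set bits.
        const unsigned mask = static_cast<unsigned>(_mm256_movemask_epi8(eq));
        total += static_cast<std::uint64_t>(__builtin_popcount(mask));
    }
#endif
    // Scalar tail (and the whole input when AVX2 is not enabled).
    for (; i < n; ++i) {
        total += (data[i] == target) ? 1u : 0u;
    }
    return total;
}
```

Calling `count_byte(buffer, length, 127)` on a buffer already in memory returns the number of matching bytes. Reducing every chunk with a popcount is easy to read but does more work per chunk than keeping per-lane counters in a vector register.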

A key innovation in his approach is the use of a memory read pattern that interleaves the processing of multiple 4K pages, which enhances data transfer rates by up to 30% in memory-bound scenarios. This technique takes advantage of the processor's hardware prefetchers, particularly the "Streamer," which can maintain multiple streams of data access. By interleaving access and unrolling the processing kernel, Stuchlik achieves significant performance improvements.
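The access order of a page-interleaved read can be sketched in portable C++ as follows. This only illustrates the traversal pattern and is not Stuchlik's code: the function name `count_interleaved`, the choice of 4 pages per group (`kPages`), and the 64-byte step (`kStep`, one cache line) are all assumptions, and the inner comparison is scalar for clarity. Whether this ordering helps at all depends on the hardware, so it should be measured rather than assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kPageSize = 4096;  // 4K pages, as in the article
constexpr std::size_t kPages = 4;        // assumption: pages read in lockstep
constexpr std::size_t kStep = 64;        // assumption: one cache line per visit

// Hypothetical helper: counts `target` while visiting kPages pages in an
// interleaved order, so several sequential streams are active at once.
std::uint64_t count_interleaved(const std::uint8_t* data, std::size_t n,
                                std::uint8_t target) {
    assert(data != nullptr || n == 0);
    std::uint64_t total = 0;
    const std::size_t group = kPageSize * kPages;
    std::size_t base = 0;
    for (; base + group <= n; base += group) {
        // Walk the same offset in every page of the group before advancing.
        for (std::size_t off = 0; off < kPageSize; off += kStep) {
            for (std::size_t p = 0; p < kPages; ++p) {
                const std::uint8_t* line = data + base + p * kPageSize + off;
                for (std::size_t b = 0; b < kStep; ++b) {
                    total += (line[b] == target) ? 1u : 0u;
                }
            }
        }
    }
    // Bytes that do not fill a whole group are counted sequentially.
    for (std::size_t i = base; i < n; ++i) {
        total += (data[i] == target) ? 1u : 0u;
    }
    return total;
}
```

Every byte is still visited exactly once, so the result matches a plain sequential count; only the order of memory accesses changes.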

The implementation includes a series of assembly instructions that efficiently compare and count the target byte while managing potential overflow in accumulators. The final count is derived from a combination of narrow and wide accumulators, ensuring accuracy. Stuchlik concludes by noting the under-discussed nature of the page-interleaved read pattern and invites feedback on further memory optimization techniques.
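The narrow and wide accumulator idea can be sketched in portable C++ as below. This is an illustration under stated assumptions, not the article's assembly: the function name `count_with_narrow_accumulators` is hypothetical, and a plain array of 32 `uint8_t` lanes stands in for a 256-bit vector register. Each 8-bit lane can hold at most 255, so the narrow counters are flushed into a 64-bit total after at most 255 rounds, before any lane can overflow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: per-lane 8-bit counters, periodically flushed into a
// 64-bit total so that no lane can wrap around.
std::uint64_t count_with_narrow_accumulators(const std::uint8_t* data,
                                             std::size_t n,
                                             std::uint8_t target) {
    assert(data != nullptr || n == 0);
    constexpr std::size_t kLanes = 32;       // stands in for one AVX2 register
    constexpr std::size_t kMaxRounds = 255;  // largest value a uint8_t holds
    std::uint64_t wide = 0;
    std::size_t i = 0;
    while (i + kLanes <= n) {
        std::uint8_t narrow[kLanes] = {};
        std::size_t rounds = 0;
        // Each round adds at most 1 to every lane, so 255 rounds are safe.
        while (rounds < kMaxRounds && i + kLanes <= n) {
            for (std::size_t lane = 0; lane < kLanes; ++lane) {
                const int hit = (data[i + lane] == target) ? 1 : 0;
                narrow[lane] = static_cast<std::uint8_t>(narrow[lane] + hit);
            }
            i += kLanes;
            ++rounds;
        }
        // Flush the narrow lanes into the wide accumulator.
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            wide += narrow[lane];
        }
    }
    // Count any leftover bytes that do not fill a whole 32-byte row.
    for (; i < n; ++i) {
        wide += (data[i] == target) ? 1u : 0u;
    }
    return wide;
}
```

The design trade-off is that narrow counters keep the hot loop cheap (one add per lane per round), while the more expensive widening step runs only once every 255 rounds.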

8 comments
By @anonymoushn - 6 months
My own solution which is ~1ms faster uses some other pattern that was found experimentally, but I cannot seem to get it to go any faster by tuning the parameters, and the #1 spot remains slightly out of reach.

Alexander Monakov has called the attention of the highload Telegram chat to this paper[0], saying:

  Haswell is tricky for memory bw tuning, as even at fixed core frequency, uncore frequency is not fixed, and depends on factors such as hardware-measured stall cycles:

  > According to the respective patent [15], the uncore frequency depends on the stall cycles of the cores, the EPB of the cores, and c-states

  > ... uncore frequencies–in addition to EPB and stall cycles–depend on the core frequency of the fastest active core on the system. Moreover, the uncore frequency is also a target of power limitations.

So one wonders if it's not really a matter of reading the RAM in the right pattern to appease the prefetchers but of using values in the right pattern to create the right pattern of stalls to get the highest frequency.

[0]: https://tu-dresden.de/zih/forschung/ressourcen/dateien/proje...

By @sYnfo - 6 months
FYI vient [0] figured out that simply compiling with "-static -fno-pie" and _exit(0)-ing at the end puts the solution presented here at 15000 points and hence #4 on the leaderboard. Pretty funny.

[0] https://news.ycombinator.com/user?id=vient

By @dinobones - 6 months
Is there a path forward for compilers to eke out these optimization gains eventually? Is there even a path?

550x gains with some C++ / mixed gnarly low level assembly vs standard C++ is pretty shocking to me.

By @maxbond - 6 months
Usually, it's fair game to use all of the information presented in an exam-style question to derive your answer.

With that in mind, I propose the following solution.

`print(976563)`

By @lumb63 - 6 months
Does anyone have any tips for similar wizardry-level SIMD optimization on ARM?

By @rini17 - 6 months
Can this optimization be applied to matmult for us, critters who are running llama on cpu? XD

By @_a_a_a_ - 6 months
"The solution presented here is ~550x faster than the following naive program."

   ... std::cin >> v; ...

Oh come on! That's I/O for every item, I'm surprised it's not even slower.

By @TacticalCoder - 6 months
Let me hazard a guess: that blog post was not written by a LLM!?