July 3rd, 2024

Beating NumPy matrix multiplication in 150 lines of C

Aman Salykov's blog explores high-performance matrix multiplication in C, surpassing NumPy with OpenBLAS on AMD Ryzen 7700 CPU. Scalable, portable code optimized for modern CPUs with OpenMP directives for parallelization. Discusses BLAS libraries, CPU performance limits, and matrix multiplication optimization.

Beating NumPy matrix multiplication in 150 lines of C

This blog post by Aman Salykov explores implementing high-performance matrix multiplication in C to outperform NumPy's matrix multiplication, which is backed by OpenBLAS. The implementation follows the BLIS design, achieving over 1 TFLOPS on an AMD Ryzen 7700 CPU. The code is scalable, portable, and optimized for modern CPUs with FMA3 and AVX instructions, and it is parallelized with OpenMP directives to scale across cores. The post discusses the importance of matrix multiplication in neural networks and the role of BLAS libraries like OpenBLAS. It also delves into the theoretical limits of CPU performance and the process of optimizing matrix multiplication through kernels. The author provides a naive implementation and introduces the concept of kernels for efficient matrix multiplication. The post aims to compare the implemented algorithm with NumPy's performance and hints at a follow-up post on optimizing matrix multiplication on GPUs.
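
As a point of reference, the naive triple-loop version the post starts from looks roughly like this (a sketch, not the author's exact code; row-major storage and the function name are assumptions):

    #include <stddef.h>

    /* Naive baseline: C += A * B, with A (m x k), B (k x n), C (m x n),
     * all assumed row-major. The optimized kernels are measured against
     * something like this. */
    void matmul_naive(const float *A, const float *B, float *C,
                      size_t m, size_t n, size_t k) {
        for (size_t i = 0; i < m; i++)
            for (size_t p = 0; p < k; p++)
                for (size_t j = 0; j < n; j++)
                    C[i * n + j] += A[i * k + p] * B[p * n + j];
    }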

Related

Researchers upend AI status quo by eliminating matrix multiplication in LLMs

Researchers innovate AI language models by eliminating matrix multiplication, enhancing efficiency. A MatMul-free method reduces power consumption, costs, and challenges the necessity of matrix multiplication in high-performing models.

AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x

Nscale explores AI model optimization through GEMM tuning, leveraging rocBLAS and hipBLASlt for AMD MI300x GPUs. Results show up to 7.2x throughput increase and reduced latency, benefiting large models and enhancing processing efficiency.

Beating NumPy's matrix multiplication in 150 lines of C code

Aman Salykov's blog delves into high-performance matrix multiplication in C, surpassing NumPy with OpenBLAS on AMD Ryzen 7700 CPU. Scalable, portable code with OpenMP, targeting Intel Core and AMD Zen CPUs. Discusses BLAS, CPU performance limits, and hints at GPU optimization.

Do not taunt happy fun branch predictor

The author shares insights on optimizing AArch64 assembly code by reducing jumps in loops. Replacing ret with br x30 improved performance, leading to an 8.8x speed increase. Considerations on branch prediction and SIMD instructions are discussed.

21 comments
By @epr - 3 months
If the point of this article is that there's generally performance left on the table, it is, if anything, understating how much room for improvement there usually is, considering how much effort goes into matmul libraries compared to most other software.

Getting a 10-1000x or more improvement on existing code is very common without putting in a ton of effort if the code was not already heavily optimized. These are listed roughly in order of importance, but performance is often such a non-consideration for most developers that a little effort goes a long way.

1. Most importantly, is the algorithm a good choice? Can we eliminate some work entirely? (this is what algo interviews are testing for)

2. Can we eliminate round trips to the kernel and similar heavy operations? The most common huge gain here is replacing tons of malloc calls with a custom allocator.

3. Can we vectorize? Explicit vector intrinsics like in the blog post are great, but you can often get the same machine code by reorganizing your data into arrays / struct of arrays rather than arrays of structs (see the sketch after this list).

4. Can we optimize for cache efficiency? If you already reorganized for vectors this might already be handled, but this can get more complicated with parallel code if you can't isolate data to one thread (false sharing, etc.)

5. Can we do anything else that's hardware specific? This can be anything from using intrinsics to hand-coding assembly.
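
A tiny illustration of point 3 (the names and layout are made up for the example): the struct-of-arrays version gives the compiler contiguous data it can vectorize without intrinsics.

    #include <stddef.h>

    /* Array-of-structs: fields are interleaved, so a loop over one field
     * strides through memory and vectorizes poorly. */
    struct particle { float x, y, z; };

    /* Struct-of-arrays: each field is contiguous, so the loop below is
     * typically auto-vectorized by gcc/clang at -O2/-O3. */
    struct particles {
        float *x, *y, *z;
        size_t n;
    };

    void scale_x(struct particles *p, float s) {
        for (size_t i = 0; i < p->n; i++)
            p->x[i] *= s;    /* unit-stride access: trivially vectorizable */
    }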

By @ssivark - 4 months
Most common coding patterns leave a lot of performance unclaimed, by not fully specializing to the hardware. This article is an interesting example. For another interesting demonstration, see this CS classic "There's plenty of room at the top"

https://www.science.org/doi/10.1126/science.aam9744

By @gnufx - 3 months
The papers referenced in the BLIS repo are the authoritative reference to understand this stuff. I've no idea why people think optimized BLASes aren't performant; you should expect >90% of CPU peak for sufficiently large matrices, and serial OpenBLAS was generally on a par with MKL last I saw. (BLAS implements GEMM, not matmul, as the basic linear algebra building block.) I don't understand using numpy rather than the usual benchmark frameworks, and on Zen surely you should compare with AMD's BLAS (based on BLIS). BLIS had a better parallelization story than OpenBLAS last I knew. AMD BLIS also has the implementation switch for "small" dimensions, and I don't know if current OpenBLAS does.

SIMD intrinsics aren't needed for micro-kernel vectorization, as a decent C compiler will fully vectorize and unroll it. BLIS' pure C micro-kernel gets >80% of the performance of the hand-optimized implementation on Haswell with appropriate block sizes. The difference is likely to be due to prefetch, but I don't properly understand it.
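
For illustration, a pure-C micro-kernel in the spirit described here might look like this (a sketch, not BLIS's actual kernel; MR/NR and the packed-panel layout are assumptions):

    #define MR 4
    #define NR 4

    /* One MR x NR tile of C += A * B, where the A panel is packed as MR
     * floats per k-step and the B panel as NR floats per k-step. With
     * MR/NR fixed at compile time, gcc/clang at -O3 -march=native unroll
     * and vectorize this loop nest without any intrinsics. */
    static void microkernel(int kc, const float *restrict a,
                            const float *restrict b,
                            float *restrict c, int ldc) {
        float acc[MR][NR] = {{0.0f}};
        for (int p = 0; p < kc; p++)
            for (int i = 0; i < MR; i++)
                for (int j = 0; j < NR; j++)
                    acc[i][j] += a[p * MR + i] * b[p * NR + j];
        for (int i = 0; i < MR; i++)
            for (int j = 0; j < NR; j++)
                c[i * ldc + j] += acc[i][j];
    }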

By @ks2048 - 4 months
This looks like a nice write-up and implementation. I'm left wondering what is the "trick"? How does it manage to beat OpenBLAS, which is assembly+C optimized over decades for this exact problem? It goes into detail about caching, etc - is BLAS not taking advantage of these things, or is this more tuned to this specific processor, etc?
By @bjourne - 3 months
Good writeup and commendable of you to make your benchmark so easily repeatable. On my 16-core Xeon(R) W-2245 CPU @ 3.90GHz matmul.c takes about 1.41 seconds to multiply 8192x8192 matrices when compiled with gcc -O3 and 1.47 seconds when compiled with clang -O2, while NumPy does it in 1.07 seconds. I believe an avx512 kernel would be significantly faster. Another reason for the lackluster performance may be omp. IME, you can reduce overhead by managing the thread pool explicitly with pthreads (and using sysconf(_SC_NPROCESSORS_ONLN) instead of hard-coding the thread count).
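
The sysconf call mentioned above is a one-liner (a minimal sketch; feeding the value to OpenMP rather than a hand-rolled pthread pool is just for brevity):

    #include <unistd.h>
    #include <omp.h>

    int main(void) {
        long ncores = sysconf(_SC_NPROCESSORS_ONLN); /* cores currently online */
        if (ncores < 1) ncores = 1;                  /* fallback if the query fails */
        omp_set_num_threads((int)ncores);            /* instead of a hard-coded NTHREADS */
        return 0;
    }
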
By @david-gpu - 3 months
There is no reason to burden one side with Python while the other side is C, when they could have just as easily performed an apples-to-apples comparison where both sides are written in C, one calling a BLAS library while the other calls this other implementation.
By @dzaima - 4 months
Though not at all part of the hot path, the inefficiency of the mask generation ('bit_mask' usage) nags me. Some more efficient methods include creating a global constant array of {-1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1, 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0} and loading from it at element offsets 16-m and 8-m, or comparing constant vector {0,1,2,3,4,...} with broadcasted m and m-8.

Very moot nitpick though, given that this is for only one column of the matrix, the following loops of maskload/maskstore will take significantly more time (esp. store, which is still slow on Zen 4[1] despite the AVX-512 instruction (whose only difference is taking the mask in a mask register) being 6x faster), and clang autovectorizes the shifting anyways (maybe like 2-3x slower than my suggestions).

[1]: https://uops.info/table.html?search=vmaskmovps&cb_lat=on&cb_...
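
For the curious, the second suggestion (compare a constant index vector against a broadcast of m) could look something like this with AVX2 intrinsics (a sketch; the function name is made up):

    #include <immintrin.h>

    /* 8-lane tail mask for m in 0..8: lane i is all-ones iff i < m,
     * ready for _mm256_maskload_ps / _mm256_maskstore_ps. */
    static inline __m256i tail_mask(int m) {
        const __m256i idx = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        return _mm256_cmpgt_epi32(_mm256_set1_epi32(m), idx); /* m > i  <=>  i < m */
    }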

By @leeoniya - 4 months
By @throwaway4good - 3 months
What is the point of making the matrix multiplication itself multithreaded (other than benchmarking)? Wouldn't it be more beneficial in practice to have the multithreading in the algorithm that uses the multiplication?
By @azornathogron - 4 months
I've only skimmed so far, but this post has a lot of detail and explanation. Looks like a pretty great description of how fast matrix multiplications are implemented to take into account architectural concerns. Goes on my reading list!
By @marmaduke - 3 months
Very nice write up. Those are the kind of matrix sizes that MKL is fairly good at, might be worth a comparison as well?

Also, if you were designing for smaller cases, say MNK=16 or 32, how would you approach it differently? I'm implementing neural ODEs and this is one point I've been considering.

By @SushiHippie - 3 months
In the README, it says:

> Important! Please don’t expect peak performance without fine-tuning the hyperparameters, such as the number of threads, kernel and block sizes, unless you are running it on a Ryzen 7700(X). More on this in the tutorial.

I think I'll need a TL;DR on what to change all these values to.

I have a Ryzen 7950X and as a first test I tried to only change NTHREADS to 32 in benchmark.c, but matmul.c performed worse than NumPy on my machine.

So I took a look at the other values present in benchmark.c, but MC and NC are already calculated from the number of threads (so these are probably already 'fine-tuned'?), and I couldn't really understand how KC = 1000 fits the 7700(X) (the author's CPU) and how I'd need to adjust it for the 7950X (with the information from the article).
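
As general background (this is the standard BLIS-style guidance, not necessarily the exact scheme matmul.c uses to pick its defaults), the blocking parameters are tied to cache levels roughly as follows:

    /* BLIS-style cache-blocking intuition for single-precision GEMM:
     *   KC: an MR x KC sliver of A plus a KC x NR sliver of B should stay
     *       resident in a low cache level across their reuse;
     *   MC: an MC x KC block of A should occupy most of L2;
     *   NC: a  KC x NC block of B should occupy most of the L3 slice
     *       available to the threads sharing it.
     * The 7700X and 7950X are both Zen 4 with identical per-core L1/L2,
     * so KC is unlikely to be the culprit; the thread count, the MC/NC
     * values derived from it, and how threads are spread across the
     * 7950X's two CCDs are the more likely knobs. */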

By @Bayes7 - 3 months
This was a great read, thanks a lot! On a side note, does anyone have a good guess what tool/software they used to create the visualizations of the matrix multiplications and the memory layout?
By @teo_zero - 3 months
Does it make sense to compare a C executable with an interpreted Python program that calls a compiled library? Is the difference due to the algorithm or the call stack?
By @softmicro - 3 months
Is there any explanation for the drops and rises in GFLOP/S at certain matrix sizes, as shown in the plots?
By @p0w3n3d - 3 months
This only proves even more how Python is magnificently fast. Only 5% difference from the native code...
By @jstrong - 4 months
in terms of comparing to numpy, how much overhead would there be from Python (vs. running the numpy C code alone)?
By @38 - 3 months
> #define min(x, y) ((x) < (y) ? (x) : (y))

Cursed code. I checked, and every single invocation is just `int`, so why do this? You can just write a function:

    static inline int min(int x, int y) {
        return x < y ? x : y;
    }
and keep type safety
By @KerrAvon - 4 months
The article claims this is portable C. Given the use of Intel intrinsics, what happens if you try to compile it for ARM64?
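
For context: the _mm256_* intrinsics are x86-specific, so an ARM64 build fails at compile time unless the vector code is guarded along these lines (a sketch; the NEON path is only a placeholder):

    #if defined(__AVX2__) && defined(__FMA__)
      #include <immintrin.h>
      /* existing x86 micro-kernel built from _mm256_* intrinsics */
    #elif defined(__ARM_NEON)
      #include <arm_neon.h>
      /* a NEON micro-kernel using float32x4_t would go here */
    #else
      /* portable scalar fallback */
    #endif
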
By @tom306 - 3 months
Does numpy's implementation actually use multithreading? If not, then the comparison is not fair.
By @le-mark - 4 months
> This is my first time writing a blog post. If you enjoy it, please subscribe and share it!

Great job! Self-publishing things like this was a hallmark of the early internet that I, for one, sorely miss.