Beating NumPy matrix multiplication in 150 lines of C
Aman Salykov's blog explores high-performance matrix multiplication in C, surpassing OpenBLAS-backed NumPy on an AMD Ryzen 7700 CPU. Scalable, portable code optimized for modern CPUs and parallelized with OpenMP directives. Discusses BLAS libraries, CPU performance limits, and matrix multiplication optimization.
Read original article

This blog post by Aman Salykov explores implementing high-performance matrix multiplication in C to outperform NumPy's OpenBLAS-backed matrix multiplication. The implementation follows the BLIS design and achieves over 1 TFLOPS on an AMD Ryzen 7700 CPU. The code is scalable, portable, and optimized for modern CPUs with FMA3 and AVX instructions, and it is parallelized with OpenMP directives for scalability and performance. The post discusses the importance of matrix multiplication in neural networks and the role of BLAS libraries like OpenBLAS. It also delves into the theoretical limits of CPU performance and the process of optimizing matrix multiplication through kernels. The author provides a naive implementation and introduces the concept of kernels for efficient matrix multiplication. The post aims to compare the implemented algorithm with NumPy's performance and hints at a follow-up post on optimizing matrix multiplication on GPUs.
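The naive implementation the post starts from is, in essence, the textbook triple loop. A minimal sketch of that baseline (not necessarily the author's exact code; function and parameter names are illustrative):

```c
#include <stddef.h>

// Naive matrix multiplication: C = A * B.
// A is M x K, B is K x N, C is M x N, all stored row-major.
// This is the baseline the post later improves with blocking,
// packing, and a vectorized micro-kernel.
void matmul_naive(const float *A, const float *B, float *C,
                  size_t M, size_t N, size_t K) {
    for (size_t i = 0; i < M; i++) {
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < K; p++) {
                acc += A[i * K + p] * B[p * N + j];
            }
            C[i * N + j] = acc;
        }
    }
}
```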
Related
Researchers upend AI status quo by eliminating matrix multiplication in LLMs
Researchers innovate AI language models by eliminating matrix multiplication, enhancing efficiency. A MatMul-free method reduces power consumption and costs, and challenges the necessity of matrix multiplication in high-performing models.
AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x
Nscale explores AI model optimization through GEMM tuning, leveraging rocBLAS and hipBLASlt for AMD MI300x GPUs. Results show up to 7.2x throughput increase and reduced latency, benefiting large models and enhancing processing efficiency.
AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x
Nscale explores GEMM tuning impact on AI model optimization, emphasizing throughput and latency benefits. Fine-tuning parameters and algorithms significantly boost speed and efficiency, especially on AMD GPUs, showcasing up to 7.2x throughput improvement.
Beating NumPy's matrix multiplication in 150 lines of C code
Aman Salykov's blog delves into high-performance matrix multiplication in C, surpassing OpenBLAS-backed NumPy on an AMD Ryzen 7700 CPU. Scalable, portable code with OpenMP, targeting Intel Core and AMD Zen CPUs. Discusses BLAS, CPU performance limits, and hints at GPU optimization.
Do not taunt happy fun branch predictor
The author shares insights on optimizing AArch64 assembly code by reducing jumps in loops. Replacing ret with br x30 improved performance, leading to an 8.8x speed increase. Considerations on branch prediction and SIMD instructions are discussed.
Getting a 10-1000x or more improvement on existing code is very common without putting in a ton of effort, provided the code was not already heavily optimized. These are listed roughly in order of importance, but performance is often such a non-consideration for most developers that a little effort goes a long way.
1. Most importantly, is the algorithm a good choice? Can we eliminate some work entirely? (this is what algo interviews are testing for)
2. Can we eliminate round trips to the kernel and similar heavy operations? The most common huge gain here is replacing tons of malloc calls with a custom allocator.
3. Can we vectorize? Explicit vector intrinsics like in the blog post are great, but you can often get the same machine code by reorganizing your data into arrays / structs of arrays rather than arrays of structs (see the sketch after this list).
4. Can we optimize for cache efficiency? If you already reorganized for vectors this might already be handled, but this can get more complicated with parallel code if you can't isolate data to one thread (false sharing, etc.)
5. Can we do anything else that's hardware specific? This can be anything from using intrinsics to hand-coding assembly.
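As a hedged illustration of point 3 (the struct names and fields are invented for this example, not taken from the post or the commenter): an array-of-structs layout interleaves fields the inner loop doesn't need, while a struct-of-arrays layout keeps each field contiguous so the compiler can auto-vectorize the loop.

```c
#include <stddef.h>

// Array of structs: x, y, z are interleaved, so a loop over only x
// reads with a 12-byte stride and vectorizes poorly.
struct PointAoS { float x, y, z; };

// Struct of arrays: each field is its own contiguous array, so a loop
// over x reads consecutive floats and is easy to auto-vectorize.
struct PointsSoA {
    float *x;
    float *y;
    float *z;
};

float sum_x_aos(const struct PointAoS *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += p[i].x;  // strided loads
    return s;
}

float sum_x_soa(const struct PointsSoA *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += p->x[i]; // contiguous loads
    return s;
}
```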
SIMD intrinsics aren't needed for micro-kernel vectorization, as a decent C compiler will fully vectorize and unroll it. BLIS's pure C micro-kernel gets >80% of the performance of the hand-optimized implementation on Haswell with appropriate block sizes. The difference is likely due to prefetch, but I don't properly understand it.
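A minimal sketch of what such a pure C micro-kernel can look like (tile sizes and names are illustrative, not BLIS's actual kernel); with `-O3 -march=native`, a decent compiler will typically keep the accumulator in vector registers and turn the inner loops into unrolled FMA instructions:

```c
#define MR 8
#define NR 8

// Compute an MR x NR block of C += A * B from packed panels:
// A is packed so the MR values for each k are contiguous (A[p*MR + i]),
// B is packed so the NR values for each k are contiguous (B[p*NR + j]).
// Plain C with a local accumulator; no intrinsics needed.
static void microkernel_c(int kc, const float *A, const float *B,
                          float *C, int ldc) {
    float acc[MR][NR] = {{0.0f}};
    for (int p = 0; p < kc; p++) {
        for (int i = 0; i < MR; i++) {
            for (int j = 0; j < NR; j++) {
                acc[i][j] += A[p * MR + i] * B[p * NR + j];
            }
        }
    }
    for (int i = 0; i < MR; i++)
        for (int j = 0; j < NR; j++)
            C[i * ldc + j] += acc[i][j];
}
```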
A fairly moot nitpick though: given that this is for only one column of the matrix, the following loops of maskload/maskstore will take significantly more time (especially the store, which is still slow on Zen 4 [1], even though the AVX-512 instruction, whose only difference is taking the mask in a mask register, is 6x faster), and clang autovectorizes the shifting anyway (maybe 2-3x slower than my suggestions).
[1]: https://uops.info/table.html?search=vmaskmovps&cb_lat=on&cb_...
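For context, the maskload/maskstore being discussed are the AVX masked load/store operations (vmaskmovps). A hedged sketch of one way a partial (tail) column can be handled with them, using invented names and a compare-based mask rather than whatever the post's actual code does:

```c
#include <immintrin.h>  // compile with -mavx2 -mfma

// Handle a tail of `rem` floats (0 < rem < 8): dst[i] += alpha * src[i].
// Lanes whose mask top bit is set are loaded/stored; masked-out lanes
// are left untouched on store and read as zero on load.
static void axpy_tail_avx(float *dst, const float *src,
                          float alpha, int rem) {
    // Per-lane mask: lane i is active if i < rem.
    __m256i idx  = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(rem), idx);

    __m256 s = _mm256_maskload_ps(src, mask);
    __m256 d = _mm256_maskload_ps(dst, mask);
    __m256 r = _mm256_fmadd_ps(_mm256_set1_ps(alpha), s, d);
    _mm256_maskstore_ps(dst, mask, r);
}
```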
https://hacks.mozilla.org/2024/04/llamafiles-progress-four-m...
Also, if you were designing for smaller cases, say MNK=16 or 32, how would you approach it differently? I'm implementing neural ODEs and this is one point I've been considering.
> Important! Please don’t expect peak performance without fine-tuning the hyperparameters, such as the number of threads, kernel and block sizes, unless you are running it on a Ryzen 7700(X). More on this in the tutorial.
I think I'll need a TL;DR on what to change all these values to.
I have a Ryzen 7950X and as a first test I tried to only change NTHREADS to 32 in benchmark.c, but matmul.c performed worse than NumPy on my machine.
So I took a look at the other values in benchmark.c, but MC and NC are already calculated from the number of threads (so these are probably already 'fine-tuned'?), and I couldn't really understand how KC = 1000 fits the 7700(X) (the author's CPU) or how I'd need to adjust it for the 7950X (with the information from the article).
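For what it's worth, in BLIS-style GEMM the block sizes are usually tied to the cache hierarchy: a KC x NR micro-panel of B should sit in L1, an MC x KC block of A in L2, and a KC x NC panel of B in L3. A rough, hedged sketch of that reasoning (the cache sizes and MR/NR tile below are assumptions, not values from the article):

```c
#include <stdio.h>

// Rough BLIS-style sizing heuristic (a sketch, not the article's formula):
//   KC * NR floats (a B micro-panel) should fit comfortably in L1,
//   MC * KC floats (an A block)      should fit comfortably in L2.
// Cache sizes here are assumed for a Zen 4 core: 32 KiB L1d, 1 MiB L2.
int main(void) {
    const long NR = 16, MR = 6;                 // micro-kernel tile (assumed)
    const long L1 = 32 * 1024, L2 = 1024 * 1024;
    const long fsz = (long)sizeof(float);

    // Use roughly half of each cache level to leave room for other traffic.
    long KC = (L1 / 2) / (NR * fsz);
    long MC = (L2 / 2) / (KC * fsz);
    MC -= MC % MR;                              // keep MC a multiple of MR

    printf("KC ~ %ld, MC ~ %ld\n", KC, MC);
    return 0;
}
```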
Cursed code. I checked and every single invocation is just `int`, so why do this? You can just write a function:

```go
func min(x, y int) int {
	if x < y {
		return x
	}
	return y
}
```
and keep type safety.

Great job! Self-publishing things like this was a hallmark of the early internet that I, for one, sorely miss.