Optimizing Matrix Multiplication on RDNA3
The article details optimizations for FP32 matrix multiplication on AMD's RDNA3 GPUs, improving performance through Local Data Share (LDS) tiling, yet still falling short of theoretical limits due to resource underutilization.
The article discusses the optimization of FP32 matrix multiplication on AMD's RDNA3 GPUs, specifically targeting performance improvements over the existing rocBLAS library. The author outlines a series of iterative optimizations implemented across eight different kernels, focusing on 4096x4096 matrices for simplicity. The initial naive implementation achieved only 1,010.60 GFLOPS, far below the theoretical peak of 61.44 TFLOPS, while the rocBLAS baseline reached 30,547 GFLOPS. The key issue identified was inefficient global memory access, which left threads stalled on high-latency loads. To address this, the author introduced Local Data Share (LDS) tiling, which stages tiles of the input matrices in fast on-chip memory, raising performance to 4,017.99 GFLOPS. Even so, performance still lagged well behind the theoretical maximum because the GPU's compute resources remained underutilized. The article emphasizes the importance of optimizing memory access patterns and leveraging the RDNA3 architecture to accelerate matrix multiplication, a critical operation in machine learning workloads.
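As a point of reference, the 61.44 TFLOPS figure is consistent with a Radeon RX 7900 XTX (6144 ALUs × 2.5 GHz × 2 FLOPs per FMA × 2 for dual-issue), though the card is an inference from the number rather than something restated in this summary. The naive kernel is likewise not reproduced here, but the approach it names (one thread per output element, every operand fetched from global memory) looks roughly like this minimal HIP sketch:

```cpp
#include <hip/hip_runtime.h>

// Naive FP32 matmul: one thread per element of C = A * B, with every
// operand read straight from global memory (N = 4096 in the article).
__global__ void sgemm_naive(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];  // 2N global loads per output
        C[row * N + col] = acc;
    }
}
```

Every multiply-accumulate here pays for two global-memory reads, which is exactly the latency problem the LDS tiling step attacks.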
- Hand-optimized matrix multiplication kernels on AMD RDNA3 GPUs can significantly outperform existing libraries like rocBLAS.
- Initial naive implementations yield low performance, highlighting the need for efficient memory access strategies.
- Utilizing Local Data Share (LDS) tiling can improve performance by reducing memory access latency (a minimal sketch follows this list).
- Despite optimizations, achieving theoretical performance limits remains challenging due to underutilization of GPU resources.
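To make the LDS point concrete, here is a minimal HIP sketch of the tiling technique described above. It is not the author's kernel: the tile size, the square-matrix shape, and N being divisible by TILE are all illustrative assumptions. Each workgroup stages TILE×TILE tiles of A and B into __shared__ memory (which maps to the LDS on AMD hardware), so each global element is loaded once per tile instead of once per multiply-accumulate.

```cpp
#include <hip/hip_runtime.h>

#define TILE 32  // illustrative tile edge; the article's choice may differ

// LDS-tiled FP32 matmul for square N x N matrices with N % TILE == 0.
__global__ void sgemm_lds(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];  // lives in the LDS on AMD GPUs
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Cooperative load: each thread stages one element of each tile.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();  // tiles fully staged before anyone reads them

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next overwrite
    }
    C[row * N + col] = acc;
}
```

Launched with dim3(TILE, TILE) threads per workgroup over an (N/TILE) × (N/TILE) grid, this is the classic shared-memory matmul, and the roughly 4x gain over the naive kernel reported above is in line with what this transformation typically buys.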
Related
Beating NumPy's matrix multiplication in 150 lines of C code
Aman Salykov's blog delves into high-performance matrix multiplication in C, surpassing NumPy with OpenBLAS on AMD Ryzen 7700 CPU. Scalable, portable code with OpenMP, targeting Intel Core and AMD Zen CPUs. Discusses BLAS, CPU performance limits, and hints at GPU optimization.
Beating NumPy's matrix multiplication in 150 lines of C code
Aman Salykov's blog explores high-performance matrix multiplication in C, surpassing NumPy with OpenBLAS on AMD Ryzen 7700 CPU. Scalable, portable code optimized for modern CPUs with FMA3 and AVX instructions, parallelized with OpenMP for scalability and performance. Discusses matrix multiplication's significance in neural networks, BLAS libraries' role, CPU performance limits, and optimizing implementations without low-level assembly. Mentions fast matrix multiplication tutorials and upcoming GPU optimization post.
Beating NumPy matrix multiplication in 150 lines of C
Aman Salykov's blog explores high-performance matrix multiplication in C, surpassing NumPy with OpenBLAS on AMD Ryzen 7700 CPU. Scalable, portable code optimized for modern CPUs with OpenMP directives for parallelization. Discusses BLAS libraries, CPU performance limits, and matrix multiplication optimization.
How to Optimize a CUDA Matmul Kernel for CuBLAS-Like Performance: A Worklog
The author optimizes a CUDA matrix multiplication kernel, achieving 93.7% of cuBLAS performance through techniques like memory coalescing and shared memory caching, emphasizing GPU performance in deep learning applications.
Fast Multidimensional Matrix Multiplication on CPU from Scratch
The article examines multidimensional matrix multiplication performance on CPUs using Numpy and C++. It discusses optimization techniques and challenges in replicating Numpy's efficiency, emphasizing the importance of memory access patterns.
this is overblown at least wrt forward compatibility - all of the instructions used are in RDNA4 and most of them are even in CDNA3 (CDNA4 isn't public yet?) and the ones that aren't exactly there are only slightly renamed (ds_load -> ds_read). Sure it's annoying but it's not the end of the world to have some `#ifdef`s in your code (that's not very much different from what the compiler itself is going to do anyway).
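For readers who want to see what those `#ifdef`s might look like: the Clang amdgcn toolchain predefines a `__gfx*__` macro for each offload target, so the ds_load/ds_read rename the comment mentions can be papered over with something like the sketch below (illustrative, not from the article):

```cpp
// Pick the LDS load mnemonic per target so inline-asm code assembles on
// both RDNA3 (ds_load_*) and CDNA / older ISAs (ds_read_*).
#if defined(__gfx1100__) || defined(__gfx1101__) || defined(__gfx1102__)
  #define DS_LOAD_B128 "ds_load_b128"   // RDNA3 naming
#else
  #define DS_LOAD_B128 "ds_read_b128"   // CDNA3 and earlier naming
#endif
```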
I think this is what DeepSeek did to get their speedups on older hardware.
Even way back in the days of GPU crypto mining, hand-built custom kernels (mostly just unrolling loops) would yield 20% improvements over just running OpenCL and letting the drivers compile it down.
Well done!
I wonder if there's untapped potential in a GPU language which made all of those implicit classes explicit in code, now that we've sort of stabilized on them. It wouldn't allow you to do anything that you can't already do with clever optimizations and a profiler, but it could have the potential to make the optimizations clearer.
In general I'm very curious as to why we don't have any new languages that are better aligned with current hardware. For some reason we collectively decided that it was more fun to make everything general, which is especially unfortunate considering the real world got increasingly homogeneous. Compiling to some intermediate language makes no sense when you're only ever going to run on x86 anyway.