July 31st, 2024

Fast Multidimensional Matrix Multiplication on CPU from Scratch

The article examines multidimensional matrix multiplication performance on CPUs using Numpy and C++. It discusses optimization techniques and challenges in replicating Numpy's efficiency, emphasizing the importance of memory access patterns.

Read original articleLink Icon
Fast Multidimensional Matrix Multiplication on CPU from Scratch

The article discusses the performance of multidimensional matrix multiplication on a CPU, specifically using Numpy and C++. It highlights that Numpy can multiply two 1024x1024 matrices in approximately 8 milliseconds on a 4-core Intel CPU, achieving around 18 FLOPs per core per cycle. This performance is attributed to the use of optimized BLAS libraries, such as Intel's MKL, which implement functions like SGEMM for efficient matrix operations. The author explores the challenges of replicating this performance using plain C++, noting that a naive implementation takes significantly longer.

The article details various optimization techniques, including compiler flags, register accumulation, and cache-aware loop reordering, which can drastically reduce computation time. For instance, a cache-aware implementation improved performance to 89 milliseconds, demonstrating the importance of memory access patterns in matrix multiplication. The author emphasizes that while achieving performance close to Numpy's implementation is challenging, understanding these optimizations is crucial for developing efficient numerical algorithms.

Ultimately, the article serves as a guide for programmers interested in enhancing matrix multiplication performance in C++, illustrating the impact of hardware architecture and optimization strategies on computational efficiency. The author concludes that while their implementation achieves 9 FLOPs per core per cycle, it is not intended to compete with established BLAS libraries but rather to provide insights into performance optimization techniques.

Related

AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x

AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x

Nscale explores GEMM tuning impact on AI model optimization, emphasizing throughput and latency benefits. Fine-tuning parameters and algorithms significantly boost speed and efficiency, especially on AMD GPUs, showcasing up to 7.2x throughput improvement.

Beating NumPy's matrix multiplication in 150 lines of C code

Beating NumPy's matrix multiplication in 150 lines of C code

Aman Salykov's blog delves into high-performance matrix multiplication in C, surpassing NumPy with OpenBLAS on AMD Ryzen 7700 CPU. Scalable, portable code with OpenMP, targeting Intel Core and AMD Zen CPUs. Discusses BLAS, CPU performance limits, and hints at GPU optimization.

Beating NumPy's matrix multiplication in 150 lines of C code

Beating NumPy's matrix multiplication in 150 lines of C code

Aman Salykov's blog explores high-performance matrix multiplication in C, surpassing NumPy with OpenBLAS on AMD Ryzen 7700 CPU. Scalable, portable code optimized for modern CPUs with FMA3 and AVX instructions, parallelized with OpenMP for scalability and performance. Discusses matrix multiplication's significance in neural networks, BLAS libraries' role, CPU performance limits, and optimizing implementations without low-level assembly. Mentions fast matrix multiplication tutorials and upcoming GPU optimization post.

Beating NumPy matrix multiplication in 150 lines of C

Beating NumPy matrix multiplication in 150 lines of C

Aman Salykov's blog explores high-performance matrix multiplication in C, surpassing NumPy with OpenBLAS on AMD Ryzen 7700 CPU. Scalable, portable code optimized for modern CPUs with OpenMP directives for parallelization. Discusses BLAS libraries, CPU performance limits, and matrix multiplication optimization.

How to Optimize a CUDA Matmul Kernel for CuBLAS-Like Performance: A Worklog

How to Optimize a CUDA Matmul Kernel for CuBLAS-Like Performance: A Worklog

The author optimizes a CUDA matrix multiplication kernel, achieving 93.7% of cuBLAS performance through techniques like memory coalescing and shared memory caching, emphasizing GPU performance in deep learning applications.

Link Icon 8 comments
By @ap4 - 6 months
Related: I created a CUDA kernel typically much faster than kernels from cuBLAS when multiplying large square float32 matrices. Tested mostly on a 4090 GPU so far.

Source code: https://github.com/arekpaterek/Faster_SGEMM_CUDA

  size    tflops_cublas  tflops_my  diff      gpu
  4096²   50.8-50.9      61.8       +21%      4090
  6144²   55.3           59.8       +8%       4090
  8192²   56.3-56.5      67.1       +19%      4090
  12288²  53.7           66.7       +24%      4090
  16384²  53.6           66.7       +24%      4090
  4096²   28.7-28.8      32.5       +13%      4070ts
  4096²   3.8-4.3        6.7        +56-76%   T4
By @hedgehog - 6 months
For those interested in going deeper I think the classic reference in this area is the GotoBLAS paper: https://www.cs.utexas.edu/~pingali/CS378/2008sp/papers/gotoP...
By @namibj - 6 months
The quoted assembly looks L1D$ bandwidth bound; on most common and vaguely recent architectures one has to use register tiling to saturate the FMA units, as a system unable to do more than one vector load and one vector store each per cycle can't ever fully saturate a single FMA unit on GEMM; for 2 FMA units even 2 vector loads and a vector store per cycle won't be enough without register tiling.
By @canjobear - 6 months
Quasi-related: Do BLAS libraries ever actually implement Strassen's Algorithm?
By @Remnant44 - 6 months
I honestly didn't realize how performant the decades-old 2013 Haswell architecture is on vector workloads.

250GFLOP/core is no joke - He also cross-compared to an M1 Pro, that when not using the secret matrix coprocessor achieves effectively the same vector throughput, a decade later...

By @kiririn - 6 months
i7 6700 is skylake not haswell