July 26th, 2024

How to Optimize a CUDA Matmul Kernel for CuBLAS-Like Performance: A Worklog

The author optimizes a CUDA matrix multiplication kernel, achieving 93.7% of cuBLAS performance through techniques like memory coalescing and shared memory caching, emphasizing GPU performance in deep learning applications.

The author sets out to optimize a CUDA matrix multiplication (matmul) kernel to reach performance close to cuBLAS, NVIDIA's optimized matrix library. The worklog details a series of iterative optimizations, starting from a naive implementation that achieves only 1.3% of cuBLAS performance and progressing through techniques such as global memory coalescing, shared memory caching, and block tiling, each of which improves throughput. The final optimized kernel reaches 93.7% of cuBLAS performance, a large improvement in GFLOPs/s. The author emphasizes the importance of understanding GPU performance characteristics, particularly for deep learning, where matrix multiplication is a critical operation. The post also discusses how memory access patterns affect performance, highlighting the need to coalesce global memory accesses to reduce memory traffic, and explains CUDA's thread hierarchy and the role of warps in optimizing memory access. The worklog serves as a practical guide for developers looking to improve CUDA kernel performance, with tensor cores and warp matrix functions noted as possible future work. The author invites collaboration on kernel optimization at Anthropic, indicating ongoing opportunities in this field.
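To make the coalescing and shared-memory steps concrete, here is a minimal sketch of an intermediate tiled kernel in that spirit. It is an illustration only, not the author's code; the kernel name, BLOCK size, and the assumption that M, N, and K are multiples of BLOCK (no bounds checks) are all placeholders.

    // Each block caches a BLOCK x BLOCK tile of A and B in shared memory;
    // consecutive threads in a warp read consecutive addresses, so the
    // global loads coalesce.
    #define BLOCK 32

    __global__ void sgemm_smem(int M, int N, int K, const float *A,
                               const float *B, float *C) {
        __shared__ float As[BLOCK][BLOCK];
        __shared__ float Bs[BLOCK][BLOCK];

        int row = blockIdx.y * BLOCK + threadIdx.y;  // row of C this thread owns
        int col = blockIdx.x * BLOCK + threadIdx.x;  // col of C this thread owns
        float acc = 0.0f;

        for (int t = 0; t < K; t += BLOCK) {
            // threadIdx.x varies fastest within a warp, so both loads coalesce
            As[threadIdx.y][threadIdx.x] = A[row * K + (t + threadIdx.x)];
            Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
            __syncthreads();

            for (int k = 0; k < BLOCK; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * N + col] = acc;
    }

    // Launch sketch: dim3 block(BLOCK, BLOCK), grid(N / BLOCK, M / BLOCK);
    // sgemm_smem<<<grid, block>>>(M, N, K, dA, dB, dC);

The worklog's later kernels go further (per-thread register tiling, vectorized loads, autotuned block sizes), which is where most of the remaining gap to cuBLAS closes.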

Related

Show HN: UNet diffusion model in pure CUDA

The GitHub content details optimizing a UNet diffusion model in C++/CUDA to match PyTorch's performance. It covers custom convolution kernels, forward pass improvements, backward pass challenges, and future optimization plans.

AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x

Nscale explores GEMM tuning impact on AI model optimization, emphasizing throughput and latency benefits. Fine-tuning parameters and algorithms significantly boost speed and efficiency, especially on AMD GPUs, showcasing up to 7.2x throughput improvement.

Beating NumPy's matrix multiplication in 150 lines of C code

Aman Salykov's blog delves into high-performance matrix multiplication in C, surpassing NumPy with OpenBLAS on AMD Ryzen 7700 CPU. Scalable, portable code with OpenMP, targeting Intel Core and AMD Zen CPUs. Discusses BLAS, CPU performance limits, and hints at GPU optimization.

Beating NumPy's matrix multiplication in 150 lines of C code

Aman Salykov's blog explores high-performance matrix multiplication in C, surpassing NumPy with OpenBLAS on AMD Ryzen 7700 CPU. Scalable, portable code optimized for modern CPUs with FMA3 and AVX instructions, parallelized with OpenMP for scalability and performance. Discusses matrix multiplication's significance in neural networks, BLAS libraries' role, CPU performance limits, and optimizing implementations without low-level assembly. Mentions fast matrix multiplication tutorials and upcoming GPU optimization post.

Beating NumPy matrix multiplication in 150 lines of C

Aman Salykov's blog explores high-performance matrix multiplication in C, surpassing NumPy with OpenBLAS on AMD Ryzen 7700 CPU. Scalable, portable code optimized for modern CPUs with OpenMP directives for parallelization. Discusses BLAS libraries, CPU performance limits, and matrix multiplication optimization.

3 comments
By @aaa370 - 3 months
Another point to consider here is that this project of writing a cuBLAS-level GEMM kernel becomes much more challenging if you are doing it with fp16, and are thus competing with the cuBLAS kernels that use tensor cores. The (theoretical) arithmetic throughput of tensor cores is ~8x higher than fp32 math on the Turing arch; I don't know off the top of my head, but I think this ratio is the same or greater for Ampere/Hopper tensor cores.

This makes the project proportionally harder in my opinion because you need to be that much more efficient with moving data through the memory hierarchy. With tensor cores, to get anywhere close to cuBLAS, you need to start with something like the most efficient kernel in Simon's article, and then do stuff like shared memory swizzling, async global memory copies, double buffering, and writing a really efficient kernel epilogue to accumulate the C matrix into the product.
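For readers who haven't touched tensor cores, here is a minimal sketch of the wmma API that the mma instructions above map to. It is my own illustration, not the commenter's kernel or CUTLASS code, and it assumes fp16 inputs with fp32 accumulation, row-major layouts, K a multiple of 16, and a single warp computing one 16x16 output tile; a real cuBLAS-class kernel layers the swizzling, cp.async copies, and double buffering mentioned above on top of this primitive.

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // One warp computes one 16x16 tile of C = A * B on tensor cores.
    // A: 16 x K row-major, B: K x 16 row-major, C: 16 x 16 row-major.
    __global__ void wmma_tile(const half *A, const half *B, float *C, int K) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
        wmma::fill_fragment(c_frag, 0.0f);

        for (int k = 0; k < K; k += 16) {
            wmma::load_matrix_sync(a_frag, A + k, K);        // tile at (0, k), leading dim K
            wmma::load_matrix_sync(b_frag, B + k * 16, 16);  // tile at (k, 0), leading dim 16
            wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // c_frag += a_frag * b_frag
        }
        wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
    }

    // Launch sketch with a single warp: wmma_tile<<<1, 32>>>(dA, dB, dC, K);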

I came across this article a while ago and it inspired me to take a stab at this^, and as of now I have gotten to ~80% of the cuBLAS tensor core performance where the kernel is mostly compute bound. I am close to giving up on the last ~20%, because I think I may need to write the inner loop in SASS to make sure the instruction mix between shared memory loads, mma instructions, and synchronizations is perfectly balanced so that none of the hardware pipelines get overloaded (see link below), and I have enough compassion for myself to not spend my free time doing stuff like that :). There are also certain things implemented in CUTLASS that seem important (look up serpentine traversal), but NVIDIA engineers won't talk about the hardware details required to understand why this helps.

Article on this is forthcoming

https://github.com/NervanaSystems/maxas/wiki/SGEMM

By @flakiness - 3 months
Note that this is from 2022.

My guess is that people nowadays are gradually moving away from raw CUDA programming towards things like Triton, etc., and you won't be focusing on pure GEMM since you tend to do some fusion.
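As a toy illustration of what fusion means here (a hypothetical example, not taken from the Triton tutorial): the bias add and activation run inside the GEMM kernel while the accumulator is still in a register, instead of as a separate elementwise pass over C.

    // Naive on purpose: one thread per output element. The point is only
    // that the epilogue (bias + ReLU) is fused into the same kernel.
    __global__ void gemm_bias_relu(int M, int N, int K, const float *A,
                                   const float *B, const float *bias, float *C) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= M || col >= N) return;

        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];

        float v = acc + bias[col];                 // fused bias add
        C[row * N + col] = v > 0.0f ? v : 0.0f;    // fused ReLU
    }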

The Triton tutorial claims their performance is on par with cuBLAS.

https://triton-lang.org/main/getting-started/tutorials/03-ma...

By @joe_the_user - 3 months
In other discussions here, people asserted that a CUDA replacement was unworkable because you couldn't replace Nvidia's cuBLAS implementation. I'm not qualified to say whether a worklog like this provides enough information to construct an adequate replacement, but I'd be interested in people's opinions.