November 11th, 2024

Optimizing a WebGPU Matmul Kernel for 1 TFLOP

Zach Nussbaum optimized a WebGPU matrix multiplication kernel to exceed 1 TFLOP performance and introduced Surfgrad, a library for browser-based tensor operations, highlighting WebGPU's advantages over traditional APIs.

Zach Nussbaum discusses the optimization of a WebGPU matrix multiplication (matmul) kernel to achieve over 1 TFLOP performance. He introduces Surfgrad, a WebGPU-based autograd library aimed at facilitating tensor operations in the browser. The article highlights the advantages of WebGPU over traditional APIs like CUDA, particularly its ability to run on various hardware and its support for compute shaders. Nussbaum explains the challenges of matrix multiplication in deep learning, noting that native WebGPU support is limited to small matrices. He details the development of several kernels, starting with a naive implementation that achieves only 1.64 GFLOPS. Subsequent optimizations include increasing the number of threads per workgroup, utilizing 2D workgroups, and implementing kernel tiling to enhance performance. These improvements allow for the computation of larger matrices and better memory access patterns, ultimately leading to significant performance gains. The article serves as both a technical guide and an exploration of WebGPU's potential in machine learning applications.
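
To make the starting point concrete, a naive WGSL matmul kernel of the kind the article begins with assigns one thread per output element. The sketch below is a hypothetical reconstruction, not Surfgrad's actual code; the buffer names, bind group layout, and 8×8 workgroup size are assumptions.

```typescript
// Minimal sketch of a naive WebGPU matmul kernel (one thread per output element).
// All names, bindings, and the 8x8 workgroup size are illustrative assumptions,
// not Surfgrad's actual implementation.
const naiveMatmulWGSL = /* wgsl */ `
struct Dims { M : u32, N : u32, K : u32 }

@group(0) @binding(0) var<storage, read>       a    : array<f32>; // M x K
@group(0) @binding(1) var<storage, read>       b    : array<f32>; // K x N
@group(0) @binding(2) var<storage, read_write> c    : array<f32>; // M x N
@group(0) @binding(3) var<uniform>             dims : Dims;

@compute @workgroup_size(8, 8)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  let row = gid.y;
  let col = gid.x;
  if (row >= dims.M || col >= dims.N) { return; }

  // Each invocation walks an entire row of A and column of B in global memory,
  // which is why this style of kernel tops out at only a few GFLOPS.
  var acc = 0.0;
  for (var k = 0u; k < dims.K; k = k + 1u) {
    acc = acc + a[row * dims.K + k] * b[k * dims.N + col];
  }
  c[row * dims.N + col] = acc;
}`;
```

The host side would dispatch roughly ceil(N/8) × ceil(M/8) workgroups; the later kernels described in the article then increase the number of threads per workgroup, move to 2D workgroups, and add tiling on top of this basic structure.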

- Zach Nussbaum optimized a WebGPU matmul kernel to exceed 1 TFLOP performance.

- Surfgrad is a new WebGPU-powered autograd library for browser-based tensor operations.

- WebGPU offers advantages over CUDA, including cross-platform compatibility and enhanced compute shader support.

- The article outlines various kernel optimizations, including increasing thread counts and implementing kernel tiling (a rough sketch of the tiling idea follows this list).

- The optimizations enable the handling of larger matrices and improve overall computational efficiency.
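
To give a feel for what the tiling bullet above refers to, here is a hypothetical sketch of a shared-memory tiled kernel; it is not Surfgrad's code, and the 16×16 tile size, names, and bindings are assumptions. Each workgroup stages a tile of A and a tile of B in workgroup memory so that global memory is read far less often per multiply-accumulate.

```typescript
// Hypothetical sketch of kernel tiling: each 16x16 workgroup cooperatively loads
// tiles of A and B into workgroup (shared) memory, then accumulates from there.
// Tile size, names, and bindings are illustrative assumptions.
const tiledMatmulWGSL = /* wgsl */ `
const TILE : u32 = 16u;

struct Dims { M : u32, N : u32, K : u32 }

@group(0) @binding(0) var<storage, read>       a    : array<f32>;
@group(0) @binding(1) var<storage, read>       b    : array<f32>;
@group(0) @binding(2) var<storage, read_write> c    : array<f32>;
@group(0) @binding(3) var<uniform>             dims : Dims;

var<workgroup> tileA : array<array<f32, 16>, 16>;
var<workgroup> tileB : array<array<f32, 16>, 16>;

@compute @workgroup_size(16, 16)
fn main(@builtin(global_invocation_id) gid : vec3<u32>,
        @builtin(local_invocation_id)  lid : vec3<u32>) {
  let row = gid.y;
  let col = gid.x;
  var acc = 0.0;

  let numTiles = (dims.K + TILE - 1u) / TILE;
  for (var t = 0u; t < numTiles; t = t + 1u) {
    // Cooperatively load one tile of A and one tile of B into shared memory,
    // zero-padding out-of-range elements so the inner loop needs no branches.
    let aCol = t * TILE + lid.x;
    let bRow = t * TILE + lid.y;
    tileA[lid.y][lid.x] = select(0.0, a[row * dims.K + aCol], row < dims.M && aCol < dims.K);
    tileB[lid.y][lid.x] = select(0.0, b[bRow * dims.N + col], bRow < dims.K && col < dims.N);
    workgroupBarrier();

    for (var k = 0u; k < TILE; k = k + 1u) {
      acc = acc + tileA[lid.y][k] * tileB[k][lid.x];
    }
    workgroupBarrier();
  }

  if (row < dims.M && col < dims.N) {
    c[row * dims.N + col] = acc;
  }
}`;
```

In the article's progression, this kind of tiling, combined with more threads per workgroup and 2D workgroups, is what moves throughput from the naive 1.64 GFLOPS toward the 1 TFLOP mark.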

Related

Beating NumPy's matrix multiplication in 150 lines of C code

Aman Salykov's blog explores high-performance matrix multiplication in C, surpassing NumPy with OpenBLAS on AMD Ryzen 7700 CPU. Scalable, portable code optimized for modern CPUs with FMA3 and AVX instructions, parallelized with OpenMP for scalability and performance. Discusses matrix multiplication's significance in neural networks, BLAS libraries' role, CPU performance limits, and optimizing implementations without low-level assembly. Mentions fast matrix multiplication tutorials and upcoming GPU optimization post.

Beating NumPy matrix multiplication in 150 lines of C

Aman Salykov's blog explores high-performance matrix multiplication in C, surpassing NumPy with OpenBLAS on AMD Ryzen 7700 CPU. Scalable, portable code optimized for modern CPUs with OpenMP directives for parallelization. Discusses BLAS libraries, CPU performance limits, and matrix multiplication optimization.

How to Optimize a CUDA Matmul Kernel for CuBLAS-Like Performance: A Worklog

The author optimizes a CUDA matrix multiplication kernel, achieving 93.7% of cuBLAS performance through techniques like memory coalescing and shared memory caching, emphasizing GPU performance in deep learning applications.

Fast Multidimensional Matrix Multiplication on CPU from Scratch

The article examines multidimensional matrix multiplication performance on CPUs using Numpy and C++. It discusses optimization techniques and challenges in replicating Numpy's efficiency, emphasizing the importance of memory access patterns.

What Shapes Do Matrix Multiplications Like?

The article explains how matrix multiplication performance varies with shape, emphasizing larger matrices enhance GPU efficiency. Key factors include compute intensity, tiling, and wave quantization, necessitating manual adjustments for optimization.

AI: What people are saying
The comments on the article discuss various aspects of WebGPU and matrix multiplication performance.
  • Users compare WebGPU's performance to CUDA, noting that WebGPU achieves about 17% of the M2's peak performance, while CUDA can reach around 75%.
  • There are inquiries about the choice of the naive algorithm for matrix multiplication instead of faster alternatives, suggesting a desire for educational insights.
  • Several commenters share links to their own projects or articles related to matrix multiplication optimization, indicating a collaborative interest in the topic.
  • Concerns are raised about hardware-specific details, such as bank conflicts, that may affect WebGPU's performance compared to CUDA.
  • Some users express optimism about WebGPU's potential to enhance web applications, particularly in mapping technologies.
10 comments
By @shihab - 6 months
Great article!

For context: this WebGPU version achieves ~17% of the peak theoretical performance of the M2. With CUDA (i.e. cuBLAS), you can reach ~75% of peak performance for the same matrix config (without tensor cores).

By @mkeeter - 6 months
For a very deep dive into the subject, this is a great writeup:

How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance (https://siboehm.com/articles/22/CUDA-MMM)

(It's CUDA-specific, so there may be aspects that can't yet be ported to WGPU)

By @inglor - 6 months
Can you explain why you used the naive algorithm here and not any of the fast matrix multiplication ones that trade multiplications for more additions? Is it just for educational purposes, or is there a performance benefit to the technique?
By @pama - 6 months
To clarify the title: TFLOP/s is the unit the author goes after, not TFLOP. People in the threads compare CUDA performance on GPUs to WebAssembly performance: please recall that H100 has a theoretical performance of about 1000 TFLOP/s for bfloat16, and even moderately complicated algorithms in typical modern transformer architectures can reach about half of that performance.
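
For the unit distinction raised here: a square matmul performs a fixed 2·N³ floating point operations (so many TFLOPs of work), and dividing by elapsed time gives the TFLOP/s rate. A small hypothetical helper, not from the article or the thread, makes the arithmetic concrete.

```typescript
// Work vs. throughput: a square N x N x N matmul does 2 * N^3 floating point
// operations (a fixed amount of TFLOPs); dividing by elapsed time gives TFLOP/s.
// Hypothetical helper for illustration only.
function matmulTflops(n: number, elapsedMs: number): { tflop: number; tflopPerSec: number } {
  const flop = 2 * n ** 3;                        // total work performed
  return {
    tflop: flop / 1e12,                           // how much work was done
    tflopPerSec: flop / 1e12 / (elapsedMs / 1e3), // how fast it was done
  };
}

// Example: a 4096^3 matmul is ~0.137 TFLOP of work; finishing it in ~137 ms is ~1 TFLOP/s.
console.log(matmulTflops(4096, 137));
```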
By @FL33TW00D - 6 months
I wrote something similar a while back: https://github.com/FL33TW00D/wgpu-mm

Also does quantized matmuls.

By @coffeeaddict1 - 6 months
You can do slightly better fairly easily I think. See here for example https://github.com/AnswerDotAI/gpu.cpp/pull/35
By @Const-me - 6 months
A couple of years ago, I did about the same thing in HLSL, for a Direct3D 11.0 compute shader. Here’s the fastest version I managed to make back then: https://github.com/Const-me/Cgml/blob/master/Mistral/Mistral...

As you can see, I implemented 32×32 tiling using thread groups of 32×8 threads and two groupshared buffers to load tiles of the input matrices, and I accumulate into local variables: 32 / 8 = 4 accumulators per thread.

By @billconan - 6 months
WebGPU doesn't seem to talk about bank conflicts, hiding some hardware details that might be necessary to write the best kernel. Will it be able to match the performance of CUDA on the same hardware?
By @maelito - 6 months
WebGPU will make web maps even more competitive than they already are.

The smoothness of an iPhone map zoom, on any device.