June 29th, 2024

AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x

Nscale explores the impact of GEMM tuning on AI model optimization, emphasizing throughput and latency benefits. Fine-tuning parameters and algorithm selection significantly boost speed and efficiency, especially on AMD GPUs, with throughput improvements of up to 7.2x.

Read original article

Nscale's latest technical deep dive explores the impact of GEMM (General Matrix Multiplication) tuning on AI model optimization, focusing on throughput benchmarking and latency reduction. By fine-tuning parameters and selecting optimal algorithms, GEMM tuning maximizes use of the available compute, yielding significant speedups for AI and machine learning models. The blog covers the key aspects of GEMM tuning, such as algorithm selection, parameter adjustment, and kernel configuration, and highlights the role of libraries like rocBLAS and hipBLASlt in providing optimized implementations. Benchmarks on AMD MI300X GPUs show that GEMM tuning can improve throughput by up to 7.2x and reduce latency across different models and batch sizes. Larger models such as LLaMA-2-70B and LLaMA-3-70B see the biggest throughput gains, while latency drops consistently whenever GEMM tuning is enabled. These findings underscore the role of GEMM tuning in extracting maximum performance from AMD GPUs on complex AI workloads.
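The post is summarized above rather than reproduced, so the following is only a minimal sketch of the kind of GEMM throughput measurement it describes, not Nscale's harness. It times a single GEMM shape in PyTorch and reports achieved TFLOP/s, so a tuned and an untuned BLAS backend (e.g. different rocBLAS/hipBLASlt configurations) can be compared with the same script; the matrix shape, dtype, and iteration counts are illustrative assumptions.

```python
# Minimal sketch, not Nscale's benchmark harness: time one GEMM shape and
# report achieved TFLOP/s. Shape, dtype, and iteration counts are assumptions.
import time

import torch


def bench_gemm(m: int, n: int, k: int, dtype=torch.float16, iters: int = 50):
    a = torch.randn(m, k, dtype=dtype, device="cuda")
    b = torch.randn(k, n, dtype=dtype, device="cuda")

    # Warm-up so one-time library initialisation and kernel selection are
    # excluded from the measurement.
    for _ in range(5):
        torch.matmul(a, b)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    torch.cuda.synchronize()
    seconds = (time.perf_counter() - start) / iters

    tflops = 2 * m * n * k / seconds / 1e12  # a GEMM does 2*M*N*K FLOPs
    return seconds, tflops


if __name__ == "__main__":
    # Hypothetical decode-style shape: a small batch of tokens against a
    # square 8192x8192 weight matrix.
    secs, tflops = bench_gemm(4, 8192, 8192)
    print(f"{secs * 1e6:.1f} us per GEMM, {tflops:.2f} TFLOP/s")
```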

Related

Run the strongest open-source LLM model: Llama3 70B with just a single 4GB GPU

The article discusses the release of the open-source Llama3 70B model, highlighting its performance relative to GPT-4 and Claude 3 Opus. It emphasizes training enhancements, data quality, and the competition between open- and closed-source models.

Testing AMD's Giant MI300X

AMD introduces the Instinct MI300X to challenge NVIDIA in the GPU compute market. The MI300X features a chiplet setup, Infinity Cache, and the CDNA 3 architecture, performs competitively against NVIDIA's H100, and excels in local memory bandwidth tests.

Researchers upend AI status quo by eliminating matrix multiplication in LLMs

Researchers improve the efficiency of AI language models by eliminating matrix multiplication. Their MatMul-free method reduces power consumption and cost, challenging the assumption that matrix multiplication is necessary for high-performing models.

AMD MI300X performance compared with Nvidia H100

The AMD MI300X AI GPU outperforms Nvidia's H100 in cache, latency, and inference benchmarks. It excels in caching performance and compute throughput, though AI inference performance varies. Real-world performance and ecosystem support remain essential.

AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x

Nscale explores AI model optimization through GEMM tuning, leveraging rocBLAS and hipBLASlt on AMD MI300x GPUs. Results show up to a 7.2x throughput increase and reduced latency, with the largest gains on big models and improved processing efficiency overall.

7 comments
By @saagarjha - 4 months
I don't usually like to do this, because clearly someone wrote this post (or at the very least they cared enough to have an LLM help draft it :P). Maybe it's just me and my interests not being aligned with the subject matter. But I just found myself kind of disappointed by the content.

Like, the post is about how you can do GEMM tuning on AMD GPUs, a subject which is inherently super interesting–there's a lot of nuance to writing optimal kernels, and some of this is expressed in the article, too. Combine that with an architecture that isn't Nvidia? It's an excellent setup for something that would be interesting to read.

Which makes the actual conclusion all the more disappointing, IMO. There's nothing about what the actual optimizations are. It's just "oh yeah we tuned our GEMMs and now LLaMA is faster". Like, I get that nobody actually cares about GEMM and they just want tokens to come out of their GPU. But still, that's like writing a blog post about how you can speed up your game with SIMD and then posting some charts of how Cyberpunk 2077 gives you 2x the frame rate now. Ok, but how? I just feel like the interesting part is missing.

By @teaearlgraycold - 4 months
I get the impression Nvidia thinks they have a moat with CUDA, but the current AI boom is mostly built on Python libraries that are platform agnostic, or at least can be. With enough support for AMD in PyTorch etc., the decision to buy AMD or Nvidia will come down purely to specs.
By @Lindon4290 - 4 months
Now, I don't have an MI300X, so I can't make any definite claims here. I'm hoping someone else can replicate the results shown here, or at the least educate me on how this is possible. The good part is that the Docker container and the associated steps are public - which is pretty cool!

Going by the video, the first thing that gave me pause was that a single MI300X is pulling off Groq-like performance, i.e. 314 tokens/second at batch size 1 (bs=1) with a 256-token prompt and 256 generated tokens. [1]

Llama-2-70B is 128.48 GB in FP16 (you can see this in the video). The entire model fits comfortably within the 192 GB of HBM on the MI300X - which is awesome! However, for an auto-regressive transformer model, the entire set of model weights has to be read to generate each next token during generation. These models are "next token predictors", so to speak, and you need the previous token to generate the next one. Therefore, the 128.48 GB of model weights must be streamed from HBM to the compute cores of the MI300X per generated token. Note, I am not talking about the prefill, which only needs a single forward pass to produce the first output token; every subsequent output token is auto-regressive.

The video shows a single prompt (bs=1) with a 256-token prompt and 256 generated tokens completing in 1.63 seconds. There is no tensor parallelism involved, no batching, nothing else. This is a bs=1 case on a single card, so you can reason about the math fairly easily.

This shouldn't fundamentally be possible within the specs of the MI300X. The card has a peak HBM memory bandwidth of 5.3 TB/s. To cycle through the weights (assuming FP16) 256 times, you'd need a minimum of about 6.2 seconds, even under perfectly ideal conditions. Napkin math: (256 * 128.48e9) / (5.3e12) ≈ 6.2 s.
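Spelling that napkin math out, using only the figures quoted above (128.48 GB of FP16 weights, 5.3 TB/s peak HBM bandwidth, 256 generated tokens at bs=1):

```python
# Napkin math from the comment above. Assumes the full FP16 weights are
# streamed from HBM once per generated token at batch size 1.
weights_bytes = 128.48e9       # Llama-2-70B in FP16, as shown in the video
peak_hbm_bw = 5.3e12           # MI300X peak HBM bandwidth, bytes/s
tokens_generated = 256

min_decode_seconds = tokens_generated * weights_bytes / peak_hbm_bw
print(f"lower bound: {min_decode_seconds:.2f} s")  # ~6.21 s vs. the 1.63 s shown
```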

[1] https://wow.groq.com/groq-sets-new-large-language-model-perf...

By @fancyfredbot - 4 months
Can someone knowledgeable please put these numbers in context with a comparison to an H100? For example, they report 1053 tokens per second of throughput at batch size 4 on Llama-2-70B. Is that good?
By @fancyfredbot - 4 months
The numbers in this article don't make sense. They aren't consistent with the hardware (they seem to show the weights being loaded faster than the MI300X's peak memory bandwidth), and they aren't self-consistent (70B models running only 2x slower than 7B models).

I'm not sure what's going on here but something is not right.
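As a rough version of that consistency check, using the 1053 tokens/s, batch-size-4, Llama-2-70B figure quoted above, and assuming FP16 weights (128.48 GB) read from HBM once per decode step and shared across the batch, with no quantization or speculative decoding:

```python
# Rough consistency check; all figures are taken from the comments above.
# Assumes FP16 weights are streamed from HBM once per decode step and shared
# by the whole batch, with no quantization or other tricks.
weights_bytes = 128.48e9
peak_hbm_bw = 5.3e12               # MI300X peak HBM bandwidth, bytes/s
tokens_per_second = 1053           # reported for Llama-2-70B at batch size 4
batch_size = 4

steps_per_second = tokens_per_second / batch_size   # ~263 decode steps/s
implied_bw = steps_per_second * weights_bytes        # bytes/s of weight traffic
print(f"implied weight traffic: {implied_bw / 1e12:.1f} TB/s "
      f"(peak is {peak_hbm_bw / 1e12:.1f} TB/s)")    # ~33.8 TB/s vs. 5.3 TB/s
```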

By @sva_ - 4 months
> Let’s delve into the notable advancements achieved through GEMM tuning of LLMs such as Llama

Smells like it was written by GPT