August 21st, 2024

GPU utilization can be a misleading metric

The article critiques GPU utilization as a performance metric, advocating for Model FLOPs Utilization (MFU) and SM Efficiency to better assess GPU performance and improve training efficiency in machine learning tasks.

The article discusses the limitations of using GPU utilization as the primary metric for assessing GPU performance in machine learning tasks. While GPU utilization, typically read from tools like nvidia-smi, can show high usage, it does not necessarily reflect the GPU's actual computational efficiency. The authors point out that you can reach 100% GPU utilization by merely reading and writing memory without performing any computation, which makes the metric misleading on its own. Instead, they advocate for Model FLOPs Utilization (MFU), which compares observed throughput to the GPU's theoretical maximum, as a more accurate measure of performance. The authors found that their model training achieved only about 20% MFU despite 100% GPU utilization, indicating significant room for improvement. They also emphasize monitoring SM Efficiency, which measures the fraction of streaming multiprocessors active during GPU operations, to identify inefficiencies in model execution. By optimizing their training loop and implementing fused kernels, they achieved a 4x speedup in training time and increased MFU to 38%. The article concludes by recommending that AI teams track SM Efficiency alongside GPU utilization for a more comprehensive understanding of GPU performance.
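
As a rough illustration of the MFU arithmetic described above (a hedged sketch; the model size, throughput, and peak-FLOPs figures below are assumptions, not numbers from the article): for transformer training, a common approximation is ~6 FLOPs per parameter per trained token, so MFU can be estimated from token throughput alone.

```python
# Minimal MFU sketch. Assumptions (not from the article): ~6 FLOPs per
# parameter per trained token, and ~989 TFLOPs/s peak (H100, dense BF16).

def estimate_mfu(n_params: float, tokens_per_second: float,
                 peak_flops: float = 989e12) -> float:
    """Model FLOPs Utilization: achieved FLOPs/s over the theoretical peak."""
    achieved_flops_per_s = 6.0 * n_params * tokens_per_second
    return achieved_flops_per_s / peak_flops

# E.g. a hypothetical 7B-parameter model training at 4,700 tokens/s per GPU:
print(f"MFU: {estimate_mfu(7e9, 4_700):.1%}")  # ~20%
```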

- GPU Utilization can be misleading as it does not reflect actual computational efficiency.

- Model FLOPs Utilization (MFU) is a better metric for assessing GPU performance.

- Monitoring SM Efficiency helps identify inefficiencies in model execution.

- Optimizing training loops and using fused kernels can significantly improve performance (see the sketch after this list).

- AI teams should track both SM Efficiency and GPU Utilization for better insights.
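
On the fused-kernels bullet above: the article doesn't show its exact changes, so as a hedged sketch, here are two common PyTorch moves of that kind — a fused optimizer step and torch.compile — applied to a hypothetical stand-in model.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in model; the article's actual model is not shown.
model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).cuda()

# fused=True performs the AdamW update in one CUDA kernel instead of many
# small ones (supported on CUDA in recent PyTorch releases).
opt = torch.optim.AdamW(model.parameters(), lr=1e-4, fused=True)

# torch.compile can fuse elementwise ops and cut kernel-launch overhead.
model = torch.compile(model)

x = torch.randn(32, 4096, device="cuda")
for _ in range(10):
    loss = model(x).square().mean()
    loss.backward()
    opt.step()
    opt.zero_grad(set_to_none=True)
```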

Related

Researchers upend AI status quo by eliminating matrix multiplication in LLMs

Researchers innovate AI language models by eliminating matrix multiplication, enhancing efficiency. A MatMul-free method reduces power consumption and costs, challenging the necessity of matrix multiplication in high-performing models.

AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x

Nscale explores AI model optimization through GEMM tuning, leveraging rocBLAS and hipBLASlt for AMD MI300x GPUs. Results show up to 7.2x throughput increase and reduced latency, benefiting large models and enhancing processing efficiency.

Ex-Twitter dev reminisces about finding 700 unused Nvidia GPUs after takeover

Tim Zaman, a former Twitter engineer, revealed that 700 idle Nvidia V100 GPUs sat in Twitter's data center after Elon Musk's acquisition, highlighting inefficiencies in resource management amid rising AI demands.

A practitioner's guide to testing and running GPU clusters

A practitioner's guide to testing and running GPU clusters

The article emphasizes the importance of systematic acceptance testing for GPU clusters in AI training, addressing hardware reliability, performance validation, and the need for efficient storage and communication systems.

AI: What people are saying
The discussion around GPU performance metrics reveals several key insights and common themes.
  • GPU utilization is often misleading, as high utilization can occur without effective computation, prompting a shift toward metrics like Model FLOPs Utilization (MFU) and SM Efficiency.
  • Alternative metrics, such as GPU watt usage and application-specific metrics, are gaining traction for assessing performance more accurately.
  • Tools like roofline plots and Nsight Compute are recommended for analyzing performance and identifying bottlenecks in model training.
  • Users express challenges in maximizing GPU efficiency and seek guidance on optimizing their setups, including the use of advanced features like MPS.
  • There is a consensus that understanding and recalibrating expectations around GPU performance is essential for tapping into their full potential.
11 comments
By @SnowflakeOnIce - 8 months
> you can get 100% GPU utilization by just reading/writing to memory while doing 0 computations

Indeed! Utilization is a proxy for what you actually want (which is good use of available hardware). 100% GPU utilization doesn't actually indicate this.

On the other hand, if you aren't getting 100% GPU utilization, you aren't making good use of the hardware.
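
A hedged sketch of what the parent quote describes — a loop that only moves memory still drives nvidia-smi's utilization column to ~100%, because that counter only reports whether any kernel was resident during the sample window:

```python
import torch

# Pure device-to-device memory traffic: essentially zero floating-point
# work, yet `nvidia-smi` reports ~100% GPU utilization while this runs.
x = torch.empty(1 << 28, device="cuda")  # ~1 GiB of fp32
y = torch.empty_like(x)
for _ in range(10_000):
    y.copy_(x)
torch.cuda.synchronize()
```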

By @antognini - 8 months
When understanding the performance of your model it's very helpful to look at a roofline plot [1]. The roofline plot will show you the floating-point performance as a function of arithmetic intensity for the various ops in your model. The plot has two regimes: a memory-bound regime on the left and a compute-bound regime on the right. This can help to identify memory-bound ops that are taking a significant fraction of compute time.

[1]: https://en.wikipedia.org/wiki/Roofline_model
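
A small sketch of the roofline arithmetic (the peak numbers are assumptions for an H100-class GPU, not from the comment): attainable throughput is the minimum of the compute peak and arithmetic intensity times memory bandwidth.

```python
# Roofline model: attainable FLOPs/s for an op with a given arithmetic
# intensity (FLOPs performed per byte moved to/from memory).
PEAK_FLOPS = 989e12  # assumed dense BF16 peak, FLOPs/s
PEAK_BW = 3.35e12    # assumed HBM3 bandwidth, bytes/s

def attainable(ai: float) -> float:
    return min(PEAK_FLOPS, ai * PEAK_BW)

ridge = PEAK_FLOPS / PEAK_BW  # ~295 FLOPs/byte; ops below this are memory-bound
for ai in (1.0, 10.0, ridge, 1000.0):
    print(f"AI={ai:7.1f} FLOPs/byte -> {attainable(ai) / 1e12:7.1f} TFLOPs/s")
```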

By @sundalia - 8 months
Application-specific metrics are the way to go. For ML training this is one example: https://cloud.google.com/blog/products/ai-machine-learning/g...
By @sergiotapia - 8 months
running GPU models and maximizing utilization is pretty opaque to me as a layman coming into the scene.

take this example: https://gist.github.com/sergiotapia/efc9b3f7163ba803a260b481... - running a fairly simple model that takes only 70ms per image pair, but because I have 300 images it becomes a big time sink.

By using ThreadPoolExecutor, I cut that down to about 16 seconds. I wonder if there is a fairly obvious way to truly utilize my beefy L40S GPU! Is it MPS? I haven't been successful at even running the MPS daemon on my Linux server yet. Very opaque for sure!
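
Not the commenter's actual code, but the usual fix for this pattern: threads don't add GPU parallelism for a single model, whereas batching the 300 pairs into larger tensors typically does. A hedged sketch with a hypothetical stand-in model:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the commenter's model and 300 image pairs
# (each pair's channels stacked into one 6-channel tensor).
model = nn.Sequential(nn.Conv2d(6, 32, 3), nn.ReLU(), nn.AdaptiveAvgPool2d(1)).cuda().eval()
pairs = torch.randn(300, 6, 224, 224)

# Instead of 300 single-pair calls (or threads), run large batches:
with torch.inference_mode():
    for batch in pairs.split(64):  # batch size is a tunable assumption
        out = model(batch.cuda(non_blocking=True))
```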

By @DamonsJ - 8 months
"If we have a CUDA kernel that continuously runs for 10 seconds but only uses 1 SM, on an H100, this would register 100% utilization, but the SM efficiency would be 1 / 132 = 0.7%."

Does this situation really register 100% utilization? BTW, SM occupancy is also a metric you need to care about if you're concerned with kernel efficiency.

By @saagarjha - 8 months
If you have a basic understanding of what your kernels are supposed to do, looking at pipe usage and roofline analysis in Nsight Compute is often helpful, since it will show you how hard you’re saturating those.
By @pavelstoev - 8 months
I recommend hidet backend in torch.compile - implements many advanced model-specific optimizations automatically. https://github.com/hidet-org/hidet
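
For reference, a minimal hedged sketch of selecting that backend (assuming hidet is installed and registers itself with torch.compile, per the linked repo):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
model_opt = torch.compile(model, backend="hidet")  # backend name per hidet's docs
y = model_opt(torch.randn(8, 1024, device="cuda"))
```
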
By @areichenbach - 8 months
I’ve recently been trusting gpu watt usage over utilization. Any idea how good that is as a simple proxy (if I’m just looking at nvidia-smi)?
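
One hedged way to read that proxy programmatically, via NVML's power counters through the pynvml bindings (both calls return milliwatts):

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
draw_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0           # mW -> W
limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0  # mW -> W
print(f"power: {draw_w:.0f} W / {limit_w:.0f} W ({draw_w / limit_w:.0%})")
pynvml.nvmlShutdown()
```
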
By @danielvaughn - 8 months
We ran into a similar problem with CPU utilization at my job. Created an alert for when our systems hit 90% CPU util, and ended up with a ton of noise. We realized that for some of our workloads, this was normal and expected.
By @ScoutOrgo - 8 months
As someone who is familiar with using nvidia-smi to track utilization, what are some commands people use to track SM efficiency? The end of the article had some references, but no examples of what to use explicitly.
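
One option is NVIDIA DCGM, which exposes SM activity and occupancy as profiling fields (the field IDs below are assumptions to verify against the DCGM docs; note that the "sm" column in `nvidia-smi dmon` is still the coarse utilization counter, not SM efficiency):

```python
import subprocess

# Stream DCGM profiling fields (requires DCGM installed):
# 1002 = DCGM_FI_PROF_SM_ACTIVE, 1003 = DCGM_FI_PROF_SM_OCCUPANCY
subprocess.run(["dcgmi", "dmon", "-e", "1002,1003"])
```
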
By @AeZ1E - 8 months
gpu utilization is not everything, people! mfus are where it's at. time to recalibrate those expectations and tap into the true potential of your gpus. brace yourselves, the real efficiency is yet to come!