We Ran Over Half a Million Evaluations on Quantized LLMs
Neural Magic's study ran over 500,000 evaluations on quantized large language models, finding that they recover over 99% of full-precision accuracy, highlighting their effectiveness for a range of applications.
Neural Magic conducted over 500,000 evaluations of quantized large language models (LLMs) to assess their accuracy compared to uncompressed models. The study aimed to address concerns within the machine learning community about the potential loss of accuracy when models are quantized to lower-precision formats such as 8-bit or 4-bit. The results indicated that, despite initial skepticism, quantized models can achieve accuracy levels comparable to their full-precision counterparts. The evaluation focused on the Llama 3.1 series, testing various quantization schemes across academic and real-world benchmarks. The findings revealed that quantized models achieved over 99% accuracy recovery on academic benchmarks and performed competitively on real-world tasks, including chat and code generation. The study also highlighted the importance of hyperparameter tuning and the choice of quantization algorithm in achieving optimal performance. Overall, the results suggest that quantized LLMs can effectively balance computational efficiency with high accuracy, making them suitable for deployment in a wide range of applications.
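The summary does not spell out how "accuracy recovery" is computed; a minimal sketch, assuming it is simply the quantized model's benchmark score expressed as a percentage of the full-precision baseline and averaged over tasks (the scores below are hypothetical, not results from the study):

```python
# Minimal sketch of the "accuracy recovery" metric, assuming it is the
# quantized model's score as a percentage of the full-precision baseline.
# All scores below are hypothetical, not results from the study.

def recovery(quantized_score: float, baseline_score: float) -> float:
    """Percent of the baseline benchmark score retained after quantization."""
    return 100.0 * quantized_score / baseline_score

# Hypothetical (quantized, full-precision) scores for one model/scheme pair.
scores = {
    "arc_challenge": (0.592, 0.596),
    "gsm8k": (0.741, 0.748),
    "mmlu": (0.672, 0.675),
}

per_task = {task: recovery(q, fp) for task, (q, fp) in scores.items()}
average = sum(per_task.values()) / len(per_task)

for task, r in per_task.items():
    print(f"{task:15s} {r:6.2f}% recovery")
print(f"{'average':15s} {average:6.2f}% recovery")
```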
- Neural Magic ran over 500,000 evaluations on quantized LLMs to assess their accuracy.
- Quantized models showed over 99% accuracy recovery compared to full-precision models.
- The Llama 3.1 series was the primary focus, with various quantization schemes tested; a rough INT8 sketch follows this list.
- Results indicate quantized models perform well in both academic and real-world benchmarks.
- Hyperparameter tuning and algorithm selection are crucial for optimizing model performance.
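As a rough illustration of what weight-only 8-bit quantization does to a checkpoint (this is not Neural Magic's actual pipeline, which uses calibrated algorithms such as GPTQ with finer-grained scales), a symmetric per-tensor INT8 round-trip in NumPy might look like this:

```python
import numpy as np

# Illustrative symmetric, per-tensor, weight-only INT8 quantization.
# Real schemes typically use per-channel or per-group scales and calibration.

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float weights to int8 with a single symmetric scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for computation."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32) * 0.02  # toy weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("max abs error:", np.abs(w - w_hat).max())
print("memory: fp32", w.nbytes // 2**20, "MiB -> int8", q.nbytes // 2**20, "MiB")
```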
-> Long story short: when models are carefully quantized, everything looks good. The lowest accuracy recovery we found was 96% relative to the unquantized baseline, and it occurred only for the 8B model at weight-only INT4 quantization, mostly because the unquantized baseline model has close-to-random prediction accuracy on a couple of Leaderboard v2 benchmarks. As one could imagine, "recovering" random accuracy is a bit noisy.
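To see why recovery against a near-random baseline is noisy, consider a hypothetical 4-choice benchmark where chance is about 25%: a one-point absolute gap between the baseline and the quantized model already reads as roughly 96% recovery, even though both models are essentially guessing. The numbers below are illustrative, not taken from the study:

```python
# Hypothetical numbers: a 4-choice benchmark where random guessing scores ~25%.
baseline = 0.26    # full-precision model, barely above chance
quantized = 0.25   # quantized model, at chance

recovery = 100.0 * quantized / baseline
print(f"{recovery:.1f}% recovery from a one-point absolute gap")  # ~96.2%
```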
Related
A beginner's guide to LLM quantization and testing
Quantization in machine learning involves reducing model parameters to lower precision for efficiency. Methods like GGUF are explored, impacting model size and performance. Extreme quantization to 1-bit values is discussed, along with practical steps using tools like Llama.cpp for optimizing deployment on various hardware.
Llama 3 Secrets Every Engineer Must Know
Llama 3 is an advanced open-source language model trained on 15 trillion multilingual tokens, featuring 405 billion parameters, improved reasoning, and multilingual capabilities, while exploring practical applications and limitations.
A Visual Guide to LLM Quantization
Quantization reduces the memory footprint of large language models by converting high-precision parameters to lower-precision formats, maintaining accuracy while minimizing storage. Various methods include symmetric and asymmetric quantization.
LLMs know more than what they say
Log10's latent space readout (LSR) enhances evaluation accuracy for large language models, being 20 times more sample efficient than traditional methods, allowing rapid customization and improving hallucination detection and numeric scoring.
Fine-Tuning LLMs to 1.58bit
BitNet introduces extreme quantization for large language models, achieving 1.58 bits per parameter, enhancing efficiency and performance, particularly in fine-tuning Llama3 8B models while integrating into existing frameworks.