October 18th, 2024

We Ran Over Half a Million Evaluations on Quantized LLMs

Neural Magic's study ran over 500,000 evaluations on quantized large language models, finding they recovered over 99% of the accuracy of full-precision models, highlighting their effectiveness for various applications.

Neural Magic conducted over 500,000 evaluations of quantized large language models (LLMs) to assess their accuracy relative to uncompressed models. The study addressed concerns within the machine learning community about the accuracy lost when models are quantized to lower-precision formats such as 8-bit or 4-bit. The results indicated that, despite initial skepticism, quantized models can achieve accuracy comparable to their full-precision counterparts. The evaluation focused on the Llama 3.1 series, testing various quantization schemes across academic and real-world benchmarks. Quantized models recovered over 99% of baseline accuracy on academic benchmarks and performed competitively on real-world tasks, including chat and code generation. The study also highlighted the importance of hyperparameter tuning and the choice of quantization algorithm in achieving optimal performance. Overall, the results suggest that quantized LLMs can balance computational efficiency with high accuracy, making them suitable for deployment in a range of applications.
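
To make the "accuracy recovery" metric concrete, here is a minimal sketch of how per-benchmark recovery and its average might be computed. The benchmark names, scores, and the helper function are hypothetical illustrations, not the study's actual numbers or code:

```python
# Hypothetical illustration of the "accuracy recovery" metric:
# the quantized model's score divided by the full-precision baseline's score.
# All names and values below are made up for demonstration only.

baseline_scores = {"mmlu": 0.68, "arc_challenge": 0.60, "humaneval": 0.75}
quantized_scores = {"mmlu": 0.676, "arc_challenge": 0.597, "humaneval": 0.748}

def accuracy_recovery(quantized: float, baseline: float) -> float:
    """Return recovery as a percentage of the unquantized baseline."""
    return 100.0 * quantized / baseline

per_benchmark = {
    name: accuracy_recovery(quantized_scores[name], baseline_scores[name])
    for name in baseline_scores
}
average_recovery = sum(per_benchmark.values()) / len(per_benchmark)

for name, recovery in per_benchmark.items():
    print(f"{name}: {recovery:.1f}% recovery")
print(f"average: {average_recovery:.1f}% recovery")
```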

- Neural Magic evaluated quantized LLMs through over 500,000 tests to assess accuracy.

- Quantized models showed over 99% accuracy recovery compared to full-precision models.

- The Llama 3.1 series was the primary focus, with various quantization schemes tested.

- Results indicate quantized models perform well in both academic and real-world benchmarks.

- Hyperparameter tuning and algorithm selection are crucial for optimizing model performance.

3 comments
By @anotherhue - about 1 month
> In conclusion, our comprehensive evaluation demonstrates that quantized models maintain impressive accuracy and quality compared to their full-precision counterparts, making them an essential tool for optimizing LLMs in real-world deployments.
By @eldar_ciki - about 1 month
The ML community has recently questioned whether quantized LLMs can genuinely compete with their full-precision counterparts. To address this, we conducted over half a million evaluations on quantized Llama-3.1-{8B, 70B, 405B}-Instruct models across FP8, INT8, and INT4 quantization schemes. We looked at various benchmarks, from open-ended challenges like Arena-Hard to rigorous academic benchmarks such as MMLU, MMLU-Pro, Big Bench Hard, ARC-Challenge, IFEval, GPQA (and others from OpenLLM Leaderboard v1 and v2), and coding tests like HumanEval and HumanEval+.
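
For readers unfamiliar with the schemes named above, the sketch below shows naive round-to-nearest, per-channel, weight-only INT4 quantization. It is a simplified assumption-laden illustration of the general idea only; production schemes typically use calibration data and error-correcting algorithms rather than plain rounding:

```python
import numpy as np

# Naive round-to-nearest, symmetric, per-output-channel, weight-only INT4
# quantization. Illustrative only; not the algorithm used in the study.

def quantize_int4_per_channel(weights: np.ndarray):
    """Quantize a 2-D weight matrix to INT4 with one scale per output row."""
    # Symmetric INT4 range is [-8, 7]; scale so the max magnitude maps to 7.
    max_abs = np.abs(weights).max(axis=1, keepdims=True)
    scales = max_abs / 7.0
    q = np.clip(np.round(weights / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Map INT4 codes back to floating point for use in matmuls."""
    return q.astype(np.float32) * scales

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, scales = quantize_int4_per_channel(w)
w_hat = dequantize(q, scales)
print("max abs quantization error:", np.abs(w - w_hat).max())
```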

-> Long story short: when models are carefully quantized, everything looks good. The lowest accuracy recovery we found was 96% relative to the unquantized baseline, and it happened only for the 8B model with weight-only INT4 quantization, mostly because the unquantized baseline has close to random prediction accuracy on a couple of Leaderboard v2 benchmarks. As one could imagine, "recovering" random accuracy is a bit noisy.
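
To see why recovery relative to a near-random baseline is noisy, consider a 4-option multiple-choice benchmark where chance is about 25%: a one-point absolute swing moves the relative recovery number by roughly four points, as the toy calculation below (made-up scores, not the study's data) shows:

```python
# Toy numbers only: near the random-guessing floor of a 4-option benchmark,
# a tiny absolute change produces a large swing in the relative recovery metric.

baseline = 26.0        # unquantized model, barely above chance (hypothetical)
quantized_low = 25.0   # one point lower (hypothetical)
quantized_high = 27.0  # one point higher (hypothetical)

print(f"recovery (low):  {100 * quantized_low / baseline:.1f}%")   # ~96.2%
print(f"recovery (high): {100 * quantized_high / baseline:.1f}%")  # ~103.8%
```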