Pushing the Limits of LLM Quantization via the Linearity Theorem
The paper presents advances in large language model quantization: a linearity theorem, a new data-free method called HIGGS, and non-uniform per-layer quantization, improving accuracy-compression trade-offs across a range of models.
The paper titled "Pushing the Limits of Large Language Model Quantization via the Linearity Theorem" presents advances in the quantization of large language models (LLMs) to reduce their memory and computational costs. Traditional methods have focused on minimizing per-layer errors without a solid theoretical foundation, often leading to suboptimal results. The authors introduce a "linearity theorem" that connects layer-wise reconstruction error with the model perplexity increase due to quantization. This insight enables two key innovations: a data-free quantization method called HIGGS, which uses Hadamard rotations and MSE-optimal grids and outperforms existing data-free methods like NF4; and a dynamic programming approach that determines non-uniform per-layer quantization levels under a given compression constraint. The practical impact of these methods is demonstrated through improved accuracy-compression trade-offs on Llama-3.1, Llama-3.2, and Qwen-family models. Additionally, the proposed methods are designed to be efficiently supported by GPU kernels across various batch sizes.
- The paper introduces a linearity theorem linking quantization error and model perplexity.
- A new data-free quantization method, HIGGS, outperforms existing techniques.
- The study provides an optimal solution for non-uniform per-layer quantization levels.
- Improved accuracy-compression trade-offs are demonstrated on multiple LLM models.
- The methods are supported by efficient GPU kernels across a range of batch sizes.
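To make the HIGGS recipe described above more concrete, here is a minimal, illustrative sketch (not the authors' implementation): apply a random Hadamard rotation so the weights look approximately Gaussian, snap them to a small grid fitted to a standard normal by Lloyd-Max iteration (an MSE-optimal grid), and undo the rotation. All function names and the 16-level (~4-bit) grid size are assumptions made for illustration.

```python
import numpy as np
from scipy.linalg import hadamard  # requires the dimension to be a power of two

def random_hadamard(n, rng):
    """Orthonormal Hadamard matrix with random column sign flips."""
    H = hadamard(n).astype(np.float64) / np.sqrt(n)
    return H * rng.choice([-1.0, 1.0], size=n)

def mse_optimal_grid(num_levels, rng, samples=50_000, iters=50):
    """Approximate the MSE-optimal grid for N(0, 1) via Lloyd-Max iteration."""
    data = rng.standard_normal(samples)
    grid = np.quantile(data, (np.arange(num_levels) + 0.5) / num_levels)
    for _ in range(iters):
        idx = np.argmin(np.abs(data[:, None] - grid[None, :]), axis=1)
        for k in range(num_levels):
            members = data[idx == k]
            if members.size:
                grid[k] = members.mean()
    return np.sort(grid)

def quantize_dequantize(W, grid, rng):
    """Rotate, normalize per row, snap each entry to the grid, then invert both steps."""
    Q = random_hadamard(W.shape[1], rng)
    Wr = W @ Q                                   # rotated rows look roughly Gaussian
    scale = Wr.std(axis=1, keepdims=True)
    idx = np.argmin(np.abs((Wr / scale)[:, :, None] - grid), axis=2)
    return (grid[idx] * scale) @ Q.T             # dequantize and undo the rotation

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 64))               # stand-in for one weight matrix
codebook = mse_optimal_grid(16, rng)             # 16 levels, roughly 4 bits per weight
W_hat = quantize_dequantize(W, codebook, rng)
print("relative reconstruction MSE:", np.mean((W - W_hat) ** 2) / np.mean(W ** 2))
```

This only shows the rotate, snap-to-grid, un-rotate idea; the paper additionally allocates bitwidths per layer via the linearity theorem and provides GPU kernel support, which the sketch does not attempt.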
The authors present two main contributions:
* a data-free LLM quantization method which they claim outperforms all prior data-free approaches, including NF4; and
* a method which they claim is optimal for finding non-uniform per-layer quantization levels that match a given compression constraint in the "medium bitwidth" regime (see the dynamic-programming sketch below).
They demonstrate improved accuracy-compression trade-offs on popular LLMs.
Thank you for sharing this on HN.
Bringing this up because the abstract (and the mention of rotations) reminded me of recent LLM interpretability posts.
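To illustrate the second contribution above (choosing per-layer quantization levels under a compression constraint), here is a hedged sketch rather than the paper's algorithm: because the linearity theorem lets the perplexity increase be treated as a sum of independent per-layer error terms, picking one bitwidth per layer under a total size budget becomes a multiple-choice knapsack that a small dynamic program solves exactly over a discretized budget. The inputs errors[i][b] and sizes[i][b] are assumed to come from external measurement; allocate_bitwidths and the toy numbers below are hypothetical.

```python
import math

def allocate_bitwidths(errors, sizes, budget):
    """Pick one option per layer minimizing total error with total size <= budget."""
    n = len(errors)
    INF = math.inf
    best = [0.0] + [INF] * budget        # best[c]: minimal total error at exact cost c
    back = []                            # back[i][c]: option chosen for layer i to reach cost c
    for i in range(n):
        new_best = [INF] * (budget + 1)
        layer_back = [-1] * (budget + 1)
        for c in range(budget + 1):
            if best[c] == INF:
                continue
            for opt, (err, size) in enumerate(zip(errors[i], sizes[i])):
                nc = c + size
                if nc <= budget and best[c] + err < new_best[nc]:
                    new_best[nc] = best[c] + err
                    layer_back[nc] = opt
        best = new_best
        back.append(layer_back)
    end = min(range(budget + 1), key=lambda c: best[c])
    if best[end] == INF:
        raise ValueError("budget too small for any assignment")
    plan, c = [], end
    for i in reversed(range(n)):         # walk the back-pointers to recover the plan
        opt = back[i][c]
        plan.append(opt)
        c -= sizes[i][opt]
    return list(reversed(plan)), best[end]

# Toy usage: three layers, candidate bitwidths of 2, 3, 4 bits per layer, with
# made-up per-layer error proxies and sizes in arbitrary budget units.
errors = [[0.9, 0.3, 0.1], [0.8, 0.2, 0.05], [0.5, 0.15, 0.04]]
sizes  = [[2, 3, 4],       [4, 6, 8],        [2, 3, 4]]
plan, total_error = allocate_bitwidths(errors, sizes, budget=13)
print("option index per layer:", plan, "total error:", total_error)
```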
Related
A beginner's guide to LLM quantization and testing
Quantization in machine learning involves reducing model parameters to lower precision for efficiency. Methods like GGUF are explored, impacting model size and performance. Extreme quantization to 1-bit values is discussed, along with practical steps using tools like Llama.cpp for optimizing deployment on various hardware.
A Visual Guide to LLM Quantization
Quantization reduces the memory footprint of large language models by converting high-precision parameters to lower-precision formats, maintaining accuracy while minimizing storage. Covered methods include symmetric and asymmetric quantization (a generic sketch appears after this list of links).
Fine-Tuning LLMs to 1.58bit
BitNet brings extreme quantization to large language models, achieving 1.58 bits per parameter to improve efficiency and performance; the post demonstrates fine-tuning Llama 3 8B models and integrating the approach into existing frameworks.
We Ran Over Half a Million Evaluations on Quantized LLMs
Neural Magic ran over 500,000 evaluations on quantized large language models and found that they recovered over 99% of the accuracy of their full-precision counterparts, highlighting their effectiveness across a wide range of applications.
SplitQuantV2: Enhancing Low-Bit Quantization of LLMs Without GPUs
SplitQuantV2 improves low-bit quantization of large language models without GPUs, achieving an 11.76% accuracy increase in INT4 quantization, and efficiently processes models in just over two minutes.
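As a generic illustration of the symmetric versus asymmetric quantization mentioned in the visual guide above (not code taken from that guide), the sketch below maps a float array to 8-bit integers either with a scale alone (symmetric, zero maps to zero) or with a scale plus a zero point (asymmetric); the function names are hypothetical.

```python
import numpy as np

def quantize_symmetric(x, bits=8):
    """Signed-integer quantization with a scale only (zero maps exactly to zero)."""
    qmax = 2 ** (bits - 1) - 1                       # 127 for int8
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale                                  # dequantize with q * scale

def quantize_asymmetric(x, bits=8):
    """Unsigned-integer quantization with a scale and a zero point."""
    qmax = 2 ** bits - 1                             # 255 for uint8
    scale = (x.max() - x.min()) / qmax
    zero_point = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point                      # dequantize with (q - zero_point) * scale

x = np.random.default_rng(0).normal(0.2, 1.0, size=1024).astype(np.float32)
q_s, s = quantize_symmetric(x)
q_a, s_a, zp = quantize_asymmetric(x)
print("symmetric mean abs error: ", np.abs(x - q_s * s).mean())
print("asymmetric mean abs error:", np.abs(x - (q_a.astype(np.float32) - zp) * s_a).mean())
```

Asymmetric quantization spends its integer range on the actual min-max interval of the data, which helps when the distribution is shifted away from zero, at the cost of storing a zero point alongside the scale.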