April 20th, 2025

Pushing the Limits of LLM Quantization via the Linearity Theorem

The paper advances large language model quantization by introducing a linearity theorem, a new data-free method called HIGGS, and an optimal non-uniform per-layer quantization scheme, improving accuracy-compression trade-offs across several model families.

The paper titled "Pushing the Limits of Large Language Model Quantization via the Linearity Theorem" presents advances in the quantization of large language models (LLMs) aimed at reducing their memory and computational costs. Traditional methods minimize per-layer reconstruction error without a solid theoretical link to end-to-end quality, which often leads to suboptimal results. The authors introduce a "linearity theorem" connecting layer-wise reconstruction error to the model's perplexity increase under quantization. This insight enables two key contributions: HIGGS, a data-free quantization method built on Hadamard rotations and MSE-optimal grids that outperforms existing data-free methods such as NF4; and a dynamic programming approach for choosing non-uniform per-layer quantization levels that satisfy a given compression constraint in the medium-bitwidth regime. The practical impact is demonstrated through improved accuracy-compression trade-offs on Llama-3.1, Llama-3.2, and Qwen-family models, and both methods are supported by efficient GPU kernels across a range of batch sizes.
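
In rough terms (the notation below is mine, not the paper's), the theorem says that the perplexity increase caused by quantization behaves approximately like a weighted sum of per-layer mean-squared reconstruction errors:

```latex
% Hedged sketch of the stated relationship; the weights c_l are layer-specific
% constants, and the exact form used in the paper may differ.
\Delta\,\mathrm{PPL} \;\approx\; \sum_{l=1}^{L} c_l \,\bigl\| W_l - \widehat{W}_l \bigr\|_F^2
```

Because the right-hand side is additive across layers, minimizing it under a bit budget decomposes into independent per-layer error measurements plus a global allocation step, which is what makes a dynamic-programming formulation possible.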

- The paper introduces a linearity theorem linking quantization error and model perplexity.

- A new data-free quantization method, HIGGS, outperforms existing techniques (a code sketch follows after this list).

- The study provides an optimal solution for choosing non-uniform per-layer quantization levels under a compression constraint (see the dynamic-programming sketch after this list).

- Improved accuracy-compression trade-offs are demonstrated on multiple LLM models.

- The methods are backed by efficient GPU kernels across a range of batch sizes.
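
To make the HIGGS recipe concrete, here is a minimal scalar sketch of the idea as I understand it. This is my own illustration, not the authors' code: the function names `gaussian_mse_grid` and `higgs_like_quantize` are hypothetical, and the grid is approximated with Lloyd-Max iterations rather than the paper's exact MSE-optimal grids. The flow is: rotate weight groups with a randomized Hadamard transform so their entries look approximately Gaussian, then round each entry to a grid tuned for a standard Gaussian.

```python
import numpy as np
from scipy.linalg import hadamard


def gaussian_mse_grid(num_levels, num_samples=200_000, iters=30, seed=0):
    """Approximate an MSE-optimal (Lloyd-Max) scalar grid for N(0, 1) samples.
    Illustrative stand-in for the paper's MSE-optimal grids."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(num_samples)
    # Start from evenly spaced quantiles, then run Lloyd iterations.
    levels = np.quantile(x, (np.arange(num_levels) + 0.5) / num_levels)
    for _ in range(iters):
        idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
        levels = np.array([x[idx == k].mean() if np.any(idx == k) else levels[k]
                           for k in range(num_levels)])
    return np.sort(levels)


def higgs_like_quantize(W, bits=4, group_size=64, seed=0):
    """Data-free quantization sketch: randomized Hadamard rotation, then
    nearest-neighbor rounding onto a Gaussian MSE-optimal grid."""
    d = group_size
    assert W.size % d == 0, "toy restriction: weight count must divide into groups"
    rng = np.random.default_rng(seed)
    H = hadamard(d) / np.sqrt(d)               # orthonormal Hadamard matrix (d is a power of 2)
    R = H * rng.choice([-1.0, 1.0], size=d)    # random sign flips -> randomized orthogonal rotation
    X = W.reshape(-1, d) @ R                   # rotated groups look approximately Gaussian
    scale = X.std(axis=1, keepdims=True) + 1e-12
    grid = gaussian_mse_grid(2 ** bits)
    codes = np.abs((X / scale)[:, :, None] - grid).argmin(axis=-1)  # 'bits'-bit integer codes
    X_hat = grid[codes] * scale                # dequantize on the rotated side
    W_hat = (X_hat @ R.T).reshape(W.shape)     # undo the rotation
    return W_hat, codes


if __name__ == "__main__":
    W = np.random.default_rng(1).standard_normal((256, 256))
    W_hat, _ = higgs_like_quantize(W, bits=4)
    print("relative MSE:", np.mean((W - W_hat) ** 2) / np.mean(W ** 2))
```

The rotation is what makes the method data-free: since the rotated weights are close to Gaussian regardless of the layer, a single precomputed Gaussian grid can be reused everywhere without calibration data. The real method additionally needs kernels that fuse the rotation into inference, which this sketch ignores.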
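
The dynamic-programming piece can be sketched in the same spirit: once the linearity theorem lets each (layer, bitwidth) choice be scored by an additive error term, picking per-layer bitwidths under a total-bit budget is a small knapsack-style DP. The code below is my own hypothetical illustration, not the paper's formulation; the error table would in practice come from measured per-layer reconstruction errors scaled by the theorem's layer constants.

```python
def allocate_bitwidths(errors, layer_params, budget_bits, choices=(2, 3, 4, 8)):
    """Choose one bitwidth per layer to minimize total additive error under a bit budget.

    errors[l][b]   : error contribution of layer l at bitwidth b (e.g. c_l * MSE_l(b))
    layer_params[l]: number of parameters in layer l
    budget_bits    : total bit budget across all layers
    """
    unit = min(layer_params)          # coarse budget granularity; real code would pick finer units
    cap = budget_bits // unit
    n = len(layer_params)
    INF = float("inf")
    best = [[INF] * (cap + 1) for _ in range(n + 1)]
    pick = [[None] * (cap + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for l in range(n):
        for used in range(cap + 1):
            if best[l][used] == INF:
                continue
            for b in choices:
                cost = (layer_params[l] * b + unit - 1) // unit  # bits for layer l at width b, in units
                nxt = used + cost
                if nxt <= cap and best[l][used] + errors[l][b] < best[l + 1][nxt]:
                    best[l + 1][nxt] = best[l][used] + errors[l][b]
                    pick[l + 1][nxt] = (used, b)
    end = min(range(cap + 1), key=lambda u: best[n][u])
    if best[n][end] == INF:
        raise ValueError("budget too small for the given bitwidth choices")
    assignment, u = [], end
    for l in range(n, 0, -1):         # backtrack through the stored choices
        u, b = pick[l][u]
        assignment.append(b)
    return assignment[::-1], best[n][end]


if __name__ == "__main__":
    # Hypothetical 3-layer example: errors fall as bitwidth grows.
    errors = [{2: 1.0, 3: 0.4, 4: 0.15, 8: 0.01}] * 3
    params = [1_000_000, 1_000_000, 1_000_000]
    widths, total_err = allocate_bitwidths(errors, params, budget_bits=10_000_000)
    print(widths, total_err)          # prints something like [3, 3, 4] under this budget
```

This mirrors the standard knapsack recurrence; the paper's actual solution for the "medium bitwidth" regime may differ in how it discretizes the budget and scores the per-layer errors, but the additive structure is what the linearity theorem provides.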

3 comments
By @cs702 - 6 days
The OP looks like good work, but it's definitely not a quick read. The authors claim theoretical breakthroughs that enable:

* a data-free LLM quantization method which they claim outperforms all prior data-free approaches, including NF4; and

* a method which they claim is optimal for finding non-uniform per-layer quantization levels which match a given compression constraint in the "medium bitwidth" regime.

They demonstrate improved accuracy-compression trade-offs on popular LLMs.

Thank you for sharing this on HN.

By @Scene_Cast2 - 6 days
Given our modern understanding of how LLMs work (like the recent Anthropic interpretability work), I wonder if that insight can be used to quantize better. For example, we know that LLMs encode concepts in the directions (rather than the magnitudes) of activations across groups of neurons.

Bringing this up because the abstract (and the mention of rotations) reminded me of recent LLM interpretability posts.