March 26th, 2025

SplitQuantV2: Enhancing Low-Bit Quantization of LLMs Without GPUs

SplitQuantV2 improves low-bit quantization of large language models without GPUs, achieving an 11.76% accuracy increase for INT4 quantization and completing preprocessing and quantization in just over two minutes.

The paper titled "SplitQuantV2: Enhancing Low-Bit Quantization of LLMs Without GPUs" introduces an algorithm aimed at improving the low-bit quantization of large language models (LLMs) without the need for high-end GPUs. Traditional quantization methods often require advanced hardware and specific deep learning frameworks, which limits their applicability on various neural processing units (NPUs) and edge AI devices. SplitQuantV2 addresses these challenges by preprocessing models to create quantization-friendly structures, allowing for efficient implementation across different platforms. The algorithm was evaluated using the Llama 3.2 1B Instruct model and AI2's Reasoning Challenge (ARC) dataset, demonstrating an 11.76% improvement in accuracy for the INT4-quantized model and bringing it on par with the original floating-point model. Notably, the preprocessing and quantization process took only 2 minutes and 6 seconds on an Apple M4 CPU, showcasing the algorithm's efficiency. SplitQuantV2 presents a practical solution for low-bit quantization, particularly beneficial for scenarios where complex algorithms are not feasible due to hardware or framework limitations.

- SplitQuantV2 enhances low-bit quantization of LLMs without requiring GPUs.

- The algorithm preprocesses models to create quantization-friendly structures.

- It achieved an 11.76% accuracy improvement in INT4 quantization compared to the original model.

- The preprocessing and quantization process is efficient, taking just over two minutes on an Apple M4 CPU.

- SplitQuantV2 is suitable for diverse platforms and edge AI devices.
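
The paper restructures model layers into quantization-friendly form before applying low-bit quantization; its exact splitting algorithm is not reproduced in this summary. As a rough, self-contained illustration of why that kind of preprocessing helps, the sketch below (my own simplification, with made-up helper names, not the authors' code) quantizes a weight vector to INT4 once with a single scale and once split into groups that each get their own scale.

```python
import numpy as np

# Minimal sketch (not the SplitQuantV2 algorithm): splitting weights into
# groups before low-bit quantization gives each group its own scale, so a few
# outlier weights no longer stretch the INT4 range used for everything else.

def quantize_int4(w: np.ndarray) -> np.ndarray:
    # Symmetric linear quantization to the signed 4-bit range [-8, 7],
    # returned in dequantized form so the error can be measured directly.
    max_abs = np.abs(w).max()
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    return np.clip(np.round(w / scale), -8, 7) * scale

def grouped_quantize_int4(w: np.ndarray, num_groups: int) -> np.ndarray:
    # Split into contiguous groups and quantize each with its own scale.
    groups = np.array_split(w, num_groups)
    return np.concatenate([quantize_int4(g) for g in groups])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 0.02, size=4096)
    w[:4] = [0.5, -0.6, 0.55, -0.45]  # a few outlier weights

    err_single = np.abs(w - quantize_int4(w)).mean()
    err_groups = np.abs(w - grouped_quantize_int4(w, num_groups=8)).mean()
    print(f"mean abs error, one scale: {err_single:.5f}")
    print(f"mean abs error, 8 groups:  {err_groups:.5f}")  # noticeably smaller
```

The point of the toy example is only the direction of the effect: giving substructures their own quantization parameters preserves more of the original weights, which is the kind of benefit a quantization-friendly restructuring is after.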

2 comments
By @imtringued - about 1 month
Extremely low-bit quantization makes me curious about why it is so effective.

Why is it better to run a bigger model with more parameters at lower numerical precision?

Obviously more parameters are better, but why is that the case, exactly? To see why, you need to understand that a transformer layer consists of the self-attention mechanism followed by a bog-standard feedforward network (usually a stack of MLP layers). Most of the parameters are here.
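
For a back-of-the-envelope check on "most of the parameters are here", here is a quick count for a generic transformer block with hidden size d and the common 4x MLP expansion (illustrative formulas and a made-up d; gated MLP variants and grouped-query attention would shift the numbers somewhat):

```python
d = 2048                       # hidden size, made-up value for illustration
attn_params = 4 * d * d        # Q, K, V and output projections
mlp_params = 2 * d * (4 * d)   # up-projection and down-projection
total = attn_params + mlp_params
print(attn_params, mlp_params, round(mlp_params / total, 2))
# ~16.8M attention weights vs ~33.6M MLP weights: roughly two thirds of the
# block's parameters sit in the feedforward part.
```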

My personal theory is based on the fact that ReLU is the simplest possible activation function that works, yet all it does is clamp negative values to zero. How could a network use that for learning?

The answer to the question is quite simple. If you take weights w_i that are negative, form the sum s = \sum_i w_i x_i + b with a positive bias b, and throw that into ReLU, you get a boolean-like function that switches off as soon as the magnitude of the negative weighted sum exceeds the bias. This means you can build a comparison operator out of ReLU. Take it a few steps further and you can probably implement any arbitrary boolean function directly in each row of your MLP.
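
To make that concrete, here is a tiny sketch (my own illustration, with invented function names, not from the paper or the comment) of a ReLU unit with a negative weight and a positive bias acting as a comparison, plus two such terms combined into a crude AND gate:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def below_threshold(x, threshold=1.0):
    # Negative weight, positive bias: output is positive only while
    # x < threshold, i.e. the unit behaves like an "x < t" comparator.
    return relu(-1.0 * x + threshold)

def rough_and(a, b):
    # Positive only when a + b > 1.5, i.e. roughly when both inputs are
    # close to 1: a fuzzy boolean AND built from a single ReLU.
    return relu(1.0 * a + 1.0 * b - 1.5)

print(below_threshold(0.3), below_threshold(2.0))  # 0.7 (on), 0.0 (off)
print(rough_and(1.0, 1.0), rough_and(1.0, 0.0))    # 0.5 (on), 0.0 (off)
```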

This means that most of the precision is only really needed during training, because you want a nicely continuously differentiable function for gradient descent, but the model itself is mostly operating on a form of fuzzy boolean logic.

This also means that the embedding length, basically the width of a token's representation, plays a key role in the ability to encode these mostly binary concepts.

Bigger models have wider token embeddings. That's why bigger models with low-bit quantization outperform smaller models with high-bit quantization.
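
A quick way to see the tradeoff behind that claim is to compare weight-memory footprints at a fixed budget (parameter counts below are made-up round numbers, not specific models):

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    # Memory for the weights alone, ignoring activations and KV cache.
    return num_params * bits_per_weight // 8

big_int4   = weight_bytes(8_000_000_000, 4)   # 8B params at INT4 -> ~4.0 GB
small_fp16 = weight_bytes(2_000_000_000, 16)  # 2B params at FP16 -> ~4.0 GB
print(big_int4, small_fp16)  # same byte budget, four times the parameters
```

If the model really is doing something closer to fuzzy boolean logic at inference time, spending that fixed budget on more parameters rather than more bits per parameter is the better deal.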

By @mentalgear - about 1 month
I feel like for many tasks there's a certain "good enough" threshold that local small LMs can reach just as well, but privately, so no cloud LLM is needed. I think the future is mostly on-device SLMs and their agentic coordination.

In that sense, a local agentic framework (JS/TS based) will soon become very relevant.