March 26th, 2025

SplitQuantV2: Enhancing Low-Bit Quantization of LLMs Without GPUs

SplitQuantV2 improves low-bit quantization of large language models without GPUs, achieving an 11.76% accuracy increase for INT4 quantization and completing preprocessing and quantization in just over two minutes.

The paper titled "SplitQuantV2: Enhancing Low-Bit Quantization of LLMs Without GPUs" introduces an algorithm aimed at improving the low-bit quantization of large language models (LLMs) without the need for high-end GPUs. Traditional quantization methods often require advanced hardware and specific deep learning frameworks, which limits their applicability on various neural processing units (NPUs) and edge AI devices. SplitQuantV2 addresses these challenges by preprocessing models to create quantization-friendly structures, allowing for efficient implementation across different platforms. The algorithm was evaluated using the Llama 3.2 1B Instruct model and AI2's Reasoning Challenge (ARC) dataset, demonstrating an 11.76% improvement in accuracy for the INT4-quantized model and bringing it on par with the original floating-point model. Notably, the preprocessing and quantization process took only 2 minutes and 6 seconds on an Apple M4 CPU, showcasing the algorithm's efficiency. SplitQuantV2 presents a practical solution for low-bit quantization, particularly beneficial for scenarios where complex algorithms are not feasible due to hardware or framework limitations.

- SplitQuantV2 enhances low-bit quantization of LLMs without requiring GPUs.

- The algorithm preprocesses models to create quantization-friendly structures.

- It achieved an 11.76% accuracy improvement in INT4 quantization compared to the original model.

- The preprocessing and quantization process is efficient, taking just over two minutes on an Apple M4 CPU.

- SplitQuantV2 is suitable for diverse platforms and edge AI devices.
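
The paper restructures model layers into quantization-friendly form before applying low-bit quantization; its exact splitting algorithm is not reproduced in this summary. As a rough, self-contained illustration of why that kind of preprocessing helps, the sketch below (my own simplification, with made-up helper names, not the authors' code) quantizes a weight vector to INT4 once with a single scale and once split into groups that each get their own scale.

```python
import numpy as np

# Minimal sketch (not the SplitQuantV2 algorithm): splitting weights into
# groups before low-bit quantization gives each group its own scale, so a few
# outlier weights no longer stretch the INT4 range used for everything else.

def quantize_int4(w: np.ndarray) -> np.ndarray:
    # Symmetric linear quantization to the signed 4-bit range [-8, 7],
    # returned in dequantized form so the error can be measured directly.
    max_abs = np.abs(w).max()
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    return np.clip(np.round(w / scale), -8, 7) * scale

def grouped_quantize_int4(w: np.ndarray, num_groups: int) -> np.ndarray:
    # Split into contiguous groups and quantize each with its own scale.
    groups = np.array_split(w, num_groups)
    return np.concatenate([quantize_int4(g) for g in groups])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 0.02, size=4096)
    w[:4] = [0.5, -0.6, 0.55, -0.45]  # a few outlier weights

    err_single = np.abs(w - quantize_int4(w)).mean()
    err_groups = np.abs(w - grouped_quantize_int4(w, num_groups=8)).mean()
    print(f"mean abs error, one scale: {err_single:.5f}")
    print(f"mean abs error, 8 groups:  {err_groups:.5f}")  # noticeably smaller
```

The point of the toy example is only the direction of the effect: giving substructures their own quantization parameters preserves more of the original weights, which is the kind of benefit a quantization-friendly restructuring is after.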

2 comments
By @imtringued - about 1 month
Extremely low-bit quantization makes me curious about why it is so effective.

Why is it better to run a bigger model with more parameters at lower numerical precision?

Obviously more parameters are better, but why is that the case, exactly? To see why, you need to understand that a transformer layer consists of the self-attention mechanism followed by a bog-standard feedforward network (usually a stack of MLP layers). Most of the parameters are here.
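
For a back-of-the-envelope check on "most of the parameters are here", here is a quick count for a generic transformer block with hidden size d and the common 4x MLP expansion (illustrative formulas and a made-up d; gated MLP variants and grouped-query attention would shift the numbers somewhat):

```python
d = 2048                       # hidden size, made-up value for illustration
attn_params = 4 * d * d        # Q, K, V and output projections
mlp_params = 2 * d * (4 * d)   # up-projection and down-projection
total = attn_params + mlp_params
print(attn_params, mlp_params, round(mlp_params / total, 2))
# ~16.8M attention weights vs ~33.6M MLP weights: roughly two thirds of the
# block's parameters sit in the feedforward part.
```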

My personal theory is based on the fact that ReLU is the simplest possible activation function that works, yet all it does is clamp negative values to zero. How could a network use that for learning?

The answer to the question is quite simple. If you take weights w_i that are negative, form the sum s = \sum_i w_i x_i + b with a positive bias b, and throw that into ReLU, you get a boolean-like function that switches off as soon as the magnitude of the negative weighted sum exceeds the bias. This means you can build a comparison operator out of ReLU. Take it a few steps further and you can probably implement any arbitrary boolean function directly in each row of your MLP.
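
To make that concrete, here is a tiny sketch (my own illustration, with invented function names, not from the paper or the comment) of a ReLU unit with a negative weight and a positive bias acting as a comparison, plus two such terms combined into a crude AND gate:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def below_threshold(x, threshold=1.0):
    # Negative weight, positive bias: output is positive only while
    # x < threshold, i.e. the unit behaves like an "x < t" comparator.
    return relu(-1.0 * x + threshold)

def rough_and(a, b):
    # Positive only when a + b > 1.5, i.e. roughly when both inputs are
    # close to 1: a fuzzy boolean AND built from a single ReLU.
    return relu(1.0 * a + 1.0 * b - 1.5)

print(below_threshold(0.3), below_threshold(2.0))  # 0.7 (on), 0.0 (off)
print(rough_and(1.0, 1.0), rough_and(1.0, 0.0))    # 0.5 (on), 0.0 (off)
```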

This means that most of the precision is only really needed during training, because you want a nicely continuously differentiable function for gradient descent, but the model itself is mostly operating on a form of fuzzy boolean logic.

This also means that the embedding length, basically the width of a token's representation, plays a key role in the ability to encode these mostly binary concepts.

Bigger models have wider token embeddings. That's why bigger models with low-bit quantization outperform smaller models with high-bit quantization.
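
A quick way to see the tradeoff behind that claim is to compare weight-memory footprints at a fixed budget (parameter counts below are made-up round numbers, not specific models):

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    # Memory for the weights alone, ignoring activations and KV cache.
    return num_params * bits_per_weight // 8

big_int4   = weight_bytes(8_000_000_000, 4)   # 8B params at INT4 -> ~4.0 GB
small_fp16 = weight_bytes(2_000_000_000, 16)  # 2B params at FP16 -> ~4.0 GB
print(big_int4, small_fp16)  # same byte budget, four times the parameters
```

If the model really is doing something closer to fuzzy boolean logic at inference time, spending that fixed budget on more parameters rather than more bits per parameter is the better deal.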

By @mentalgear - about 1 month
I feel like for many tasks there's a certain "good enough" threshold that local small LMs can reach just as well, but privately, so no cloud LLM is needed. I think the future is mostly on-device SLMs and their agentic coordination.

In that sense, a local agentic framework (JS/TS based) will soon become very relevant.