SVDQuant: 4-Bit Quantization Powers 12B Flux on a 16GB 4090 GPU with 3x Speedup
SVDQuant is a post-training quantization technique that optimizes diffusion models by reducing memory usage and increasing speed, while maintaining visual quality and enabling efficient integration with LoRA branches.
SVDQuant is a novel post-training quantization technique designed to optimize diffusion models by quantizing both weights and activations to 4 bits. This method significantly reduces memory usage by 3.6 times compared to the BF16 model and achieves an 8.7 times speedup over the 16-bit model on a 16GB NVIDIA RTX 4090 laptop. The technique addresses the challenges of quantization by introducing a low-rank branch that effectively absorbs outliers, thus maintaining visual fidelity. SVDQuant's performance is further enhanced by the Nunchaku inference engine, which minimizes latency by fusing low-rank and low-bit branch kernels, resulting in a total speedup of 10.1 times by eliminating CPU offloading. The 4-bit models produced by SVDQuant demonstrate superior visual quality and text alignment compared to existing baselines, making them suitable for real-time applications. Additionally, SVDQuant allows for seamless integration with LoRA branches without redundant memory access, enhancing its efficiency. Overall, SVDQuant represents a significant advancement in the field of efficient AI computing, particularly for large-scale diffusion models.
- SVDQuant quantizes weights and activations to 4 bits, achieving significant memory and speed improvements.
- The technique reduces model size by 3.6 times and offers an 8.7 times speedup over traditional 16-bit models.
- The Nunchaku inference engine enhances performance by reducing latency through kernel fusion.
- The 4-bit models maintain high visual quality and better text alignment compared to existing quantization methods.
- SVDQuant allows for efficient integration with LoRA branches, optimizing memory usage.
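To make the core idea more concrete, here is a minimal NumPy sketch (illustrative only, not the paper's actual code or API) of the weight-side decomposition: the weight matrix is factored into a high-precision low-rank branch obtained via SVD plus a residual that is quantized to 4 bits, so the low-rank branch absorbs the hard-to-quantize outlier directions. The rank, the simulated per-channel quantizer, and all function names are assumptions; the real method additionally smooths and quantizes activations and runs fused low-bit kernels, which this sketch does not attempt.

```python
# Minimal sketch of a low-rank branch plus 4-bit residual (illustrative names only).
import numpy as np

def quantize_int4(x, axis=-1):
    """Simulated symmetric 4-bit quantization (integer values in [-8, 7]) per row."""
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(x / scale), -8, 7)
    return q, scale

def svdquant_decompose(W, rank=32):
    """Split W into a rank-`rank` high-precision branch plus a 4-bit residual.
    The low-rank branch soaks up the largest singular directions (where the
    outliers concentrate), leaving a residual with a smaller dynamic range
    that is easier to quantize to 4 bits."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    L1 = U[:, :rank] * S[:rank]            # (out, rank)
    L2 = Vt[:rank, :]                      # (rank, in)
    residual = W - L1 @ L2
    q, scale = quantize_int4(residual)
    return L1, L2, q, scale

def forward(x, L1, L2, q, scale):
    """y = x W^T approximated by the low-rank branch plus dequantized residual."""
    W_hat = L1 @ L2 + q * scale
    return x @ W_hat.T

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))
W[:, :4] *= 20.0                           # a few outlier input channels
x = rng.normal(size=(8, 512))

L1, L2, q, scale = svdquant_decompose(W, rank=32)
err = np.linalg.norm(x @ W.T - forward(x, L1, L2, q, scale)) / np.linalg.norm(x @ W.T)
print(f"relative output error with low-rank branch: {err:.4f}")
```

Raising the rank shrinks the residual's dynamic range and hence the 4-bit quantization error, at the cost of a slightly larger high-precision branch that must be fused with the low-bit kernel to avoid extra latency.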
Related
A beginner's guide to LLM quantization and testing
Quantization in machine learning involves reducing model parameters to lower precision for efficiency. Methods like GGUF are explored, impacting model size and performance. Extreme quantization to 1-bit values is discussed, along with practical steps using tools like Llama.cpp for optimizing deployment on various hardware.
Fine-Tuning LLMs to 1.58bit
BitNet introduces extreme quantization for large language models, achieving 1.58 bits per parameter, enhancing efficiency and performance, particularly in fine-tuning Llama3 8B models while integrating into existing frameworks.
INT-FlashAttention: Enabling Flash Attention for INT8 Quantization
The paper presents INT-FlashAttention, a new architecture combining FlashAttention with INT8 quantization, achieving 72% faster inference and 82% less quantization error, while supporting various data formats.
We Ran Over Half a Million Evaluations on Quantized LLMs
Neural Magic's study evaluated over 500,000 quantized large language models, finding they achieved over 99% accuracy compared to full-precision models, highlighting their effectiveness for various applications.
Quantized Llama models with increased speed and a reduced memory footprint
Meta has launched lightweight quantized Llama models for mobile devices, achieving 2-4x speedup and 56% size reduction. These models support short-context applications and prioritize on-device data processing for privacy.
It is important to keep in mind that modifying a model changes the performance of the resulting model, where performance means the "correctness" or "quality" of its output.
Just because the base model is very performant does not mean the smaller model is.
This means that another model of the same size as the new quantized model may outperform it.
Suppose there are equal-sized big models A and B with their smaller quantized variants a and b. A being more performant than B does not guarantee that a is more performant than b.
It's basically H100 speeds with a 4090: 4.80 it/s, 1.1 seconds for Flux Schnell (4 steps) and 5.5 seconds for Flux Dev (25 steps), compared to normal speeds (ComfyUI fp8 with the "--fast" optimization) of 3 seconds for Schnell and 11.5 seconds for Dev.
I had to quantize a Llama model to int4 to get it to run properly on my 3060.
I'm curious: how much resolution, i.e. how many significant digits, do we actually need for most genAI work? If you can draw a circle with 3.14, maybe that's good enough for fast and ubiquitous usage.
Diffusion models, however, are computationally bound, even for single batches, so quantizing weights alone yields limited gains.
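To see why, here is a rough back-of-the-envelope sketch (all layer sizes and token counts below are assumed for illustration, not measurements): it compares the arithmetic intensity, FLOPs per byte of memory traffic, of a single linear layer for single-token LLM decoding versus a diffusion transformer step over thousands of image tokens. In the decoding case weight traffic dominates, so 4-bit weights alone pay off; in the diffusion case compute dominates, which is why SVDQuant quantizes activations to 4 bits as well, so the matrix multiplications themselves can run on low-bit hardware.

```python
# Rough roofline-style estimate: FLOPs per byte of memory traffic for one linear
# layer. All sizes are assumed for illustration, not measurements.

def arithmetic_intensity(tokens, d_in, d_out, weight_bits, act_bits=16):
    flops = 2 * tokens * d_in * d_out                    # multiply-accumulates
    weight_bytes = d_in * d_out * weight_bits / 8
    act_bytes = tokens * (d_in + d_out) * act_bits / 8   # inputs + outputs
    return flops / (weight_bytes + act_bytes)

# LLM decoding: one token at a time, so weight traffic dominates (memory-bound)
# and shrinking the weights to 4 bits is already a big win.
print("LLM decode, 16-bit weights:", round(arithmetic_intensity(1, 4096, 4096, 16), 1))
print("LLM decode,  4-bit weights:", round(arithmetic_intensity(1, 4096, 4096, 4), 1))

# Diffusion transformer: thousands of image tokens per step even at batch 1, so
# the layer is compute-bound and weight-only quantization barely moves the needle;
# the matmul itself has to run in low precision to get faster.
print("Diffusion,  16-bit weights:", round(arithmetic_intensity(4096, 4096, 4096, 16), 1))
print("Diffusion,   4-bit weights:", round(arithmetic_intensity(4096, 4096, 4096, 4), 1))
```

For recent GPUs the roofline ridge point is roughly on the order of a hundred or a few hundred FLOPs per byte, so the decoding case above sits well below it (memory-bound) while the diffusion case sits far above it (compute-bound) regardless of the weight format.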