November 9th, 2024

SVDQuant: 4-Bit Quantization Powers 12B Flux on a 16GB 4090 GPU with 3x Speedup

SVDQuant is a post-training quantization technique that optimizes diffusion models by reducing memory usage and increasing speed, while maintaining visual quality and enabling efficient integration with LoRA branches.

SVDQuant is a novel post-training quantization technique designed to optimize diffusion models by quantizing both weights and activations to 4 bits. It reduces memory usage by 3.6 times compared to the BF16 model and achieves an 8.7 times speedup over the 16-bit model on a 16GB NVIDIA RTX 4090 laptop. To cope with the difficulty of such aggressive quantization, the method introduces a low-rank branch that absorbs outliers, preserving visual fidelity. Performance is further enhanced by the Nunchaku inference engine, which minimizes latency by fusing the low-rank and low-bit branch kernels; by eliminating CPU offloading, the end-to-end speedup reaches 10.1 times. The 4-bit models produced by SVDQuant demonstrate superior visual quality and text alignment compared to existing baselines, making them suitable for real-time applications. Additionally, SVDQuant integrates seamlessly with LoRA branches without redundant memory access. Overall, SVDQuant represents a significant advancement in efficient AI computing for large-scale diffusion models.
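
The low-rank-plus-residual idea can be pictured with a short sketch. This is a rough illustration, not the authors' code: the rank, sizes, and the simple per-tensor scale are made-up choices, and the real method also migrates activation outliers into the weights before the split and quantizes activations as well.

  import torch

  def svdquant_weight_sketch(W: torch.Tensor, rank: int = 32):
      # 1) 16-bit low-rank branch: the top singular directions absorb the
      #    outlier-heavy part of W.
      U, S, Vh = torch.linalg.svd(W, full_matrices=False)
      L1 = U[:, :rank] * S[:rank]            # (out, rank)
      L2 = Vh[:rank, :]                      # (rank, in)

      # 2) The residual is far better behaved, so plain symmetric 4-bit
      #    quantization (levels -7..7 here) loses much less.
      R = W - L1 @ L2
      scale = R.abs().max() / 7.0
      Q = torch.clamp(torch.round(R / scale), -7, 7)

      def forward(x):                        # x: (batch, in)
          # low-bit branch + low-rank branch (fused into one pass by Nunchaku)
          return x @ (Q * scale).T + (x @ L2.T) @ L1.T
      return forward

  W = torch.randn(3072, 3072)
  layer = svdquant_weight_sketch(W)
  x = torch.randn(2, 3072)
  rel_err = (layer(x) - x @ W.T).norm() / (x @ W.T).norm()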

- SVDQuant quantizes weights and activations to 4 bits, achieving significant memory and speed improvements.

- The technique reduces model size by 3.6 times and offers an 8.7 times speedup over traditional 16-bit models.

- Nunchaku inference engine enhances performance by reducing latency through kernel fusion.

- The 4-bit models maintain high visual quality and better text alignment compared to existing quantization methods.

- SVDQuant integrates efficiently with LoRA branches without redundant memory access (sketched just below).
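
On the LoRA point, a minimal sketch of why the integration is cheap, assuming the low-rank branch keeps its two 16-bit factors (the L1/L2 names reuse the hypothetical factors from the sketch above; the actual Nunchaku mechanics may differ): a LoRA update has the same two-factor shape, so it can be concatenated onto the existing branch instead of running as a separate 16-bit pass over the full weights.

  import torch

  def fuse_lora_into_branch(L1, L2, lora_B, lora_A, alpha=1.0):
      # L1: (out, r0), L2: (r0, in)       -- existing low-rank branch
      # lora_B: (out, r), lora_A: (r, in) -- standard LoRA factors
      # The widened branch computes x @ L2.T @ L1.T + alpha * x @ A.T @ B.T
      # in a single pass; the 4-bit main branch is untouched.
      L1_fused = torch.cat([L1, alpha * lora_B], dim=1)   # (out, r0 + r)
      L2_fused = torch.cat([L2, lora_A], dim=0)           # (r0 + r, in)
      return L1_fused, L2_fused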

9 comments
By @djoldman - 6 months
This is one in a long line of posts saying "we took a model and made it smaller" and now it can run with different requirements.

It is important to keep in mind that modifying a model changes the performance of the resulting model, where performance is "correctness" or "quality" of output.

Just because the base model is very performant does not mean the smaller model is.

This means that another model that is the same size as the new quantized model may outperform the quantized model.

Suppose there are equal sized big models A and B with their smaller quantized variants a and b. A being a more performant model than B does not guarantee a being more performant than b.

By @mesmertech - 6 months
Demo on an actual 4090 with flux schnell for the next few hours: https://5jkdpo3rnipsem-3000.proxy.runpod.net/

It's basically H100 speeds on a 4090: 4.80 it/s, 1.1 seconds for flux schnell (4 steps) and 5.5 seconds for flux dev (25 steps). Compared to normal speeds (ComfyUI fp8 with the "--fast" optimization), which are 3 seconds for schnell and 11.5 seconds for dev.
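
A quick sanity check on those figures, assuming the gap between pure sampling time and the reported end-to-end time is fixed per-image overhead (text encoding, VAE decode):

  it_per_s = 4.80
  for name, steps, reported_s in [("schnell", 4, 1.1), ("dev", 25, 5.5)]:
      sampling_s = steps / it_per_s
      print(f"{name}: {sampling_s:.2f}s sampling, ~{reported_s - sampling_s:.2f}s overhead")
  # schnell: 0.83s sampling, ~0.27s overhead
  # dev: 5.21s sampling, ~0.29s overhead

Both runs are consistent with 4.80 it/s plus roughly 0.3 seconds of overhead per image.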

By @gyrovagueGeist - 6 months
This problem seems like it would be very similar to the Low-Rank + Sparse decompositions that used to be popular in audio-visual filtering.
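
For readers who haven't met that family of methods, here is a toy robust-PCA-style split (purely illustrative, not from the SVDQuant paper): W is alternately explained by a low-rank part and a small set of large-magnitude outlier entries, which mirrors how SVDQuant's low-rank branch absorbs what 4-bit quantization handles worst.

  import torch

  def low_rank_plus_sparse(W, rank=8, keep_frac=0.01, iters=25):
      # Alternate: L = best rank-`rank` fit of (W - S); S = largest residual entries.
      S = torch.zeros_like(W)
      k = max(1, int(keep_frac * W.numel()))
      for _ in range(iters):
          U, sv, Vh = torch.linalg.svd(W - S, full_matrices=False)
          L = (U[:, :rank] * sv[:rank]) @ Vh[:rank]
          R = W - L
          thresh = R.abs().flatten().kthvalue(R.numel() - k + 1).values
          S = torch.where(R.abs() >= thresh, R, torch.zeros_like(R))
      return L, S
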
By @notarealllama - 6 months
I'm convinced the path to ubiquity (such as embedded in smartphones) is quantization.

I had to int4 a llama model to get it to properly run on my 3060.

I'm curious, how much resolution / significant digits do we actually need for most genAI work? If you can draw a circle with 3.14, maybe it's good enough for fast and ubiquitous usage.
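
For context, quantizing an LLM to 4 bits for a small GPU usually looks something like the following (a hedged example using Hugging Face transformers with bitsandbytes NF4; the model id and settings are illustrative, not necessarily what this commenter used):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

  # NF4 weight-only quantization; compute stays in bf16.
  bnb_config = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_compute_dtype=torch.bfloat16,
  )

  model_id = "meta-llama/Llama-2-7b-hf"  # illustrative; any causal LM works
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(
      model_id, quantization_config=bnb_config, device_map="auto"
  )

  inputs = tokenizer("4-bit weights cut a 7B model to a few GB, so", return_tensors="pt").to(model.device)
  print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))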

By @xrd - 6 months
Can someone explain this sentence from the article:

  Diffusion models, however, are computationally bound, even for single batches, so quantizing weights alone yields limited gains.
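
One way to unpack that sentence, using made-up round numbers: when an LLM decodes a single token, each weight is read from memory once and used in one multiply-add, so latency is dominated by weight traffic and weight-only (W4A16) quantization already pays off. A diffusion transformer pushes thousands of image tokens through every layer at each step, so each weight is reused thousands of times and the big GEMMs are limited by compute rather than bandwidth; to speed those up, the arithmetic itself has to run in low precision, which is why SVDQuant quantizes activations to 4 bits as well.

  # Rough arithmetic intensity of a (tokens x d) @ (d x d) GEMM:
  # 2*tokens*d*d FLOPs against d*d weight reads.
  def flops_per_weight_byte(tokens, bytes_per_weight):
      return 2 * tokens / bytes_per_weight

  llm_decode = flops_per_weight_byte(tokens=1, bytes_per_weight=2)     # ~1 FLOP/byte  -> memory-bound
  flux_step  = flops_per_weight_byte(tokens=4096, bytes_per_weight=2)  # ~4096 FLOP/byte -> compute-bound
  # For reference, a 4090's BF16 compute-to-bandwidth ratio is on the order of
  # 100+ FLOPs per byte, so the diffusion GEMMs sit well into compute-bound territory.
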
By @atlex2 - 6 months
Seriously nobody thought to use SVD on these weight matrices before?
By @scottmas - 6 months
Possible to run this in ComfyUI?
By @DeathArrow - 6 months
But doesn't quantization give worse results? Don't you trade quality for memory footprint?