January 27th, 2025

Run DeepSeek R1 Dynamic 1.58-bit

DeepSeek-R1 is an open-source alternative to OpenAI's O1, reduced from 720GB to 131GB via dynamic quantization. It runs on a wide range of hardware, with benchmarks showing it still produces valid outputs, apart from occasional bad tokens.

DeepSeek-R1 has emerged as a competitive open-source alternative to OpenAI's O1 reasoning model, and it has now been shrunk dramatically through quantization. The model, originally 720GB, has been compressed to 131GB while remaining functional. This was accomplished by keeping certain sensitive layers at higher precision while quantizing the rest down to 1.58 bits, which preserves output quality far better than naive uniform quantization. The 1.58-bit version fits in 160GB of VRAM for fast inference, or can run with as little as 20GB of RAM on a CPU, albeit much more slowly. Several dynamic quantized versions have been released, and benchmarks indicate that the 1.58-bit model produces valid outputs, although occasional incorrect tokens may appear.

The architecture of DeepSeek R1 uses a mixture-of-experts (MoE) design, in which only a subset of expert sub-networks is activated for each token, so the parameter count can grow without a proportional increase in compute per token. The model's quality was evaluated with a Flappy Bird game-generation task, on which it scored well across several criteria. The dynamic quantization code is available on GitHub, and the model can run on a variety of systems, including machines without GPUs. The blog post gives detailed instructions for downloading and running the model, emphasizing the importance of proper hardware configuration for good performance.
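As a concrete starting point, the quantized files live in the unsloth/DeepSeek-R1-GGUF repository on Hugging Face (linked in the comments below). Here is a minimal sketch of fetching them with the huggingface_hub library; the allow_patterns value is an assumption about how the 1.58-bit shards are named and should be checked against the repository's file listing.

```python
# Minimal sketch: download only the 1.58-bit GGUF shards.
# Assumes `pip install huggingface_hub`; the "*UD-IQ1_S*" pattern is an
# assumed naming convention; verify it against the repo's file list.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/DeepSeek-R1-GGUF",
    local_dir="DeepSeek-R1-GGUF",
    allow_patterns=["*UD-IQ1_S*"],  # assumed pattern for the 1.58-bit variant
)
```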

- DeepSeek-R1 is an open-source model rivaling OpenAI's O1.

- The model size was reduced from 720GB to 131GB through selective quantization.

- The 1.58-bit version runs quickly on 160GB of VRAM, or slowly on as little as 20GB of CPU RAM (a runnable sketch follows this list).

- Performance benchmarks show the model generates valid outputs, with some minor errors.

- Dynamic quantization code is available on GitHub for user implementation.

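As a rough illustration of the VRAM/RAM trade-off in the list above, the sketch below uses the llama-cpp-python bindings rather than the llama.cpp CLI the post itself documents; the model path and the n_gpu_layers value are placeholders to tune to your hardware.

```python
# Hypothetical sketch with llama-cpp-python (`pip install llama-cpp-python`).
# n_gpu_layers controls how many transformer layers are offloaded to VRAM:
# 0 keeps everything on the CPU; -1 offloads every layer.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S.gguf",  # placeholder path
    n_gpu_layers=0,   # raise this as far as your VRAM allows
    n_ctx=4096,       # context window; larger values need more memory
)

out = llm("Write Flappy Bird in Python.", max_tokens=512)
print(out["choices"][0]["text"])
```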
4 comments
By @danielhanchen - 3 months
Oh thanks for sharing this! The fork of llama.cpp showing how to do the dynamic quant is here: https://github.com/unslothai/llama.cpp. I also found min_p = 0.05 can help reduce the chance of bad tokens coming up at 1.58-bit (I found it happens roughly once per 8,000 tokens).
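For context, min_p sampling discards candidate tokens whose probability falls below a fraction of the most likely token's probability, which is what filters out the rare bad tokens described above. A hypothetical way to apply that value through the llama-cpp-python bindings (the min_p keyword follows that library's completion API in recent versions; llama.cpp's own CLI exposes the same setting as --min-p):

```python
# Hypothetical sketch: apply min_p = 0.05 at generation time.
# Tokens with probability below 5% of the top token's are dropped.
from llama_cpp import Llama

llm = Llama(model_path="DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S.gguf")  # placeholder

out = llm(
    "Explain mixture-of-experts in one paragraph.",
    max_tokens=256,
    min_p=0.05,  # the value suggested in the comment above
)
print(out["choices"][0]["text"])
```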
By @homarp - 3 months
"The 1.58bit quantization should fit in 160GB of VRAM for fast inference"

Instructions for llama.cpp: https://huggingface.co/unsloth/DeepSeek-R1-GGUF#instructions...