January 31st, 2025

DeepSeek-R1 at 3,872 tokens / second on a single Nvidia HGX H200

The DeepSeek-R1 model, with 671 billion parameters, enhances reasoning through test-time scaling and reaches 3,872 tokens per second on a single NVIDIA HGX H200. NVIDIA offers easy deployment via the NIM microservice and promises future performance improvements.


DeepSeek-R1, a new open model designed for advanced reasoning and now available as an NVIDIA NIM microservice, uses a method called test-time scaling to improve the quality of its inference. The model performs multiple inference passes over a query, employing techniques such as chain-of-thought and consensus to converge on the best answer. With 671 billion parameters, DeepSeek-R1 delivers high accuracy in logical reasoning, math, coding, and language understanding, outperforming many existing models. It can process up to 3,872 tokens per second on a single NVIDIA HGX H200 system, making it suitable for real-time applications. The architecture is a large mixture-of-experts (MoE) design in which each layer has 256 experts, allowing tokens to be evaluated across experts in parallel.

The NVIDIA NIM microservice makes DeepSeek-R1 easy to deploy and customize while preserving security and data privacy for enterprises. Future hardware, including the upcoming NVIDIA Blackwell architecture, is expected to further improve performance, particularly for real-time inference. Developers can access the DeepSeek-R1 NIM microservice on NVIDIA's platform to experiment and build specialized AI agents.
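The "consensus" pass mentioned above is commonly implemented as self-consistency voting: sample several chain-of-thought completions and keep the most frequent final answer. Below is a minimal sketch of that idea; the `generate` callable and the answer-extraction rule (last non-empty line) are assumptions for illustration, not NVIDIA's or DeepSeek's actual implementation.

```python
from collections import Counter
from typing import Callable

def consensus_answer(generate: Callable[[str], str], prompt: str, num_samples: int = 8) -> str:
    """Sample several chain-of-thought completions and majority-vote on the
    final answers -- one common form of test-time scaling ("consensus")."""
    answers = []
    for _ in range(num_samples):
        completion = generate(prompt)
        # Assumption: the final answer is the last non-empty line of the completion.
        lines = [line.strip() for line in completion.splitlines() if line.strip()]
        answers.append(lines[-1] if lines else "")
    # The most frequent final answer wins the vote.
    answer, _count = Counter(answers).most_common(1)[0]
    return answer

if __name__ == "__main__":
    # Toy stand-in for a real model call, just to make the sketch executable.
    import random
    def toy_generate(prompt: str) -> str:
        return random.choice(["Reasoning...\n42", "Reasoning...\n42", "Reasoning...\n41"])
    print(consensus_answer(toy_generate, "What is 6 * 7?"))
```

In practice `generate` would call the deployed model with a nonzero sampling temperature so that the samples differ; the vote then smooths out occasional reasoning errors.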

- DeepSeek-R1 features 671 billion parameters, enhancing its reasoning capabilities.

- The model utilizes test-time scaling for improved inference quality.

- It can deliver up to 3,872 tokens per second on NVIDIA's hardware.

- The NVIDIA NIM microservice simplifies deployment and customization for enterprises (see the sketch after this list).

- Future NVIDIA architectures are expected to boost performance for real-time applications.
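
NIM microservices typically expose an OpenAI-compatible HTTP endpoint, so a deployed DeepSeek-R1 NIM can be queried with the standard OpenAI client. The sketch below assumes such an endpoint; the base URL, API key, and model identifier are placeholders to adapt to your deployment rather than confirmed values.

```python
# Minimal sketch of querying a deployed DeepSeek-R1 NIM, assuming the
# OpenAI-compatible chat endpoint that NIM microservices typically expose.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # placeholder: your NIM endpoint
    api_key="not-needed-for-local-nim",   # placeholder credential
)

response = client.chat.completions.create(
    model="deepseek-ai/deepseek-r1",      # placeholder model identifier
    messages=[{"role": "user", "content": "Which is larger, 9.11 or 9.9?"}],
    temperature=0.6,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```

The same pattern should work against NVIDIA's hosted API catalog by pointing `base_url` at the hosted endpoint and supplying an API key.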

2 comments
By @billconan - 3 months
https://news.ycombinator.com/item?id=42879864

This is Cerebras' 70B number, 1,600 tokens/sec; not sure about the costs.