January 25th, 2025

DeepSeek V3 and the cost of frontier AI models

DeepSeek AI launched its DeepSeek V3 model, which outperforms competitors like GPT-4o on challenging evaluations. It features innovative, efficient training techniques, but its full development cost is far higher than the headline training figure suggests, which shapes how its competitive position should be read.


DeepSeek AI recently launched DeepSeek V3, a mixture-of-experts (MoE) model trained on 14.8 trillion tokens, with 671 billion total parameters and 37 billion active per token. The model performs strongly on challenging evaluations, outperforming notable competitors such as GPT-4o and Claude 3.5 Sonnet, although user feedback suggests it lacks the polish of products like ChatGPT. The accompanying technical report details the innovations behind its efficiency, including multi-head latent attention (MLA) and a refined mixture-of-experts design, and the training run used significantly fewer GPU hours than comparable models, indicating a high level of resource optimization.

However, the reported training cost does not capture the full scope of expenses, such as prior research and operational overhead. The actual cost of ownership is likely far higher once electricity, hardware, and personnel are factored in. The narrative around compute efficiency also matters strategically, especially for Chinese companies operating under export controls, as it positions them competitively against larger firms like Meta. Overall, while DeepSeek V3 showcases impressive advances, the true costs and resource requirements of developing frontier AI models remain complex and multifaceted.

- DeepSeek V3 outperforms major competitors in challenging evaluations.

- The model's training efficiency is attributed to several innovative techniques.

- Reported training costs do not reflect the total expenses involved in model development.

- The narrative of compute efficiency is vital for competitive positioning in the AI landscape.

- Actual operational costs for AI development are significantly higher than initial training figures suggest.
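
As a rough illustration of why the headline training figure understates total cost, here is a back-of-the-envelope sketch in Python. The GPU-hour total, the $2/GPU-hour rate, and the 2048-GPU cluster size follow the DeepSeek V3 technical report; every other number below (hardware price, experimentation multiplier, staffing) is a purely illustrative assumption, not a reported figure.

# Back-of-the-envelope sketch: headline training cost vs. a rough total cost
# of ownership. Only the GPU-hours, the $2/hour rate, and the cluster size
# come from the technical report; the rest are illustrative assumptions.

H800_GPU_HOURS = 2.788e6   # reported total (pre-training + context extension + post-training)
RENTAL_RATE = 2.0          # $/GPU-hour, the rate assumed in the report's estimate

headline_cost = H800_GPU_HOURS * RENTAL_RATE
print(f"Headline training cost: ${headline_cost / 1e6:.1f}M")   # ~$5.6M

cluster_gpus = 2048                # report describes a 2048-GPU H800 cluster
capex_per_gpu = 30_000             # assumed all-in hardware cost per GPU
experiment_multiplier = 3          # assumed: ablations, failed runs, prior versions
annual_staff_cost = 50e6           # assumed: researchers and infrastructure engineers

rough_tco = (cluster_gpus * capex_per_gpu
             + headline_cost * experiment_multiplier
             + annual_staff_cost)
print(f"Rough total cost of ownership: ${rough_tco / 1e6:.0f}M")   # far above the headline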

3 comments
By @Havoc - 30 days
There is also the minor matter of the Alex Wang claim that they have 50k H100s but can't talk about it for obvious reasons...

Doesn't take away from the outcome, but the cost numbers may not be 100% accurate.

By @raphaelj - 29 days
Do we know which change(s) made DeepSeek V3 so much more efficient than other models?

I went through the paper and, as I understand it, they made these improvements compared to "regular" MoE models (rough illustrative sketches follow at the end of this comment):

1. Multi-head Latent Attention (MLA). If I understand correctly, the keys and values are compressed into a small latent vector that is cached instead of the full per-head keys and values, which greatly shrinks the KV cache at inference. This one is still a little bit confusing to me;

2. A new MoE architecture with one shared expert and a large number of small routed experts (256 in total, of which 8 are active for each token). This was already used in DeepSeek V2;

3. Better load balancing of expert training. During training, they add a bias or "bonus" value to experts that are under-used, making them more likely to be selected in later training steps;

4. They added a few smaller transformer layers that predict not only the next token but one or more additional tokens ahead. The training loss then uses all of these predictions, not just the first one, which is supposed to improve the model's ability to predict sequences of tokens. Note that they don't use this for inference, except for some latency optimisation via speculative decoding of the 2nd token.

5. They are using FP8 instead of FP16 when it does not impact accuracy.

My guess would be that 4) is the most impactful improvement. 1), 2), 3) and 5) could explain why their model trains faster, but not why it performs so much better than models with far more activated parameters (e.g. Llama 3).
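
To make point 1 above a bit more concrete, here is a minimal toy sketch of the KV-compression idea behind Multi-head Latent Attention. It is not DeepSeek's actual implementation (which also handles rotary embeddings and query compression), and all dimensions are made up:

import torch

# Toy MLA idea: instead of caching full per-head keys/values for every token,
# cache one small latent vector per token and re-expand it when attending.
d_model, d_latent, n_heads, d_head = 1024, 64, 8, 128

W_down = torch.randn(d_model, d_latent) / d_model ** 0.5            # compress hidden state
W_up_k = torch.randn(d_latent, n_heads * d_head) / d_latent ** 0.5  # expand latent to keys
W_up_v = torch.randn(d_latent, n_heads * d_head) / d_latent ** 0.5  # expand latent to values

h = torch.randn(16, d_model)                 # hidden states of 16 already-seen tokens

latent_cache = h @ W_down                    # (16, 64): this is all that gets cached
k = latent_cache @ W_up_k                    # keys reconstructed on the fly
v = latent_cache @ W_up_v                    # values reconstructed on the fly

standard_cache = 2 * 16 * n_heads * d_head   # floats a vanilla KV cache would store
print(standard_cache, latent_cache.numel())  # 32768 vs. 1024 floats per layer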
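
Points 2 and 3 combined might look roughly like the toy router below. The expert counts match the paper, but the code is only an illustrative sketch of shared-expert routing plus the bias-based ("auxiliary-loss-free") balancing trick, not DeepSeek's implementation:

import torch

n_routed, top_k, d_model = 256, 8, 1024      # paper's numbers: 256 routed experts, 8 active

router = torch.nn.Linear(d_model, n_routed, bias=False)
expert_bias = torch.zeros(n_routed)          # load-balancing bias, adjusted during training

def route(x):
    # x: (tokens, d_model) -> affinity score for every routed expert
    scores = torch.sigmoid(router(x))
    # The balancing bias influences only WHICH experts get picked...
    topk = torch.topk(scores + expert_bias, top_k, dim=-1).indices
    # ...while the mixing weights still come from the unbiased scores.
    weights = torch.gather(scores, -1, topk)
    return topk, weights / weights.sum(-1, keepdim=True)

def update_bias(topk, step=1e-3):
    # Instead of an auxiliary balancing loss: after each batch, nudge the bias
    # up for under-used experts and down for over-used ones.
    load = torch.bincount(topk.flatten(), minlength=n_routed).float()
    expert_bias.add_(step * torch.sign(load.mean() - load))

topk, weights = route(torch.randn(32, d_model))
update_bias(topk)
# The shared expert is applied to every token on top of the 8 routed experts.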
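
Point 4, the multi-token-prediction objective, boils down to something like the sketch below. In the paper the extra predictions come from additional small transformer modules; here two random logit tensors simply stand in for the two prediction heads:

import torch
import torch.nn.functional as F

vocab, seq = 100, 12
tokens = torch.randint(0, vocab, (1, seq))

main_logits = torch.randn(1, seq, vocab, requires_grad=True)  # predicts token t+1
mtp_logits = torch.randn(1, seq, vocab, requires_grad=True)   # extra head: predicts token t+2

# Usual next-token loss plus a down-weighted loss on the second-next token.
next_loss = F.cross_entropy(main_logits[:, :-1].reshape(-1, vocab), tokens[:, 1:].reshape(-1))
mtp_loss = F.cross_entropy(mtp_logits[:, :-2].reshape(-1, vocab), tokens[:, 2:].reshape(-1))
loss = next_loss + 0.3 * mtp_loss     # 0.3 is an arbitrary illustrative weight
loss.backward()

# At inference the extra head can be dropped, or reused for speculative
# decoding of the 2nd token, as noted above.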
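
And point 5 in miniature: a sketch of why FP8 storage halves memory and bandwidth relative to BF16 at the price of some precision. This needs a recent PyTorch build with float8 dtypes, and it omits the fine-grained scaling DeepSeek describes for preserving accuracy:

import torch

w = torch.randn(4096, 4096)               # "master" weights in full precision
w_bf16 = w.to(torch.bfloat16)             # 2 bytes per element
w_fp8 = w.to(torch.float8_e4m3fn)         # 1 byte per element

print(w_bf16.element_size(), w_fp8.element_size())          # 2 vs. 1 bytes
print((w_bf16.float() - w).abs().mean().item(),             # quantisation error, BF16
      (w_fp8.float() - w).abs().mean().item())              # larger error, often tolerable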

By @sschueller - 29 days
Deli Chen at DeepSeek said they would open source AGI.

https://x.com/victor207755822/status/1882757279436718454?t=0...