DeepSeek V3 and the cost of frontier AI models
DeepSeek AI launched its DeepSeek V3 model, which outperforms competitors like GPT-4o. It features innovative training techniques, but its overall development costs are higher than the reported figure, which affects how its competitive position should be judged.
DeepSeek AI recently launched its DeepSeek V3 model, a mixture-of-experts (MoE) architecture trained on 14.8 trillion tokens, with 671 billion total parameters and 37 billion active parameters. The model has demonstrated strong performance in challenging evaluations, outperforming notable competitors such as GPT-4o and Claude 3.5. Despite these capabilities, user feedback suggests it lacks the polish of models like ChatGPT. The accompanying technical report details the innovations behind its efficiency, including multi-head latent attention and an efficient mixture-of-experts architecture. DeepSeek's training run used significantly fewer GPU hours than comparable models, indicating a high degree of resource optimization. However, the reported training cost does not cover the full scope of expenses, such as prior research and operational overheads, and the actual cost of ownership is likely much higher once electricity and personnel are factored in. The narrative around compute efficiency matters, especially for Chinese companies facing export controls, because it positions them competitively against larger firms like Meta. Overall, while DeepSeek V3 showcases impressive advances, the true costs and resource requirements of developing frontier AI models remain complex and multifaceted.
- DeepSeek V3 outperforms major competitors in challenging evaluations.
- The model's training efficiency is attributed to several innovative techniques.
- Reported training costs do not reflect the total expenses involved in model development.
- The narrative of compute efficiency is vital for competitive positioning in the AI landscape.
- Actual operational costs for AI development are significantly higher than initial training figures suggest.
Related
DeepSeek-V3, ultra-large open-source AI, outperforms Llama and Qwen on launch
Chinese AI startup DeepSeek launched DeepSeek-V3, a 671 billion parameter model outperforming major competitors. It features cost-effective training, innovative architecture, and is available for testing and commercial use.
DeepSeek's new AI model appears to be one of the best 'open' challengers yet
DeepSeek, a Chinese AI firm, launched DeepSeek V3, an open-source model with 671 billion parameters, excelling in text tasks and outperforming competitors, though limited by regulatory constraints.
DeepSeek v3: The Six Million Dollar Model
DeepSeek v3 is an affordable AI model with 37 billion active parameters, showing competitive benchmarks but underperforming in output diversity and coherence. Its real-world effectiveness remains to be evaluated.
Notes on the New Deepseek v3
Deepseek v3, a leading open-source model with 607 billion parameters, excels in reasoning and math tasks, outperforming competitors while being cost-effective, trained on 14.8 trillion tokens for $6 million.
DeepSeek and the Effects of GPU Export Controls
DeepSeek launched its V3 model, trained on 2,048 H800 GPUs for $5.5 million, emphasizing efficiency and innovation due to U.S. export controls, while exploring advancements beyond transformer architectures.
Doesn't take away from the outcome, but the cost numbers may not be 100%
I went through the paper, and as I understand it, these are the improvements they made compared to "regular" MoE models (rough sketches of each follow below):
1. Multi-head Latent Attention (MLA). If I understand correctly, they compress the keys and values into a smaller latent vector that gets cached, which shrinks the KV cache used during attention computation. This one is still a little bit confusing to me;
2. A new MoE architecture with one shared expert and a large number of small routed experts (256 in total, but only 8 active for any given token at inference). This was already used in DeepSeek v2;
3. Better load balancing of the experts during training. They add a bias or "bonus" value to experts that are less used, making them more likely to be selected in future training steps;
4. They added a few small transformer layers to predict not only the next token but a few additional tokens as well. The training loss then uses all of these predicted tokens, not only the first one. This is supposed to improve the model's ability to predict sequences of tokens. Note that they don't use this at inference, except for a latency optimisation that speculatively decodes the second token.
5. They are using FP8 instead of FP16 when it does not impact accuracy.
My guess would be that 4) is the most impactful improvement. 1), 2), 3) and 5) could explain why their model trains faster, but not why it performs so much better than models with far more activated parameters (e.g. Llama 3).
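Below is a minimal sketch of how I read item 1): instead of caching full per-head keys and values, each token's hidden state is compressed into a small latent vector, that latent is what gets cached, and keys/values are reconstructed from it when attention is computed. All dimensions, weight names, and the single-step decode loop here are illustrative stand-ins, not DeepSeek's actual configuration.

```python
import torch

torch.manual_seed(0)

d_model, n_heads, d_head, d_latent = 512, 8, 64, 128

W_down_kv = torch.randn(d_model, d_latent) / d_model ** 0.5         # compress hidden state -> small latent
W_up_k = torch.randn(d_latent, n_heads * d_head) / d_latent ** 0.5  # latent -> per-head keys
W_up_v = torch.randn(d_latent, n_heads * d_head) / d_latent ** 0.5  # latent -> per-head values
W_q = torch.randn(d_model, n_heads * d_head) / d_model ** 0.5       # queries computed as usual

latent_cache = []  # one small vector per past token, instead of full K/V tensors

def decode_step(x_t: torch.Tensor) -> torch.Tensor:
    """x_t: (d_model,) hidden state of the newest token; returns its attention output."""
    latent_cache.append(x_t @ W_down_kv)                 # cache only d_latent floats for this token
    latents = torch.stack(latent_cache)                  # (seq, d_latent)
    k = (latents @ W_up_k).view(-1, n_heads, d_head)     # reconstruct keys from the cached latents
    v = (latents @ W_up_v).view(-1, n_heads, d_head)     # reconstruct values likewise
    q = (x_t @ W_q).view(n_heads, d_head)
    scores = torch.einsum("hd,shd->hs", q, k) / d_head ** 0.5
    attn = torch.softmax(scores, dim=-1)
    return torch.einsum("hs,shd->hd", attn, v).reshape(-1)

for _ in range(4):
    out = decode_step(torch.randn(d_model))
print("cache per token:", d_latent, "floats vs", 2 * n_heads * d_head, "for full K/V")
```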
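A toy version of the routing layout in item 2): one always-on shared expert plus a top-k choice among many small routed experts. Sizes are tiny here (16 experts, top-2) so it runs instantly; the paper's 256-expert / top-8 setup has the same shape. The per-token Python loop is for clarity only; a real implementation batches tokens by expert.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

d_model, d_ff, n_routed, top_k = 64, 128, 16, 2

class TinyExpert(nn.Module):
    """A small feed-forward block standing in for one expert."""
    def __init__(self):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
    def forward(self, x):
        return self.ff(x)

shared_expert = TinyExpert()                                   # applied to every token
routed_experts = nn.ModuleList(TinyExpert() for _ in range(n_routed))
router = nn.Linear(d_model, n_routed, bias=False)

def moe_layer(x: torch.Tensor) -> torch.Tensor:
    """x: (tokens, d_model). Each token gets the shared expert plus its top-k routed experts."""
    top_vals, top_idx = router(x).topk(top_k, dim=-1)          # pick top-k experts per token
    weights = F.softmax(top_vals, dim=-1)                      # normalise over the chosen experts
    out = shared_expert(x)
    for t in range(x.shape[0]):                                # per-token loop, for clarity only
        for slot in range(top_k):
            e = int(top_idx[t, slot])
            out[t] = out[t] + weights[t, slot] * routed_experts[e](x[t])
    return out

tokens = torch.randn(5, d_model)
print(moe_layer(tokens).shape)  # -> torch.Size([5, 64])
```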
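A toy run of the bias-based load balancing from item 3): the bias is added to the router scores only when deciding which experts handle a token, and after each step it is nudged up for under-used experts and down for over-used ones. The update rule and constants here are my own simplification of the idea, not the paper's exact recipe.

```python
import torch

torch.manual_seed(0)

n_experts, top_k, update_rate = 8, 2, 0.01
bias = torch.zeros(n_experts)

def route(scores: torch.Tensor) -> torch.Tensor:
    """scores: (tokens, n_experts) raw router logits -> chosen expert ids (tokens, top_k)."""
    _, idx = (scores + bias).topk(top_k, dim=-1)   # the bias only affects expert selection
    return idx

for step in range(200):
    # Deliberately skewed router: earlier experts get systematically higher scores.
    scores = torch.randn(64, n_experts) + torch.linspace(1.0, 0.0, n_experts)
    chosen = route(scores)
    load = torch.bincount(chosen.flatten(), minlength=n_experts).float()
    mean_load = load.mean()
    # Nudge the bias up for under-loaded experts and down for over-loaded ones.
    bias += update_rate * torch.sign(mean_load - load)

print("final per-expert load:", load.tolist())
print("learned bias:", [round(b, 2) for b in bias.tolist()])
```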
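A toy version of the multi-token training loss from item 4): besides the usual next-token head, an extra small module predicts the token one position further out, and both cross-entropy terms are summed into the loss. Everything here (the GRU stand-in for the transformer trunk, the single extra depth, the 0.3 weight) is a placeholder to show the loss structure, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab, d_model = 100, 32
embed = nn.Embedding(vocab, d_model)
trunk = nn.GRU(d_model, d_model, batch_first=True)        # stand-in for the transformer trunk
head_next = nn.Linear(d_model, vocab)                      # predicts token t+1
extra_layer = nn.GRU(d_model, d_model, batch_first=True)   # small extra module
head_next2 = nn.Linear(d_model, vocab)                     # predicts token t+2

tokens = torch.randint(0, vocab, (4, 16))                  # (batch, seq)
h, _ = trunk(embed(tokens))

# Main loss: the hidden state at position t predicts token t+1.
logits1 = head_next(h[:, :-1])
loss_main = F.cross_entropy(logits1.reshape(-1, vocab), tokens[:, 1:].reshape(-1))

# Extra loss: a second head, fed through the extra module, predicts token t+2.
h2, _ = extra_layer(h)
logits2 = head_next2(h2[:, :-2])
loss_extra = F.cross_entropy(logits2.reshape(-1, vocab), tokens[:, 2:].reshape(-1))

loss = loss_main + 0.3 * loss_extra   # the weighting factor here is arbitrary
print(float(loss_main), float(loss_extra), float(loss))
```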
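And a small illustration of the FP8 idea in item 5): store tensor tiles in FP8 with one scale per tile and check that the round-trip error stays small. This needs a recent PyTorch build with float8 dtypes; the tile size and the round-trip check are mine, and the paper's actual recipe (which kernels run in FP8, accumulation precision, and so on) is more involved.

```python
import torch

torch.manual_seed(0)

def quantize_fp8_tiles(x: torch.Tensor, tile: int = 128):
    """x: 2-D float tensor with sides divisible by `tile`. Returns FP8 data plus per-tile scales."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max           # 448 for e4m3
    x_q = torch.empty_like(x, dtype=torch.float8_e4m3fn)
    scales = torch.empty(x.shape[0] // tile, x.shape[1] // tile)
    for i in range(0, x.shape[0], tile):
        for j in range(0, x.shape[1], tile):
            block = x[i:i + tile, j:j + tile]
            s = block.abs().max().clamp(min=1e-12) / fp8_max  # one scale per tile
            x_q[i:i + tile, j:j + tile] = (block / s).to(torch.float8_e4m3fn)
            scales[i // tile, j // tile] = s
    return x_q, scales

x = torch.randn(256, 256)
x_q, scales = quantize_fp8_tiles(x, tile=128)
x_back = x_q.to(torch.float32) * scales.repeat_interleave(128, 0).repeat_interleave(128, 1)
print("max round-trip error:", float((x - x_back).abs().max()))
```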
https://x.com/victor207755822/status/1882757279436718454?t=0...