December 26th, 2024

DeepSeek v3 beats Claude Sonnet 3.5 and is way cheaper

DeepSeek-V3 is a 671-billion-parameter language model that excels on benchmarks, particularly math and coding tasks, uses advanced training strategies, and supports a range of hardware for local deployment.

DeepSeek-V3 is a state-of-the-art Mixture-of-Experts (MoE) language model with 671 billion parameters, of which 37 billion are activated for each token. It employs architectures validated in its predecessor, DeepSeek-V2, including Multi-head Latent Attention (MLA) and DeepSeekMoE, and introduces an auxiliary-loss-free strategy for load balancing along with a multi-token prediction training objective.

Pre-training was conducted on 14.8 trillion tokens, followed by supervised fine-tuning and reinforcement learning, producing a model that outperforms other open-source models and rivals leading closed-source models. Training was efficient, requiring only 2.788 million GPU hours, and remained stable with no significant loss spikes.

DeepSeek-V3 excels across benchmarks, particularly on math and coding tasks, and supports context lengths up to 128K tokens. It can be deployed locally on a range of platforms and hardware, including AMD GPUs and Huawei Ascend NPUs. Its performance is further enhanced through knowledge distillation from previous models, improving reasoning capabilities while maintaining control over output style and length.
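The "671 billion parameters, 37 billion activated per token" split is the signature of MoE routing: a small gating network scores the experts, and each token is processed by only its top few. Below is a minimal, generic top-k routing sketch in PyTorch; the layer sizes, expert count, and softmax-over-top-k gating are illustrative defaults, not DeepSeek-V3's actual configuration (which, per the summary above, also balances expert load without an auxiliary loss).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy top-k Mixture-of-Experts layer: a gate scores all experts,
    but each token only runs through its top-k choices, so only a
    fraction of the total parameters is active per token.
    All sizes here are hypothetical, for illustration only."""

    def __init__(self, d_model: int = 64, d_ff: int = 128,
                 n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model)
        scores = self.gate(x)                       # (n_tokens, n_experts)
        top_w, top_i = scores.topk(self.k, dim=-1)  # each token's k experts
        top_w = F.softmax(top_w, dim=-1)            # normalize routing weights
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, slot] == e          # tokens routed to expert e
                if mask.any():
                    w = top_w[mask, slot].unsqueeze(-1)
                    out[mask] += w * expert(x[mask])
        return out

# Example: 10 tokens, each activating only 2 of the 8 experts.
moe = TopKMoE()
y = moe(torch.randn(10, 64))
print(y.shape)  # torch.Size([10, 64])
```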

- DeepSeek-V3 features 671 billion parameters, with 37 billion activated per token.

- It employs advanced training strategies for efficient performance and stability.

- The model outperforms many existing open-source and closed-source models in benchmarks.

- It supports extensive context lengths and is deployable on various hardware.

- Knowledge distillation from previous models enhances its reasoning capabilities while keeping output style and length under control (see the sketch after this list).
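On that last point: knowledge distillation generally means training a student model to match a teacher model's output distribution rather than only the hard labels. Here is the textbook soft-label KL objective as a generic PyTorch sketch; it shows the common formulation, not DeepSeek's specific distillation recipe, and the temperature is a conventional default rather than a published setting.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Soft-label distillation: KL divergence between the teacher's and
    student's temperature-softened distributions. Temperature 2.0 is a
    common default, not a DeepSeek-specific value."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # The t**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * t ** 2

# Example: 4 positions over a hypothetical 100-token vocabulary.
loss = distillation_loss(torch.randn(4, 100), torch.randn(4, 100))
```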

4 comments
By @patrickhogan1 - about 2 months
It does not beat Claude Sonnet 3.5 on SWE-bench (42 vs. Claude's 50). It picks 4 of the hundreds of available benchmarks and then declares that it "beats" Claude Sonnet 3.5.
By @Jet_Xu - about 2 months
Please refer to my recent AI code review performance test, which includes DeepSeek V3: https://news.ycombinator.com/item?id=42547196
By @sam_goody - about 2 months
What are the minimum and recommended amounts of RAM, disk space, and CPU or GPU needed to run this locally?

As someone who just follows this stuff from afar, it is hard for me to tell whether this is a SaaS-only model, or whether it means we are getting to the point where you can have an AI model like this on a local machine.