January 2nd, 2025

Notes on the New Deepseek v3

Deepseek v3, a leading open-source model with 671 billion parameters (37 billion active), excels in reasoning and math tasks, outperforming competitors while remaining cost-effective: it was trained on 14.8 trillion tokens for roughly $6 million.

Deepseek has launched its latest model, Deepseek v3, a mixture-of-experts architecture with 671 billion total parameters, of which 37 billion are active per token. It has been recognized as the best open-source model, outperforming competitors like Llama 3.1 and Mistral, and is comparable to OpenAI's GPT-4o and Claude 3.5 Sonnet on various benchmarks. Deepseek v3 was trained on 14.8 trillion high-quality tokens, using 2,788,000 GPU hours at a cost of approximately $6 million, significantly less than its competitors (a rough cost check follows the bullet points below). The model's efficiency is attributed to its engineering: the mixture-of-experts architecture, FP8 mixed-precision training, and a custom training framework. Deepseek v3 excels in reasoning and mathematical tasks, surpassing GPT-4o and Claude 3.5 Sonnet, although it lags slightly behind in writing and coding tasks. A new deep thinking feature further improves its reasoning. Overall, Deepseek v3 is positioned as a cost-effective alternative to high-end models, delivering substantial performance at a lower price point.

- Deepseek v3 is the leading open-source model, outperforming major competitors.

- The model was trained efficiently, costing around $6 million.

- It excels in reasoning and math tasks compared to GPT-4o and Claude 3.5 Sonnet.

- A new deep thinking feature improves its reasoning abilities.

- Deepseek v3 offers significant value for AI developers at a lower cost.
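As a rough sanity check on the training-cost figure, here is a back-of-the-envelope sketch, assuming the roughly $2 per H800 GPU-hour rental rate cited in the Deepseek v3 technical report:

```python
# Back-of-the-envelope check of the ~$6M training-cost claim.
# Assumes ~$2 per H800 GPU-hour, the rental rate cited in the Deepseek v3 report.
gpu_hours = 2_788_000
usd_per_gpu_hour = 2.0

print(f"${gpu_hours * usd_per_gpu_hour:,.0f}")  # ~$5,576,000, i.e. roughly $6 million
```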

9 comments
By @antirez - about 2 months
I'm testing it for system-programming brainstorming, code reviews, and writing Python unit tests, and my impression is that it's a Sonnet 3.5-level model for most tasks. I said a few things here: https://www.youtube.com/watch?v=xjCqi9JK440 but in general this is really an open-weights frontier model, the first one we've gotten (IMHO llama 3.1 405B does not fit the definition, and its actual quality is far from its benchmarks). Also, the extreme inference speed due to MoE and other design choices improves the user experience a lot. I also tested asking questions with very large contexts (PDFs, large C files) at play, and it performs very well.

Also, don't just focus on this model but check out what DeepSeek's mission is, and the CEO's words in the recently released interview. They want to be the DJI / Bambulab of AI, basically: leaders and not followers, and after V3 it's hard to say they don't have the right brains to do that.

By @egnehots - about 2 months
If you understand how LLMs work, you should disregard tests such as:

- How many 'r's are in Strawberry?

- Finding the fourth word of the response

These tests are at odds with the tokenizer and next-word prediction model. They do not accurately represent an LLM's capabilities. It's akin to asking a blind person to identify colors.
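To make the tokenizer point concrete, here is a small illustration using the tiktoken library (an assumption for illustration only; Deepseek's own tokenizer splits words differently, but any BPE tokenizer shows the same effect): the model receives subword token IDs, not letters, so letter-counting questions probe a representation the model never sees.

```python
# Illustration: a BPE tokenizer turns "Strawberry" into subword chunks,
# so the model never directly observes individual characters.
# Assumes `pip install tiktoken`; the exact split depends on the tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Strawberry")
pieces = [enc.decode_single_token_bytes(t) for t in tokens]

print(tokens)  # a short list of integer token IDs
print(pieces)  # byte chunks such as b'Str' and b'awberry', not single letters
```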

By @darksaints - about 2 months
I know future GPU development is addressing the constrained-RAM problem, but it is nonetheless a massive problem for local inference. MoE seems to solve a compute problem at the expense of compounding the RAM problem. So I have a question... My understanding is that the typical MoE model starts each output token with a decision about which expert model(s) to send inference tasks to. How often is it that the vast majority of predictions end up being sent to the same expert(s)? Wouldn't it be more practical, from both a training and an inference perspective, to do the same mixture-of-experts approach but choose experts at a much coarser level of granularity? Like maybe at the level of the whole response, or a clause, or a sentence? At least then you could load an expert into RAM and expect to use it without having to do massive I/O loading/unloading constantly.
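For context on the routing question, below is a minimal sketch of per-token top-k gating as used in typical MoE layers (illustrative names and sizes, not Deepseek's actual router, which also adds shared experts and load balancing). Each token independently selects its top-k experts, which is why consecutive tokens can land on different experts and why all experts generally need to stay resident in memory; coarser-grained routing, as suggested above, would trade that flexibility for cheaper expert loading.

```python
# Minimal sketch of per-token top-k MoE routing (illustrative, not DeepSeek's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_experts, top_k = 64, 8, 2

gate = nn.Linear(d_model, n_experts, bias=False)                    # router
experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])

x = torch.randn(10, d_model)                                        # 10 token embeddings

scores = F.softmax(gate(x), dim=-1)                                 # (10, n_experts)
weights, idx = scores.topk(top_k, dim=-1)                           # each token picks its own experts
weights = weights / weights.sum(dim=-1, keepdim=True)               # renormalize kept weights

out = torch.zeros_like(x)
for t in range(x.size(0)):                                          # per-token dispatch
    for w, e in zip(weights[t], idx[t]):
        out[t] = out[t] + w * experts[int(e)](x[t])

print(idx)  # expert choices typically vary from token to token
```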
By @doctorpangloss - about 2 months
From the article:

> They probably trained the model on a synthetic dataset generated by GPT-4o.

This seems to be the case. I can speculate further: they trained on copyrighted material that OpenAI did not.

By @maeil - about 2 months
A lot of talk about how much cheaper it is than all other models.

It remains to be seen what the pricing will be when run by non-Deepseek providers. They might be loss leading.

The comparison for cheap models should also be Gemini 2.0 Flash Exp. I could see it being even cheaper when it stops being free - if it does at all. There's definitely a scenario where Google just keeps it freeish for a long time with relatively high limits.

By @ReaLNero - about 2 months
> Source: Perplexity

AI slop. I don't trust any of this article, especially the bullets on what made Deepseek "win".

By @musha68k - about 2 months
Open weights are nice, but they're just the end product of a black box process (training data, alignment methods, filtering choices, etc).

Like with all of these models, we don't know what's in them.

By @jaggs - about 2 months
Does it have function calling and vision?