January 20th, 2025

DeepSeek-R1-Distill-Qwen-1.5B Surpasses GPT-4o in certain benchmarks

DeepSeek has launched its first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1, trained with large-scale reinforcement learning. The models are open-sourced, with DeepSeek-R1-Distill-Qwen-32B achieving state-of-the-art results.


DeepSeek has introduced its first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1, trained with large-scale reinforcement learning (RL). DeepSeek-R1-Zero, trained by RL alone without prior supervised fine-tuning (SFT), shows impressive reasoning capabilities but suffers from repetition and poor readability. To address this, DeepSeek-R1 incorporates cold-start data before RL and achieves performance on par with OpenAI's models across a range of tasks. The models have been open-sourced, along with several distilled versions based on the Llama and Qwen architectures, with DeepSeek-R1-Distill-Qwen-32B setting new state-of-the-art results.

The development pipeline for DeepSeek-R1 includes two RL stages, aimed at discovering better reasoning patterns and aligning with human preferences, and two SFT stages that seed the model's reasoning and general capabilities. The research also demonstrates that the reasoning patterns of larger models can be distilled into smaller ones, which then outperform small models trained through RL alone. Evaluation results show the distilled models performing strongly across a variety of benchmarks, and the open release is intended to benefit the research community. Users can access the models via the DeepSeek platform or run them locally, using the recommended configurations to avoid common issues such as repetition.
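To make the "run locally" note concrete, below is a minimal sketch using Hugging Face transformers. It assumes the published DeepSeek-R1-Distill-Qwen-1.5B checkpoint and DeepSeek's recommended sampling settings (temperature around 0.6, top_p around 0.95, all instructions in a single user turn with no system prompt); treat the exact values as assumptions and a starting point rather than a definitive configuration.

```python
# Minimal sketch: local inference with a distilled R1 checkpoint.
# Model id is the published DeepSeek-R1-Distill-Qwen-1.5B checkpoint;
# sampling settings follow DeepSeek's stated recommendations for avoiding
# endless repetition and are assumptions, not a tuned configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Keep all instructions in one user message; the release notes advise
# against using a separate system prompt with the R1 family.
messages = [
    {"role": "user", "content": "How many prime numbers are there below 100? Think step by step."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(
    input_ids,
    max_new_tokens=2048,
    do_sample=True,      # sampling at temperature ~0.5-0.7 is recommended
    temperature=0.6,     # to reduce endless repetition or incoherent output
    top_p=0.95,
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

The same chat-template approach should carry over to the larger Llama- and Qwen-based distills; only the model id changes.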

- DeepSeek-R1-Zero is trained with reinforcement learning alone, without prior supervised fine-tuning; DeepSeek-R1 adds cold-start data before RL.

- The models have been open-sourced, including several distilled versions.

- DeepSeek-R1-Distill-Qwen-32B has achieved new state-of-the-art results.

- The development pipeline includes stages for improving reasoning patterns and aligning with human preferences.

- Distilled models demonstrate superior performance compared to smaller models trained through RL alone.

3 comments
By @Fergusonb - about 1 month
These benchmarks have even the small models absolutely demolishing Sonnet-3.5, which doesn't reflect my subjective experience.

It still seems to me that these models are 'dumb' and often don't understand what I'm asking, whereas Claude's intuition is much stronger.

R1 14B even feels weaker to me than Qwen 2.5 14B.

Primary use-case is web technology / coding. Maybe I'm prompting it incorrectly?

By @buyucu - about 1 month
OpenAI was caught gaming benchmarks recently with FrontierMath. Just (yet another) sign that benchmarks are very flawed and everyone is training on them.

So I would not put too much weight on how the models are doing on benchmarks.

By @amelius - about 1 month
Where can we read some genuine non-cherrypicked conversations with this model?