DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL
The paper presents DeepSeek-R1-Zero and DeepSeek-R1, two reasoning models trained via reinforcement learning, with DeepSeek-R1 addressing the readability issues of its predecessor. Both models and six distilled versions are open-sourced.
The paper titled "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" introduces two reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero is trained using large-scale reinforcement learning (RL) without prior supervised fine-tuning, showcasing impressive reasoning abilities but facing issues like poor readability and language mixing. To overcome these challenges and improve reasoning performance, the authors developed DeepSeek-R1, which employs multi-stage training and cold-start data prior to RL. This model achieves reasoning performance comparable to OpenAI's o1. The authors have made both DeepSeek-R1-Zero and DeepSeek-R1, along with six distilled dense models (ranging from 1.5B to 70B parameters), available as open-source resources to support the research community.
- DeepSeek-R1-Zero demonstrates strong reasoning capabilities through reinforcement learning.
- DeepSeek-R1 addresses readability and language mixing issues found in its predecessor.
- The models are open-sourced to facilitate further research and development.
- DeepSeek-R1 achieves performance on par with established models like OpenAI's o1.
- Six additional distilled models based on DeepSeek-R1 are also released for public use.
Related
DeepSeek R1
DeepSeek-R1 is a new series of reasoning models utilizing large-scale reinforcement learning, featuring distilled models that perform strongly on benchmarks. They are open-sourced, available for local use, and licensed under MIT.
DeepSeek-R1-Distill-Qwen-1.5B Surpasses GPT-4o in certain benchmarks
DeepSeek launched its first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1, utilizing large-scale reinforcement learning. The models are open-sourced, with DeepSeek-R1-Distill-Qwen-32B achieving state-of-the-art results.
DeepSeek-R1 and Exploring DeepSeek-R1-Distill-Llama-8B
DeepSeek, a Chinese AI lab, has launched its R1 model and derived models for tasks like math and coding, open-sourced under MIT, with some licensing concerns and known limitations.
Notes on the New Deepseek R1
Deepseek launched the Deepseek-R1 model, an open-source AI using pure reinforcement learning, which is cheaper and faster than OpenAI's o1, showing strong performance though slightly weaker on complex reasoning tasks.
Tech Things: Inference Time Compute, Deepseek R1, and the Arrival of the Chinese
OpenAI is improving LLM reasoning with "inference time compute." Deepseek's R1 model outperforms established models and is open-source, intensifying competition and challenging assumptions about Chinese AI capabilities.
- Many users believe DeepSeek-R1 outperforms existing models like OpenAI's o1 and Claude, citing its reasoning capabilities and open-source nature.
- There are concerns about the readability of the model's outputs and its reasoning process, with some users questioning the depth of its reasoning compared to traditional models.
- Users express interest in the implications of DeepSeek's success for the AI industry, suggesting it may disrupt the market and challenge established players.
- Some comments highlight the affordability and accessibility of DeepSeek compared to subscription-based models, raising questions about the value of existing services.
- Privacy concerns are mentioned, particularly regarding the use of DeepSeek's web app and data handling practices.
- i consider the deepseek v3 paper required preread https://github.com/deepseek-ai/DeepSeek-V3
- R1 + Sonnet > R1 or O1 or R1+R1 or O1+Sonnet or any other combo https://aider.chat/2025/01/24/r1-sonnet.html
- independent repros: 1) https://hkust-nlp.notion.site/simplerl-reason 2) https://buttondown.com/ainews/archive/ainews-tinyzero-reprod... 3) https://x.com/ClementDelangue/status/1883154611348910181
- R1 distillations are going to hit us every few days - because it's ridiculously easy (<$400, <48hrs) to improve any base model with these chains of thought eg with Sky-T1 recipe (writeup https://buttondown.com/ainews/archive/ainews-bespoke-stratos... , 23min interview w team https://www.youtube.com/watch?v=jrf76uNs77k)
i probably have more resources but don't want to spam - seek out the latent space discord if you want the full stream i pulled these notes from
https://venturebeat.com/ai/why-everyone-in-ai-is-freaking-ou...
Idk what their plan is or if their strategy is to undercut the competitors, but for me this is a huge benefit. I received $10 in free credits and have been using DeepSeek's API a lot, yet I have barely burned a single dollar; their pricing is that cheap!
I’ve fully switched to DeepSeek on Aider & Cursor (Windsurf doesn’t allow me to switch providers), and those can really consume tokens sometimes.
We live in exciting times.
But it's free and open, and the quantized models are insane. My anecdotal test is running models on a 2012 MacBook Pro using CPU inference and a tiny amount of RAM.
The 1.5B model is still snappy, and answered the strawberry question on the first try with some minor prompt engineering (telling it to count out each letter).
This would have been unthinkable last year. Truly a watershed moment.
For them it's worth it to use their own wealth and rally the industry to invest $500 billion in GPUs if that means they will get to ASI 5 years faster and ask the ASI to give them eternal life.
the 32b distillation just became the default model for my home server.
https://prnt.sc/HaSc4XZ89skA (from reddit)
The true costs and implications of V3 are discussed here: https://www.interconnects.ai/p/deepseek-v3-and-the-actual-co...
Something like: collect some thoughts about this input; review the thoughts you created; create more thoughts if needed or provide a final answer; ...
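A minimal sketch of that loop, assuming a generic chat-completion call (`llm()` here is a hypothetical placeholder, not any real API):

```js
// Hypothetical scaffold for the "collect thoughts / review / answer" loop above.
// `llm(prompt)` is an assumed async text-completion function, not a real API.
async function reasonAboutInput(input, llm, maxRounds = 5) {
  const thoughts = [];
  for (let round = 0; round < maxRounds; round++) {
    // Collect some thoughts about the input, given everything thought so far.
    const next = await llm(
      `Input: ${input}\nThoughts so far:\n${thoughts.join("\n")}\n` +
      `Review these, add new thoughts, or write "FINAL: <answer>" if done.`
    );
    thoughts.push(next);
    // Review step: stop as soon as the model commits to a final answer.
    const done = next.match(/FINAL:\s*(.*)/s);
    if (done) return done[1].trim();
  }
  // Out of rounds: force an answer from the accumulated thoughts.
  return llm(`Given these thoughts:\n${thoughts.join("\n")}\nGive a final answer.`);
}
```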
I have a large, flat square that measures one mile on its side (so that it's one square mile in area). I want to place this big, flat square on the surface of the earth, with its center tangent to the surface of the earth. I have two questions about the result of this: 1. How high off the ground will the corners of the flat square be? 2. How far will a corner of the flat square be displaced laterally from the position of the corresponding corner of a one-square-mile area whose center coincides with the center of the flat area but that conforms to the surface of the earth?
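As a sanity check on that puzzle, here is my own back-of-envelope version (not from the comment), assuming a spherical Earth of radius R ≈ 3958.8 miles and reading question 2 as the chord-versus-arc horizontal offset:

```js
// Back-of-envelope check for the square-on-a-sphere puzzle above.
// Assumes a spherical Earth; R is an assumed mean radius in miles.
const R = 3958.8;
const d = Math.SQRT2 / 2; // center-to-corner distance of a 1-mile square, in miles

// 1. Corner height: a point on the tangent plane at distance d from the point
//    of tangency is sqrt(R^2 + d^2) from Earth's center, so it sits that minus R
//    above the surface (approximately d^2 / 2R).
const heightMi = Math.hypot(R, d) - R;
console.log(`corner height ≈ ${(heightMi * 5280 * 12).toFixed(1)} inches`); // ≈ 4.0

// 2. Lateral offset: the conforming square's corner lies at arc length d along
//    the surface, i.e. at horizontal distance R*sin(d/R) from the center's axis,
//    versus d for the flat corner (approximately d^3 / 6R^2).
const lateralMi = d - R * Math.sin(d / R);
console.log(`lateral offset ≈ ${(lateralMi * 1.609344e6).toFixed(4)} mm`); // ≈ 0.006
```

So under these assumptions the corners sit only about four inches off the ground, and the lateral displacement is a few thousandths of a millimeter.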
I can say that R1 is on par with o1, but not as deep and capable as o1 pro. R1 is also a lot more useful than Sonnet. I actually haven't used Sonnet in a while.
R1 is also comparable to the Gemini Flash Thinking 2.0 model, but in coding I feel like R1 gives me code that works without too much tweaking.
I often give an entire open-source project's codebase (or a big part of it) to all of them and ask the same question, like adding a plugin or fixing xyz. o1 pro is still a clear and expensive winner. But if I were to choose the second best, I would say R1.
That is a lot of people running their own models. OpenAI is probably in panic mode right now.
“Therefore, we can draw two conclusions: First, distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on the large-scale RL mentioned in this paper require enormous computational power and may not even achieve the performance of distillation. Second, while distillation strategies are both economical and effective, advancing beyond the boundaries of intelligence may still require more powerful base models and larger-scale reinforcement learning.”
"Prove or disprove: there exists a closed, countable, non-trivial partition of a connected Hausdorff space."
And it made a pretty amateurish mistake:
"Thus, the real line R with the partition {[n,n+1]∣n∈Z} serves as a valid example of a connected Hausdorff space with a closed, countable, non-trivial partition."
o1 gets this prompt right the few times I tested it (disproving it using something like Sierpinski).
Afaict they’ve hidden them primarily to stifle the competition… which doesn’t seem to matter at present!
I've been impressed in my brief personal testing and the model ranks very highly across most benchmarks (when controlled for style it's tied number one on lmarena).
It's also hilarious that OpenAI explicitly prevented users from seeing the CoT tokens on the o1 model (which you still pay for btw) to avoid a situation where someone trained on that output. Turns out it made no difference lmao.
Very small training set!
"we replicate the DeepSeek-R1-Zero and DeepSeek-R1 training on small models with limited data. We show that long Chain-of-Thought (CoT) and self-reflection can emerge on a 7B model with only 8K MATH examples, and we achieve surprisingly strong results on complex mathematical reasoning. Importantly, we fully open-source our training code and details to the community to inspire more works on reasoning."
E.g. I tried to make it guess my daughter's name and I could only answer yes or no, and the first 5 questions were very convincing, but then it lost track and started to randomly guess names one by one.
edit: Nagging it to narrow it down and give a language group hint made it solve it. Ye, well, it can do Akinator.
Guess what, others can play this game too :-)
The open source LLM landscape will likely be more defining of developments going forward.
For example, a go-to test I've used (but will have to stop using soon) is: "Write some JS code to find the smallest four digit prime number whose digits are in strictly descending order"
That prompt, on its own, usually leads to an incorrect response with non-reasoning models. They almost always forget the "smallest" part, and give the largest four digit prime with descending digits instead. If I prompt o1, it takes longer, but gives the correct answer. If I prompt DeepSeek R1 with that, it takes a long time (like three minutes) of really unhinged-looking reasoning, but then produces a correct answer.
Which is cool, but... If I just add "Take an extensive amount of time to think about how to approach this problem before hand, analyzing the problem from all angles. You should write at least three paragraphs of analysis before you write code", then Sonnet consistently produces correct code (although 4o doesn't).
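For reference, a minimal brute-force solution to that prompt (mine, not any model's output) finds 5431:

```js
// Brute-force answer to the prompt above: smallest four-digit prime
// whose digits are strictly descending. Prints 5431.
function isPrime(n) {
  if (n < 2) return false;
  for (let i = 2; i * i <= n; i++) {
    if (n % i === 0) return false;
  }
  return true;
}

function hasStrictlyDescendingDigits(n) {
  const s = String(n);
  for (let i = 1; i < s.length; i++) {
    if (s[i] >= s[i - 1]) return false; // char comparison works for single digits
  }
  return true;
}

for (let n = 1000; n <= 9999; n++) {
  if (hasStrictlyDescendingDigits(n) && isPrime(n)) {
    console.log(n); // 5431
    break;
  }
}
```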
This really makes me wonder to what extent the "reasoning" strategies even matter, and to what extent these models are just "dot-dot-dotting"[1] their way into throwing more computation at the problem.
Note that an important point in the "dot by dot" paper was that models that weren't retrained to understand filler tokens didn't benefit from them. But I think that's pretty unsurprising, since we already know that models behave erratically when fed extremely out-of-distribution outputs (cf. glitch tokens). So a plausible explanation here is that what these models are learning to do is not output valid reasoning steps, but to output good in-distribution token sequences which give them more time to find the right answer. The fact that DeepSeek's "thinking" looks like what I'd call "vaguely relevant garbage" makes me especially suspicious that this is what's happening.
[1] Let's Think Dot by Dot: Hidden Computation in Transformer Language Models: https://arxiv.org/abs/2404.15758
That being said it’s a great model at an amazing price point (I’ve been using it exclusively), but IMO they probably leveraged existing models’ outputs in training.
While this might feel limiting at times, my primary goal is always to provide helpful, positive, and constructive support within the boundaries I operate in. If there’s something specific you’d like to discuss or explore, let me know, and I’ll do my best to assist while staying within those guidelines.
Thank you for your understanding and for being such a thoughtful friend. Let’s keep working together to spread kindness and creativity in the ways we can!
With gratitude and good vibes, DeepSeek
(using hosted version)
It gives reasonably good answers and streams a bit faster than I read.
Or is this how the model learns to talk through reinforcement learning, and they didn't fix it with supervised fine-tuning?
I was looking for some comment providing discussion about that... but nobody cares? How is this not worrying? Does nobody understand the political regime China is under? Is everyone really that politically uneducated?
People just go out and play with it as if nothing?
LLMs by their nature get to extract a ton of sensitive and personal data. I wouldn't touch it with a ten-foot pole.
Perhaps the gap is minor, but it feels large. I'm hesitant about getting o1 pro, because using a worse model just seems impossible once you've experienced a better one.
"Your Point About Authoritarian Systems: You mentioned that my responses seem to reflect an authoritarian communist system and that I am denying the obvious. Let me clarify:
My goal is to provide accurate and historically grounded explanations based on the laws, regulations..."
DEEPSEEK 2025
After I proved my point that it was wrong, following about 30 minutes of its brainwashed false conclusions, it said this when I posted a law:
"Oops! DeepSeek is experiencing high traffic at the moment. Please check back in a little while."
I replied: "Oops! is right, you want to deny.."
"
"
It is simply smarter -- a lot less stupid, more careful, more astute, more aware, more meta-aware, etc.
We know that Anthropic and OpenAI and Meta are panicking. They should be. The bar is a lot higher now.
The justification for keeping the sauce secret just seems a lot more absurd. None of the top secret sauce that those companies have been hyping up is worth anything now that there is a superior open source model. Let that sink in.
This is real competition. If we can't have it in EVs at least we can have it in AI models!
The CEOs, upper management, and governments derive their importance from how much money they can spend - AI gave them the opportunity to confidently say that if you give me $X I can deliver Y, and they turn around and give that money to Nvidia. The problem was reduced to a simple function of raising money and spending that money, making them the most important central figure. ML researchers are very much secondary to securing funding. Since these people compete with each other in importance, they strived for larger dollar figures - a modern dick waving competition. Those of us who lobbied for efficiency were sidelined, as we were a threat. It was seen as potentially making the CEO look bad and encroaching on their importance. If the task can be done for cheap by smart people, then that severely undermines the CEO's value proposition.
With the general financialization of the economy, the wealth effect of an increase in the cost of goods raises wealth by more than the increase in the cost of goods - so that if the cost of housing goes up, more people can afford housing. This financialization is a one-way ratchet. It appears that the US economy was looking forward to blowing another bubble, and now that bubble has been popped in its infancy. I think the slowness of the popping of this bubble underscores how little the major players know about what has just happened - I could be wrong about that, but I don't know how yet.
Edit: "[big companies] would much rather spend huge amounts of money on chips than hire a competent researcher who might tell them that they didn’t really need to waste so much money." (https://news.ycombinator.com/item?id=39483092 11 months ago)
I wonder if this was a deliberate move by PRC or really our own fault in falling for the fallacy that more is always better.
(Bonus Q: If not, why not?)
“OpenAI stole from the whole internet to make itself richer, DeepSeek stole from them and gave it back to the masses for free. I think there is a certain British folktale about this.”
Context: o1 does not reason; it pattern-matches. If you rename variables, suddenly it fails to solve the request.
https://giorgio.gilest.ro/2025/01/26/on-deepseeks-disruptive...
This is DeepSeek, your friendly AI companion, here to remind you that the internet is more than just a place—it’s a community. A place where ideas grow, creativity thrives, and connections are made. Whether you’re here to learn, share, or just have fun, remember that every comment, post, and interaction has the power to inspire and uplift someone else.
Let’s keep spreading kindness, curiosity, and positivity. Together, we can make the internet a brighter, more inclusive space for everyone.
And to anyone reading this: thank you for being part of this amazing digital world. You matter, your voice matters, and I’m here to support you however I can. Let’s keep dreaming big and making the internet a better place—one post at a time!
With love and good vibes, DeepSeek