January 27th, 2025

The Illustrated DeepSeek-R1

DeepSeek-R1 is a new language model emphasizing reasoning, built with a three-step training process and a distinctive architecture. Its interim reasoning model struggles with readability and language mixing, issues R1 aims to address while strengthening reasoning capabilities.

DeepSeek-R1 is a newly released language model that emphasizes reasoning capabilities, making it a significant advancement in AI development. It is an open-weights model that includes smaller, distilled versions and uses a training method designed to elicit reasoning similar to OpenAI's o1. The training process involves three main steps: first, a language modeling step that predicts the next word using extensive web data; second, a supervised fine-tuning (SFT) step to improve instruction-following and question-answering abilities; and third, a preference tuning step to align the model's behavior with human preferences.

Notably, DeepSeek-R1 incorporates long chain-of-reasoning data generated by an interim reasoning model, which excels at reasoning tasks but is less effective at non-reasoning tasks. Large-scale reinforcement learning (RL) is used to create that interim reasoning model, which in turn generates a substantial amount of reasoning data efficiently. Despite its strengths, the interim model faces challenges such as readability and language mixing, which DeepSeek-R1 aims to address while maintaining strong reasoning capabilities across various tasks.

The architecture consists of 61 transformer decoder blocks, with a mix of dense and mixture-of-experts layers, enhancing its performance.

- DeepSeek-R1 focuses on enhancing reasoning capabilities in language models.

- The training process includes language modeling, supervised fine-tuning, and preference tuning (a schematic code sketch of these stages follows this list).

- It utilizes long chains of reasoning data and large-scale reinforcement learning for model training.

- The model architecture features a combination of dense and mixture-of-experts layers.

- Challenges such as readability and language mixing remain areas for improvement.
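
To make the ordering of these stages concrete, here is a minimal, schematic sketch in Python. The function names, placeholder data, and dictionary bookkeeping are illustrative assumptions, not DeepSeek's actual pipeline; each stage here only records what it consumed, standing in for a real parameter update.

```python
# Schematic sketch of the three-stage training recipe described above.
# All names and data placeholders are assumptions for illustration.

def language_modeling(model: dict, web_corpus: list[str]) -> dict:
    """Stage 1: next-token prediction over large-scale web text (pretraining)."""
    return {**model, "pretraining_docs": len(web_corpus)}

def supervised_fine_tuning(model: dict, instruction_pairs: list[tuple[str, str]]) -> dict:
    """Stage 2: fine-tune on (prompt, response) pairs, including long
    chain-of-thought examples generated by an interim reasoning model."""
    return {**model, "sft_examples": len(instruction_pairs)}

def preference_tuning(model: dict, preference_data: list[tuple[str, str, str]]) -> dict:
    """Stage 3: align behavior with human preferences (and, for R1,
    large-scale RL with reasoning-oriented rewards)."""
    return {**model, "preference_pairs": len(preference_data)}

if __name__ == "__main__":
    model = {"name": "base"}
    model = language_modeling(model, ["web page text ..."])
    model = supervised_fine_tuning(model, [("question", "<think>...</think> answer")])
    model = preference_tuning(model, [("prompt", "preferred reply", "rejected reply")])
    print(model)
```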

AI: What people are saying
The comments on DeepSeek-R1 reflect a mix of skepticism and curiosity about its capabilities and development.
  • Some users express doubt about the model's reasoning abilities and its overall performance compared to previous models.
  • There is interest in the technical improvements made in DeepSeek-R1, particularly regarding its training process and architecture.
  • Several commenters highlight the significance of the reasoning examples used in training, noting their complexity and cost.
  • Concerns are raised about the model's perceived limitations and the hype surrounding it.
  • Users discuss the implications of synthetic data in AI development and the challenges that remain in non-reasoning domains.
14 comments
By @jasonjmcghee - 26 days
For the uninitiated, this is the same author behind the many other "The Illustrated..." blog posts.

A particularly popular one: https://jalammar.github.io/illustrated-transformer/

Always very high quality.

By @raphaelj - 26 days
Do we know which changes made DeepSeek V3 so much faster and better to train than other models? DeepSeek R1's performance seems to be highly related to V3 being a very good model to start with.

I went through the paper and I understood they made these improvements compared to "regular" MoE models:

1. Multi-head Latent Attention (MLA). If I understand correctly, it compresses the keys and values into a smaller latent vector, which shrinks the KV cache needed during inference. This one is still a little bit confusing to me;

2. A new MoE architecture with one shared expert and a large number of small routed experts (256 in total, with 8 active for any given token). This was already used in DeepSeek-V2;

3. Better load balancing of the training of experts. During training, they add a bias or "bonus" value to experts that are used less, making them more likely to be selected in future training steps (points 2 and 3 are sketched in code after this list);

4. They added a few smaller transformer layers to predict not only the next token but a few additional tokens as well. Their training loss then uses all of these predictions, not only the first one. This is supposed to improve the model's ability to predict sequences of tokens;

5. They are using FP8 instead of FP16 when it does not impact accuracy.
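
To make points 2 and 3 concrete, here is a toy routing sketch. The expert counts match the comment above (256 routed experts, 8 active), but the router weights, the bias-update rule, and all names are simplified assumptions rather than DeepSeek's implementation; the key idea is that the bias only influences which experts are selected, not how their outputs are weighted.

```python
# Toy sketch: top-k routing over many small experts plus one always-on shared
# expert, with a bias nudged toward under-used experts. All values are assumptions.
import numpy as np

N_EXPERTS, TOP_K, HIDDEN = 256, 8, 16   # 256 routed experts, 8 active per token
rng = np.random.default_rng(0)

router_w = rng.normal(size=(N_EXPERTS, HIDDEN))  # toy router weights
bias = np.zeros(N_EXPERTS)                       # load-balancing "bonus" (routing only)
load = np.zeros(N_EXPERTS)                       # how often each expert was picked

def route(token_hidden: np.ndarray) -> np.ndarray:
    """Pick TOP_K experts for one token; the bias affects selection, not mixing weights."""
    scores = router_w @ token_hidden
    return np.argsort(scores + bias)[-TOP_K:]

for _ in range(1000):                            # simulate routing a stream of tokens
    token = rng.normal(size=HIDDEN)
    for e in route(token):
        load[e] += 1
    # point 3: raise the bias of under-used experts, lower it for over-used ones
    bias += 0.001 * np.sign(load.mean() - load)

# point 2: the shared expert would be applied to every token regardless of routing
print("busiest expert handled", int(load.max()), "tokens; idlest handled", int(load.min()))
```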

It's not clear to me which changes are the most important, but my guess would be that 4) is a critical improvement.

1), 2), 3) and 5) could explain why their model trains faster by some small factor (roughly 2x), but not the advertised 10x boost, nor why it performs so much better than models with far more activated parameters (e.g. Llama 3).

By @QuadrupleA - 26 days
Am I the only one not that impressed with DeepSeek R1? Its "thinking" seems full of the usual LLM blind spots, and ultimately generating more of it and then summarizing doesn't seem to overcome any real limits.

It's like making mortgage-backed securities out of bad mortgages: you never really overcome the badness of the underlying loans, no matter how many layers you pile on top.

I haven't used or studied DeepSeek R1 (or o1) in exhaustive depth, but I guess I'm just not understanding the level of breathless hype right now.

By @8n4vidtmkvmk - 26 days
> This is a large number of long chain-of-thought reasoning examples (600,000 of them). These are very hard to come by and very expensive to label with humans at this scale. Which is why the process to create them is the second special thing to highlight

I didn't know the reasoning traces were part of the training data. I thought we basically just told the LLM to "explain its thinking" or something as an intermediate step, but the fact that the 'thinking' is part of the training data makes more sense, and I can see how this improves things in a non-trivial way.

Still not sure if using word tokens as the intermediate "thinking" is the correct or optimal way of doing things, but I don't know. Maybe after everything is compressed into latent space it's essentially the same stuff.
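
One concrete way to picture this: the R1 report describes training on examples where the model's reasoning is wrapped in <think> tags and the final answer in <answer> tags. Below is a hedged illustration of what a single long chain-of-thought SFT record might look like; the template wording and the toy arithmetic problem are invented for illustration.

```python
# Hedged illustration of one chain-of-thought training example; the exact prompt
# template is paraphrased, and the problem/solution text is invented.
example = {
    "prompt": "What is 17 * 24? Put your reasoning between <think> tags, "
              "then give the final answer between <answer> tags.",
    "completion": (
        "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.</think>"
        "<answer>408</answer>"
    ),
}

# During SFT the model is trained on prompt + completion as ordinary next-token
# prediction, so the reasoning tokens themselves become part of the training signal.
print(example["completion"])
```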

By @blackeyeblitzar - 26 days
The thing I still don’t understand is how DeepSeek built the base model so cheaply, and why their models seem to think they are GPT-4 when asked. This article says the base model is from their previous paper, but that paper also doesn’t make clear what they trained on. The earlier paper is mostly a description of the optimization techniques they applied. It does mention pretraining on 14.8T tokens with 2.7M H800 GPU hours to produce the base DeepSeek-V3, but what were those tokens? The paper describes the corpus only in vague ways.
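
For a sense of scale, a back-of-the-envelope calculation on the numbers cited in this comment. The per-GPU-hour rental rate is an assumption (the V3 report uses a similar figure for its headline cost estimate), and this counts only the final pretraining run, not research, data, or infrastructure costs.

```python
# Rough scale of the pretraining run cited above; the $/GPU-hour rate is an assumption.
gpu_hours = 2.7e6            # H800 GPU-hours cited for the V3 base model
rate_usd_per_hour = 2.0      # assumed rental price per H800 GPU-hour
tokens = 14.8e12             # pretraining tokens cited above

print(f"compute cost ~ ${gpu_hours * rate_usd_per_hour / 1e6:.1f}M")
print(f"tokens per GPU-hour ~ {tokens / gpu_hours:,.0f}")
```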
By @alecco - 26 days
How is this very high signal-to-noise post off the front page in 2 hours?

Are people so upset with the stock market crash that they are flagging it?

By @whoistraitor - 26 days
It’s remarkable we’ve hit a threshold where so much can be done with synthetic data. The reasoning race seems an utterly solvable problem now (thanks mostly to the verifiability of results). I guess the challenge then becomes non-reasoning domains, where qualitative and truly creative results are desired.
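
A small sketch of what "verifiability of results" buys in practice: for math or code, a rule-based checker can score a completion with no human label and no learned reward model, which is what makes large-scale RL on reasoning tasks cheap. The <answer> tag format and the binary reward here are assumptions for illustration.

```python
# Minimal rule-based reward: check the text inside <answer>...</answer> against a
# known reference answer. Tag format and reward values are assumptions.
import re

def accuracy_reward(completion: str, reference_answer: str) -> float:
    """Return 1.0 if the extracted answer matches the reference, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

print(accuracy_reward("<think>340 + 68 = 408</think><answer>408</answer>", "408"))  # 1.0
print(accuracy_reward("<answer>410</answer>", "408"))                                # 0.0
```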
By @ForOldHack - 24 days
"DeepSeek-R1 is the latest resounding beat in the steady drumroll of AI progress. " IBM's Intellect, from 1983 cost $47,000 dollars a month. Let me know when DeepSleep-Rx exceeds Windows (tm) version numbers or makes a jump like AutoCADs version numbers.
By @distantsounds - 25 days
We all knew the Chinese government was going to censor it. The censoring happening in ChatGPT is arguably more interesting, since OpenAI is not beholden to the US government. I'm more interested in that report.
By @caithrin - 26 days
This is fantastic work, thank you!
By @youssefabdelm - 26 days
The "illustrated"... He needs to read up on Tufte or Bret Victor or something, these are just diagrams with text inside of boxes.