The Illustrated DeepSeek-R1
DeepSeek-R1 is a new language model that emphasizes reasoning, built with a three-step training process and an architecture mixing dense and mixture-of-experts layers. Readability and language mixing remain challenges for its interim reasoning model even as reasoning capabilities improve.
DeepSeek-R1 is a newly released language model that emphasizes reasoning capabilities, making it a significant advancement in AI development. It is an open-weights model that includes smaller, distilled versions and uses a distinctive training method to elicit reasoning comparable to OpenAI's o1. The training process involves three main steps: first, a language-modeling step that learns to predict the next word from extensive web data; second, a supervised fine-tuning (SFT) step that improves instruction-following and question-answering abilities; and third, a preference-tuning step that aligns the model's behavior with human preferences. Notably, DeepSeek-R1 incorporates long chains of reasoning data generated by an interim reasoning model, which excels at reasoning tasks but is less effective at non-reasoning tasks. That interim model is itself created with large-scale reinforcement learning (RL), which makes it possible to generate a substantial amount of reasoning data efficiently. Despite its strengths, the interim model suffers from readability problems and language mixing; DeepSeek-R1 addresses these issues while maintaining strong reasoning capabilities across various tasks. The architecture consists of 61 transformer decoder blocks, mixing dense and mixture-of-experts layers to enhance performance.
- DeepSeek-R1 focuses on enhancing reasoning capabilities in language models.
- The training process includes language modeling, supervised fine-tuning, and preference tuning.
- It utilizes long chains of reasoning data and large-scale reinforcement learning with rule-based rewards for model training (sketched below).
- The model architecture features a combination of dense and mixture-of-experts layers.
- Challenges such as readability and language mixing remain areas for improvement.
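For the reinforcement-learning step, the R1 report describes simple rule-based rewards rather than a learned reward model: an accuracy reward that checks the final answer, and a format reward that enforces the <think>...</think> template. Below is a minimal sketch, assuming substring matching as a stand-in for the real task-specific checkers (math verifiers, unit tests for code); the exact regex and scoring are illustrative, not the paper's rules.

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the chain of thought sits inside <think>...</think>
    followed by a final answer, else 0.0. (Illustrative regex.)"""
    return 1.0 if re.match(r"<think>.+?</think>\s*\S+", completion, re.DOTALL) else 0.0

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the text after the thinking block contains the reference
    answer. Real checkers are task-specific; substring match is a stand-in."""
    answer_part = completion.split("</think>")[-1]
    return 1.0 if gold_answer.strip() in answer_part else 0.0

completion = "<think>2 + 2 equals 4.</think> The answer is 4."
print(format_reward(completion) + accuracy_reward(completion, "4"))  # -> 2.0
```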
Related
DeepSeek R1
DeepSeek-R1 is a new series of reasoning models utilizing large-scale reinforcement learning, with distilled models that perform strongly on benchmarks. They are open-sourced, available for local use, and licensed under MIT.
DeepSeek-R1-Distill-Qwen-1.5B Surpasses GPT-4o in certain benchmarks
DeepSeek launched its first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1, utilizing large-scale reinforcement learning. The models are open-sourced, with DeepSeek-R1-Distill-Qwen-32B achieving state-of-the-art results.
Notes on the New Deepseek R1
DeepSeek launched the DeepSeek-R1 model, an open-source AI using pure reinforcement learning, which is cheaper and faster than OpenAI's o1, showing strong performance though slightly weaker on complex reasoning tasks.
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL
The paper presents DeepSeek-R1 and DeepSeek-R1-Zero, two reasoning models trained via reinforcement learning, with DeepSeek-R1 addressing the readability issues observed in DeepSeek-R1-Zero. Both models and six distilled versions are open-sourced.
How DeepSeek-R1 Was Built, for Dummies
DeepSeek launched DeepSeek-R1, a reasoning model trained with pure reinforcement learning, achieving performance comparable to OpenAI's o1. It features a cost-effective API and highlights open-source potential in AI.
- Some users express doubt about the model's reasoning abilities and its overall performance compared to previous models.
- There is interest in the technical improvements made in DeepSeek-R1, particularly regarding its training process and architecture.
- Several commenters highlight the significance of the reasoning examples used in training, noting their complexity and cost.
- Concerns are raised about the model's perceived limitations and the hype surrounding it.
- Users discuss the implications of synthetic data in AI development and the challenges that remain in non-reasoning domains.
A particularly popular one: https://jalammar.github.io/illustrated-transformer/
Always very high quality.
I went through the paper, and my understanding is that they made these improvements compared to "regular" MoE models:
1. Multi-head Latent Attention (MLA). If I understand correctly, this lets them compress the keys and values into a smaller latent vector, so the attention cache takes far less memory at inference time. This one is still a little bit confusing to me;
2. New MoE architecture with one shared expert and a large number of small routed experts (256 total, but only 8 active for any given token). This was already used in DeepSeek v2;
3. Better load balancing of the experts during training. They add a bias or "bonus" value to experts that are under-used, making those experts more likely to be selected in future training steps (see the sketch below);
4. They added a few smaller transformer layers that predict not only the next token but a few additional tokens beyond it. The training loss then uses all of these predicted tokens, not just the first one. This is supposed to improve the model's ability to predict sequences of tokens;
5. They use FP8 instead of FP16 where it does not impact accuracy.
It's not clear to me which changes are the most important, but my guess would be that 4) is a critical improvement.
1), 2), 3) and 5) could explain why their model trains faster by some small factor (maybe ~2x), but they explain neither the advertised 10x boost nor how it performs so much better than models with far more activated parameters (e.g. Llama 3).
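To make points 2) and 3) concrete, here is a minimal, hypothetical sketch of a top-k router with the bias-based balancing trick: a per-expert bias shifts only expert *selection*, not the output weights, toward under-used experts. The class and parameter names (BiasBalancedRouter, bias_lr) and the toy sizes are mine, not the paper's; the shared expert, which every token passes through unconditionally, is omitted for brevity.

```python
import torch
import torch.nn as nn

class BiasBalancedRouter(nn.Module):
    """Toy top-k router in the spirit of DeepSeek's MoE (a sketch,
    not the paper's code). Only routed-expert selection is shown."""

    def __init__(self, dim=64, n_routed=256, top_k=8, bias_lr=1e-3):
        super().__init__()
        self.gate = nn.Linear(dim, n_routed, bias=False)
        # Load-balancing bias: used only to pick experts, never to
        # weight their outputs, so it steers load without distorting
        # the mixture itself.
        self.register_buffer("balance_bias", torch.zeros(n_routed))
        self.n_routed, self.top_k, self.bias_lr = n_routed, top_k, bias_lr

    def forward(self, x):
        scores = torch.sigmoid(self.gate(x))              # per-expert affinity
        _, idx = (scores + self.balance_bias).topk(self.top_k, dim=-1)
        weights = scores.gather(-1, idx)                  # weight by raw scores
        return idx, weights / weights.sum(-1, keepdim=True)

    @torch.no_grad()
    def update_bias(self, idx):
        # After each training step: raise the bias of under-loaded experts
        # and lower it for over-loaded ones (the "bonus" described above).
        load = torch.bincount(idx.flatten(), minlength=self.n_routed).float()
        self.balance_bias += self.bias_lr * torch.sign(load.mean() - load)

router = BiasBalancedRouter()
tokens = torch.randn(4, 64)        # 4 token vectors, hidden size 64
idx, w = router(tokens)            # 8 expert indices + mixture weights per token
router.update_bias(idx)            # nudge future routing toward balance
```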
It's like making mortgage-backed securities out of bad mortgages: you never really overcome the badness of the underlying loans, no matter how many layers you pile on top.
I haven't used or studied DeepSeek R1 (or o1) in exhaustive depth, but I guess I'm just not understanding the level of breathless hype right now.
I didn't know the reasoning traces were part of the training data. I thought we basically just told the LLM to "explain its thinking" or something as an intermediate step, but the fact that the 'thinking' is part of the training step makes more sense, and I can see how this improves things in a non-trivial way.
Still not sure if using word tokens as the intermediate "thinking" is the correct or optimal way of doing things, but I don't know. Maybe after everything is compressed into latent space it's essentially the same stuff.
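For anyone else who had the same misconception: here is a hypothetical illustration of what a reasoning-trace training example could look like under the <think> template the paper describes. The question, the trace, and the field names are invented; only the idea that the thinking tokens are part of the target text comes from the paper.

```python
# Hypothetical SFT example (field names and content invented for
# illustration; the <think> tag convention is from the R1 paper).
example = {
    "prompt": "What is 17 * 24?",
    "completion": (
        "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.</think>\n"
        "The answer is 408."
    ),
}
# The loss is computed over the whole completion, so the model is
# trained to *produce* the thinking tokens, not just the final answer.
```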
Are people so upset with the stock market crash that they are flagging it?