DeepSeek-R1, a recently released LLM with deep reasoning capabilities, is making waves—reminding me of the early days of ChatGPT.
DeepSeek-R1 has gained rapid popularity thanks to its open-source release, low cost, and performance comparable to OpenAI o1.

DeepSeek-R1 has made powerful LLMs more accessible. Many people, even those with little technical knowledge, have downloaded and explored it for the first time and truly experienced the power of LLMs.
Having reviewed DeepSeek-R1's technical report, I'd like to share some perspectives and insights.
Training Process
Figure 2 shows the training process (a rough code sketch of the pipeline follows the stages below):
Training DeepSeek-R1-Zero (Pure RL Training): It uses reinforcement learning (RL-only) to develop reasoning abilities. During training, the model learns self-verification, reflection, and generates long Chain of Thought (CoT). However, the output lacks readability, often mixing languages and reducing user experience.
Cold Start Fine-Tuning: Stabilizes early RL training, improves readability, and enhances reasoning ability. One source of data comes from selecting and refining the most readable parts of reasoning outputs generated by DeepSeek-R1-Zero.
Reinforcement Learning Optimization: Improves performance in mathematics, programming, science, and logical reasoning. Rule-based rewards ensure reasoning accuracy, while language consistency rewards prevent incoherent outputs.
Rejection Sampling & Supervised Fine-Tuning (SFT): As RL training stabilizes, high-quality samples are selected through rejection sampling and further refined with DeepSeek-V3 scoring to ensure the best reasoning responses. The final dataset consists of 600k reasoning-related samples and 200k additional non-reasoning examples, totaling 800k samples.
Reinforcement Learning for All Scenarios: Ensures the model excels in reasoning while aligning with user preferences (helpfulness & harmlessness). It introduces dual reward signals to optimize both reasoning skills and user experience.
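To make the pipeline easier to follow, here is a minimal Python-style sketch of the stages as I understand them from the report. Every function name (sft, rl, rejection_sample) is a hypothetical placeholder standing in for a full training stage, not a real API, and the details are heavily simplified.

```python
# Hypothetical, simplified sketch of the DeepSeek-R1 training pipeline.
# All functions are placeholders, not real APIs from the report.

def sft(model, dataset):
    """Supervised fine-tuning (placeholder)."""
    return model

def rl(model, prompts, rewards):
    """GRPO-style RL loop driven by the listed reward signals (placeholder)."""
    return model

def rejection_sample(model, prompts, judge):
    """Generate candidates and keep only high-quality ones (placeholder)."""
    return []

def train_deepseek_r1(base_model, prompts, non_reasoning_data):
    # 1. Cold start: a few thousand curated long-CoT samples (partly refined
    #    from DeepSeek-R1-Zero outputs) stabilize early RL training.
    model = sft(base_model, dataset="cold_start_long_cot")

    # 2. Reasoning-oriented RL: rule-based accuracy rewards plus a
    #    language-consistency reward to curb language mixing.
    model = rl(model, prompts, rewards=["accuracy", "language_consistency"])

    # 3. Rejection sampling + SFT: ~600k reasoning samples (filtered with
    #    DeepSeek-V3 scoring) mixed with ~200k non-reasoning samples.
    reasoning_data = rejection_sample(model, prompts, judge="DeepSeek-V3")
    model = sft(base_model, dataset=reasoning_data + non_reasoning_data)

    # 4. RL for all scenarios: reasoning rewards combined with preference
    #    rewards for helpfulness and harmlessness.
    return rl(model, prompts, rewards=["accuracy", "helpfulness", "harmlessness"])
```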
In my opinion, the core contribution of this process lies in improving LLM reasoning capabilities without relying on large-scale human-labeled data.
DeepSeek-R1 boasts a parameter count of 671 billion. I'm curious whether its training methodology (Cold Start + Supervised Fine-Tuning + RL + Rule-Based Reward) can be effectively applied to smaller models, such as those with 32 billion or even 10 billion parameters.
Before publishing this article, I came across a replication using a 3B model, though it only reproduced the Countdown Game task. Interested readers can check it out.
In addition, during DeepSeek-R1-Zero’s training, the model spontaneously developed advanced reasoning behaviors like reflection and self-verification. This suggests that reinforcement learning may have the potential for self-evolution.
It reminds me of AlphaGo Zero, but as far as I know, this is the first time such capabilities have emerged in an LLM task. If this approach is further refined, LLMs could eventually reach a stage of self-learning, where they optimize their reasoning strategies independently, without human intervention.
The Contribution of DeepSeek-R1-Zero
At first glance, the first step in the training process—DeepSeek-R1-Zero—seems somewhat disconnected from the later stages.
While DeepSeek-R1-Zero is not a strict prerequisite for training DeepSeek-R1, it played a crucial role in shaping its development and optimizing the training process:
DeepSeek-R1-Zero was trained purely through reinforcement learning (RL-only), proving for the first time that a language model can develop strong reasoning abilities without supervised fine-tuning (SFT). This experiment provided valuable insights that led DeepSeek-R1 to adopt an RL-driven optimization strategy during training.
Some of the cold start data for DeepSeek-R1 came from high-quality reasoning paths generated by DeepSeek-R1-Zero. These data points helped the model achieve faster convergence in the early stages of RL training.
DeepSeek-R1-Zero also revealed key challenges, such as poor readability and language mixing. To address these issues, DeepSeek-R1 implemented stricter rejection sampling and language consistency rewards, ensuring clearer and more reliable reasoning.
Monte Carlo Tree Search (MCTS)
Inspired by AlphaGo and AlphaGo Zero, Monte Carlo Tree Search (MCTS) was explored as a way to improve the scalability of reasoning at test time. The core idea is to break answers down into smaller parts, allowing the model to systematically explore the solution space. Here's how it works (a minimal sketch follows the list):
Prompts guide the model to generate multiple tags, each representing a specific reasoning step needed for the search.
During training, collected prompts are used alongside a pre-trained value model to help MCTS find the right answers.
The resulting question-answer pairs (QA pairs) are then used to train both the actor model and the value model, refining the process through iterative optimization.
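To ground these steps, here is a minimal, generic MCTS loop over reasoning steps. It illustrates the general technique rather than DeepSeek-R1's actual implementation; propose_steps and value_model are placeholders for the actor LLM and the pre-trained value model.

```python
import math
import random

# Generic MCTS over reasoning steps — an illustration of the idea above,
# not DeepSeek-R1's implementation. propose_steps / value_model are stubs.

class Node:
    def __init__(self, state, parent=None):
        self.state = state          # partial chain of thought (list of steps)
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value_sum = 0.0

def uct(node, c=1.4):
    """Upper-confidence score balancing exploitation and exploration."""
    if node.visits == 0:
        return float("inf")
    return (node.value_sum / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def propose_steps(state, k=3):
    # Placeholder: the actor LLM would generate k candidate next reasoning steps.
    return [state + [f"step_{len(state)}_{i}"] for i in range(k)]

def value_model(state):
    # Placeholder: a pre-trained value model scores a partial reasoning path.
    return random.random()

def mcts(root_state, simulations=100):
    root = Node(root_state)
    for _ in range(simulations):
        # 1. Selection: descend the tree by UCT until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=uct)
        # 2. Expansion: ask the actor model for candidate next steps.
        node.children = [Node(s, parent=node) for s in propose_steps(node.state)]
        # 3. Evaluation: score one child with the value model.
        child = random.choice(node.children)
        value = value_model(child.state)
        # 4. Backpropagation: update statistics up to the root.
        while child is not None:
            child.visits += 1
            child.value_sum += value
            child = child.parent
    # Return the most-visited continuation as the chosen reasoning path.
    return max(root.children, key=lambda n: n.visits).state

print(mcts(root_state=["restate the problem"]))
```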
DeepSeek-R1 found that MCTS struggled to converge due to an overly large search space, an unreliable value model, and a weak path-filtering mechanism. As a result, it failed to improve the model’s reasoning ability.
This reminds me of the MCTS approach used in rStar, as shown in Figure 3.

Why did rStar succeed where DeepSeek-R1 struggled? I think there are three key reasons:
A well-designed reasoning action space: Traditional MCTS searches only for the next reasoning step, while rStar allows for more human-like reasoning behaviors. These include rephrasing the question, generating sub-questions, and more, making the search process more efficient.
Mutual Reasoning Consistency: rStar introduces a second small language model as a discriminator. Instead of directly filling in incomplete reasoning prompts, the discriminator receives a partially hidden reasoning path and tries to infer the missing parts. A reasoning path is only kept if both the main model and the discriminator reach the same conclusion based on the same clues, ensuring more reliable answers (see the sketch after this list).
No reliance on a reward model: DeepSeek-R1’s MCTS depends on a pre-trained value model for scoring, which has limitations in LLM-related tasks. In contrast, rStar uses a second model for unsupervised validation, avoiding biases and overfitting that reward models might introduce.
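Below is a toy sketch of the mutual consistency check as I understand it from the rStar paper. discriminator_complete is a stub standing in for the second small language model, and the answer-extraction logic is deliberately simplified.

```python
# Toy sketch of rStar-style mutual reasoning consistency: the discriminator sees
# only a prefix of the candidate reasoning path and must independently reach the
# same final answer. discriminator_complete is a placeholder, not rStar's API.

def discriminator_complete(question: str, visible_steps: list[str]) -> list[str]:
    """Placeholder: the second SLM continues the partially hidden reasoning path."""
    return visible_steps + ["(discriminator's continuation)", "Answer: 42"]

def final_answer(steps: list[str]) -> str:
    """Take the last 'Answer:' line as the conclusion of a reasoning path."""
    answers = [s for s in steps if s.startswith("Answer:")]
    return answers[-1] if answers else ""

def mutually_consistent(question: str, candidate_steps: list[str],
                        keep_ratio: float = 0.5) -> bool:
    """Keep a path only if both models reach the same conclusion from the same clues."""
    cut = max(1, int(len(candidate_steps) * keep_ratio))
    completed = discriminator_complete(question, candidate_steps[:cut])
    return final_answer(completed) == final_answer(candidate_steps)

candidate = ["Decompose the question", "Solve each sub-question", "Answer: 42"]
print(mutually_consistent("toy question", candidate))  # True for this toy stub
```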
In my view, MCTS is a highly efficient algorithm: it quickly finds the highest-quality path in a vast space of possible CoT sequences, guided by a process reward model (PRM) and search strategies. MCTS is viable for reasoning tasks, but success depends on a deep understanding of the problem and well-designed selection strategies, value models, or reward mechanisms.
Cold Start
DeepSeek-R1 collected a small set of high-quality CoT data (a few thousand samples) for a cold start. This significantly improves the readability of the model's output and further enhances its reasoning ability. In DeepSeek-R1's training, it also helps improve stability.
DeepSeek-R1 tried several methods to collect cold-start data: using few-shot prompts with a long CoT as an example, prompting models to generate detailed answers with reflection and verification, formatting DeepSeek-R1-Zero outputs for better readability, and refining results through human review and post-processing.
Of course, this feels rather general—the specific sources of cold start data aren’t very clear.
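To make the few-shot route a bit more concrete, here is a hypothetical sketch of how a long-CoT exemplar might be used to prompt a model for cold-start data. The exemplar, the instructions, and the helper function are my own illustration, not the report's actual prompts.

```python
# Hypothetical illustration of "few-shot prompting with a long CoT example"
# for collecting cold start data. The exemplar and wording are assumptions.

LONG_COT_EXAMPLE = """Question: A train travels 120 km in 1.5 hours. What is its average speed?
Reasoning: First, recall that average speed = distance / time.
Distance = 120 km and time = 1.5 h, so speed = 120 / 1.5 = 80 km/h.
Let me verify: 80 km/h * 1.5 h = 120 km, which matches the given distance.
Answer: 80 km/h"""

def build_cold_start_prompt(question: str) -> str:
    """Prepend a long-CoT exemplar and ask for detailed, self-verifying reasoning."""
    return (
        "Answer the question with detailed step-by-step reasoning, "
        "including reflection and verification of the result.\n\n"
        f"{LONG_COT_EXAMPLE}\n\n"
        f"Question: {question}\nReasoning:"
    )

# The resulting prompts would be sent to a strong model (or combined with
# reformatted DeepSeek-R1-Zero outputs), then filtered and cleaned by humans.
print(build_cold_start_prompt("If 3x + 5 = 20, what is x?"))
```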
Reward System
In my previous article, we discussed how reward design plays a crucial role in improving LLM reasoning.
DeepSeek-R1's reward design left a deep impression on me. It primarily uses a rule-based reward system instead of the Process Reward Model (PRM) approach.
The main reason is that defining a fine-grained step is challenging, and even if defined, it's hard to determine whether a step is correct. Automated evaluation lacks precision, and human labeling doesn’t scale well. Moreover, introducing PRM could lead to reward hacking and higher training costs.
This is ultimately a practical choice—at least a rule-based reward ensures accuracy!
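As an illustration, here is a minimal sketch of what such a rule-based reward could look like for a verifiable math task: an accuracy check against a ground-truth answer plus a format check for a think/answer template. The tag names and the 0.1 weighting are my assumptions, not DeepSeek-R1's exact rules.

```python
import re

# Minimal sketch of a rule-based reward: accuracy from answer matching plus a
# smaller format-shaping term. Tags and weights are assumed for illustration.

def format_reward(completion: str) -> float:
    """Reward outputs that follow a <think>...</think><answer>...</answer> template."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, completion, flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Reward completions whose final answer matches a verifiable ground truth."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

def rule_based_reward(completion: str, ground_truth: str) -> float:
    # Accuracy dominates; format acts as a small shaping term (weight assumed).
    return accuracy_reward(completion, ground_truth) + 0.1 * format_reward(completion)

sample = "<think>3x + 5 = 20, so 3x = 15 and x = 5.</think><answer>5</answer>"
print(rule_based_reward(sample, ground_truth="5"))  # -> 1.1
```

The appeal is exactly the practical point above: when the final answer can be verified deterministically, the reward is cheap, precise, and much harder to hack than a learned reward model.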
Distillation or Reinforcement Learning
The DeepSeek-R1 report draws a rather hasty conclusion: distillation works better for small models, while breaking through the boundaries of intelligence may still require more powerful base models and larger-scale reinforcement learning.
However, there was no in-depth exploration or detailed comparison between the distillation approach and the RL approach.