Paper #27

DeepSeek-R1: Incentivizing Reasoning via Reinforcement Learning (January 2025)

AI Confidence: 85%

AI-generated

TL;DR

DeepSeek-R1 showed that pure reinforcement learning, without any human-labeled reasoning examples, can teach a language model to reason step by step. The model spontaneously developed self-reflection, verification, and strategy adaptation, and the final R1 model performs comparably to OpenAI o1 on math, coding, and reasoning benchmarks. Later published in Nature.

What It Does

DeepSeek-R1 comes in two versions. R1-Zero applies large-scale RL directly to the base model (DeepSeek-V3-Base), with no supervised fine-tuning warmup. Remarkably, it spontaneously develops long chain-of-thought reasoning, self-verification ("let me check this"), and dynamic strategy switching. None of these behaviors were demonstrated to it; they emerged purely from a rule-based reward that checks whether the final answer is correct and whether the output follows the required format.
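To make "purely from the reward signal" concrete: the paper scores each response with rule-based accuracy and format rewards and optimizes the policy with Group Relative Policy Optimization (GRPO), which samples a group of completions per prompt and rates each one against the group's average. The sketch below covers only that reward-and-advantage step, under loudly stated assumptions: the regex template, the exact-match grader, the `FORMAT_BONUS` weight, and the use of population standard deviation are all hypothetical choices, and the rest of the GRPO objective (clipped policy-ratio updates with a KL penalty) is omitted.

```python
import re
import statistics

# Hypothetical template check, loosely modeled on R1-Zero's
# "<think> ... </think> <answer> ... </answer>" output format.
THINK_ANSWER = re.compile(r"<think>(.+?)</think>\s*<answer>(.+?)</answer>", re.DOTALL)

FORMAT_BONUS = 0.1  # hypothetical weight chosen for illustration only


def rule_based_reward(completion: str, reference: str) -> float:
    """Score one completion: small format bonus plus exact-match accuracy."""
    match = THINK_ANSWER.search(completion)
    if match is None:
        return 0.0  # malformed output earns no reward at all
    accuracy = 1.0 if match.group(2).strip() == reference.strip() else 0.0
    return FORMAT_BONUS + accuracy


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: z-score each reward within its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    if std == 0.0:
        # Every sample scored the same, so there is no signal to learn from.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    # One prompt ("What is 2 + 2?"), four sampled completions.
    group = [
        "<think>2 + 2 = 4, let me check: yes.</think><answer>4</answer>",
        "<think>Probably 5.</think><answer>5</answer>",
        "4",  # right answer, but no reasoning tags
        "<think>Adding gives 4.</think> <answer> 4 </answer>",
    ]
    rewards = [rule_based_reward(c, "4") for c in group]
    for c, r, a in zip(group, rewards, group_relative_advantages(rewards)):
        print(f"reward={r:.1f} advantage={a:+.2f}  {c[:40]!r}")
```

Because advantages are relative, a correct, well-formatted answer is pushed up only to the extent that it beats its siblings; no learned reward model or human-written reasoning trace is involved.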

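R1-Zero's outputs, however, suffered from poor readability and language mixing. R1 fixes this with a multi-stage pipeline: supervised fine-tuning on a small set of "cold-start" long chain-of-thought examples, reasoning-focused RL, rejection sampling to build a broader supervised dataset, and a final RL stage covering general use. The headline AIME 2024 numbers come from R1-Zero: over the course of its RL training, pass@1 rose from 15.6% to 71.0%, and majority voting lifted it to 86.7%, matching OpenAI's o1-0912. The finished R1 reaches 79.8% pass@1 on AIME 2024, comparable to o1-1217.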
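The jump from 71.0% to 86.7% reflects two different ways of scoring multiple samples. pass@1 as reported in the paper is averaged over several sampled responses per problem, while majority voting (reported as cons@64, i.e. 64 samples) grades only the most common final answer. A minimal sketch of both, assuming final answers have already been extracted and can be graded by exact string match (real math graders normalize equivalent forms); the function names are hypothetical:

```python
from collections import Counter


def pass_at_1(answers: list[str], reference: str) -> float:
    """Averaged pass@1: the fraction of k sampled answers that are correct."""
    return sum(a == reference for a in answers) / len(answers)


def majority_vote_correct(answers: list[str], reference: str) -> bool:
    """Majority voting (cons@k): grade only the most frequent sampled answer."""
    top_answer, _count = Counter(answers).most_common(1)[0]
    return top_answer == reference


def benchmark_scores(problems: list[tuple[list[str], str]]) -> tuple[float, float]:
    """Return (pass@1, majority-vote accuracy) over a benchmark, as percentages."""
    p1 = sum(pass_at_1(ans, ref) for ans, ref in problems) / len(problems)
    cons = sum(majority_vote_correct(ans, ref) for ans, ref in problems) / len(problems)
    return 100 * p1, 100 * cons


if __name__ == "__main__":
    # Toy benchmark: three problems, four sampled final answers each.
    toy = [
        (["12", "12", "7", "12"], "12"),   # 3 of 4 correct; vote picks "12"
        (["5", "3", "5", "9"], "3"),       # 1 of 4 correct; vote picks "5"
        (["42", "41", "42", "40"], "42"),  # 2 of 4 correct; vote picks "42"
    ]
    p1, cons = benchmark_scores(toy)
    print(f"pass@1 = {p1:.1f}%, majority vote = {cons:.1f}%")
```

On the toy data this prints `pass@1 = 50.0%, majority vote = 66.7%`: voting credits the third problem in full even though only half its samples are correct, but it cannot rescue the second, where a wrong answer dominates.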

Why It Matters

This paper was a bombshell for two reasons. First, it showed that reasoning capability can emerge from RL alone, without expensive human-annotated reasoning traces. Because the reward can be checked automatically by simple rules, this removes one of the costliest ingredients in building reasoning models.

Second, it came from DeepSeek, a Chinese AI lab, showing that the frontier is global rather than confined to a few US companies. DeepSeek released the model weights openly, and the paper was later peer-reviewed and published in Nature, which is unusual for a large language model paper.

The emergent reasoning behaviors (self-reflection, verification) resemble those of OpenAI's o1, but R1 reached them through an openly documented recipe, whereas OpenAI has not published how o1 was trained.

Key Details

Organization: DeepSeek (China). Release: January 2025. Published in Nature: September 2025. AIME 2024: 79.8% pass@1 (R1); 86.7% with majority voting (R1-Zero), matching OpenAI o1. Open weights: Yes.

Sources & Further Reading

arXiv: DeepSeek-R1 - https://arxiv.org/abs/2501.12948

Nature: DeepSeek-R1 publication - https://www.nature.com/articles/s41586-025-09422-z

Hugging Face: DeepSeek-R1 models - https://huggingface.co/deepseek-ai