LLM : 논문리뷰 : DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

논문리뷰

LLM : 논문리뷰 : DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

AI바라기 2025. 1. 26. 17:35

쉬운 설명:

이 논문은 LLM이 '생각하는 법'을 배우는 과정에 대한 연구입니다. 첫 번째 모델(R1-Zero)은 마치 정답지를 주지 않고 '맞았다/틀렸다' 피드백만 주면서 스스로 문제 푸는 법을 터득하게 한 것과 같습니다. 놀랍게도 이렇게만 해도 꽤 똑똑해졌습니다. 두 번째 모델(R1)은 처음에 약간의 힌트(cold-start data)를 주고, 스스로 연습하게 한(RL) 다음, 잘 푼 문제들만 골라 다시 학습시키고(SFT), 마지막으로 전체적으로 다듬는(final RL) 여러 단계를 거쳐 훨씬 더 똑똑하게 만들었습니다. 마지막으로, 이렇게 똑똑해진 큰 모델(R1)의 '문제 해결 노하우'를 작은 모델들에게 전수(distillation)해주니, 작은 모델들도 매우 뛰어난 성능을 보이게 되었습니다.

용어 설명:

DeepSeek-R1-Zero: Base model에서 시작하여 supervised fine-tuning (SFT) 없이 오직 Reinforcement Learning (RL) 만으로 학습된 모델.
DeepSeek-R1: Cold-start SFT 데이터로 시작하여 multi-stage RL/SFT pipeline을 통해 학습된 주력 모델.
RL (Reinforcement Learning): 명시적인 정답 예시 대신, 원하는 결과(정확한 답변, 좋은 형식 등)에 대해 보상(reward)을 주어 모델을 학습시키는 방식.
SFT (Supervised Fine-Tuning): 특정 입력-출력 쌍 (예: 질문-정답) 데이터셋으로 모델을 학습시키는 방식.
Cold Start Data: R1 pipeline 초기 단계에서 주요 RL 단계 전에 사용되는 소량의 고품질 데이터 (수천 개의 긴 Chain-of-Thought 예시).
Distillation: 더 크고 성능 좋은 모델(teacher, 여기서는 R1)의 출력(reasoning 과정 + 답변)을 모방하도록 작은 모델(student)을 학습시키는 기법.
GRPO (Group Relative Policy Optimization): 논문에서 사용된 특정 RL 알고리즘. 그룹 내 샘플들의 점수를 이용해 베이스라인을 추정하여 학습 효율을 높임 (별도의 critic 모델 불필요).
Rule-based Reward: 수학 문제 정답 확인, 코드 컴파일 성공 여부 등 객관적인 규칙에 기반한 보상.
Format Reward: <think>, </think> 태그 사용 등 특정 출력 형식을 따랐을 때 주어지는 보상.
Language Mixing: 응답 내에서 의도치 않게 여러 언어가 섞여 사용되는 현상.
Aha Moment: 모델이 reasoning 과정 중간에 자발적으로 자신의 접근 방식을 재평가하는, 관찰된 흥미로운 행동 패턴.
Rejection Sampling: 모델로부터 여러 응답을 생성한 후, 원하는 기준(예: 정답)을 만족하는 응답만 선택하여 SFT 학습 데이터를 구성하는 방법.
cons@64 (Consensus @ 64): 64개의 생성된 샘플에 대해 다수결 투표(majority voting)를 적용하여 평가하는 방식.
pass@1 (Pass @ 1): (주로 코딩/수학 문제에서) k개의 샘플을 생성했을 때, 그중 정답이 하나 이상 존재할 확률을 계산하는 평가 지표. 논문에서는 non-zero temperature 샘플링 후 결과를 평균 내어 사용.

DeepSeek-R1 논문 학습 노트

Purpose of the Paper:

기존 연구 한계: 기존의 reasoning 능력 향상 연구(process-based rewards, search algorithms 등)들이 OpenAI의 o1 시리즈 모델만큼의 범용적인 reasoning 성능 및 test-time scaling 효과를 달성하지 못함. 특히 supervised fine-tuning (SFT) 없이 순수 Reinforcement Learning (RL) 만으로 높은 reasoning 능력을 달성할 수 있는지 불분명했음.
새로운 접근 방식:
1. SFT 없이 대규모 RL만으로 LLM의 reasoning 능력을 self-evolution 시킬 수 있는지 검증 (DeepSeek-R1-Zero).
2. DeepSeek-R1-Zero의 한계점(poor readability, language mixing)을 극복하고 성능을 더 높이기 위해, 소량의 high-quality 'cold-start' SFT 데이터와 multi-stage (SFT-RL-SFT-RL) training pipeline을 도입 (DeepSeek-R1).
차별점: SFT 의존도를 최소화하고 RL의 잠재력을 극대화하여 reasoning 능력을 향상시키는 데 초점을 맞춤. 특히, SFT 없이 순수 RL만으로 복잡한 reasoning 행동(self-verification, reflection 등)이 발현됨을 처음으로 (주장) 보여줌. 또한, 체계적인 multi-stage pipeline을 제안하고 그 효과를 입증함.

Key Contributions & Novelty:

Contribution 1: Pure RL for Reasoning (DeepSeek-R1-Zero)
- Base model에 SFT 없이 직접 RL을 적용하여 강력한 reasoning 능력 (self-verification, reflection, long CoT 생성 등)을 갖춘 DeepSeek-R1-Zero 개발.
- Novelty: LLM의 reasoning 능력이 SFT 없이 순수 RL만으로도 효과적으로 향상될 수 있음을 입증한 (주장하는) 첫 open research. 이는 RL을 통한 모델의 자발적 능력 발현 가능성을 시사.
Contribution 2: DeepSeek-R1 Multi-Stage Pipeline
- Cold-start SFT → Reasoning-oriented RL → Rejection Sampling 기반 SFT (+ non-reasoning data) → Final Alignment RL의 4단계 pipeline 제안.
- Novelty: Reasoning 능력을 효과적으로 bootstrap하고 개선하며, 일반화 능력 및 human preference alignment까지 고려한 체계적인 multi-stage SFT/RL 조합 방식 제안.
Contribution 3: Effective Reasoning Distillation
- DeepSeek-R1의 reasoning pattern을 작은 모델(Qwen, Llama)로 성공적으로 distillation함.
- Novelty: 작은 모델에 직접 RL을 적용하는 것보다, 우수한 대형 모델(R1)로부터 distillation하는 것이 reasoning 성능 면에서 더 효과적임을 실험적으로 증명. 이를 통해 강력한 성능의 open-source distilled 모델들을 공개함.

Experimental Highlights:

Datasets & Metrics: AIME 2024 (Pass@1, cons@64), MATH-500 (Pass@1), Codeforces (Percentile, Rating), GPQA Diamond (Pass@1), MMLU (Pass@1), SWE-Bench Verified (Resolved), LiveCodeBench (Pass@1-COT).
Key Results:
- DeepSeek-R1-Zero: AIME 2024에서 Pass@1 71.0%, cons@64 86.7% 달성 (OpenAI-01-0912 수준). 순수 RL의 가능성 입증 (Table 2, Fig 2).
- DeepSeek-R1: 주요 reasoning benchmark (AIME, MATH-500)에서 OpenAI-01-1217과 동등하거나 우수한 성능 달성 (Fig 1, Table 4). 예: AIME Pass@1 79.8% (vs 79.2%), MATH-500 Pass@1 97.3% (vs 96.4%).
- Distillation: Distilled 모델 (DeepSeek-R1-Distill-Qwen-32B)이 직접 RL을 적용한 모델 (DeepSeek-R1-Zero-Qwen-32B) 및 기존 SOTA open-source 모델 (QwQ-32B-Preview) 대비 월등한 성능 향상 (Table 5, 6). 예: AIME Pass@1에서 각각 72.6% vs 47.0% vs 50.0%. Distilled 14B, 32B, 70B 모델들이 open-source reasoning benchmark에서 SOTA 달성.
- Self-evolution & Aha Moment: RL 과정에서 모델(R1-Zero)이 자연스럽게 더 긴 reasoning (longer CoT)을 생성하고 (Fig 3), 문제 해결 중 접근 방식을 재평가하는 'aha moment'와 같은 복잡한 행동이 자발적으로 나타남을 관찰 (Table 3).

Limitations and Future Work:

Limitation: General Capability: DeepSeek-R1이 function calling, complex role-playing, JSON output 등 특정 non-reasoning task에서는 DeepSeek-V3보다 성능이 낮음.
- Future Work: Long CoT reasoning 능력을 이러한 task 향상에 활용하는 방안 탐색.
Limitation: Language Mixing: 영어/중국어 외 다른 언어 처리 시 language mixing 문제가 발생할 수 있음.
- Future Work: 다국어 처리 능력 개선.
Limitation: Prompting Engineering: Prompt에 민감하며, few-shot prompting은 오히려 성능을 저하시킴. Zero-shot prompting 권장.
- Future Work: Prompt 변화에 더 강건한 모델 개발 (간접적).
Limitation: Software Engineering Tasks: 긴 evaluation 시간 때문에 Software Engineering (SE) task에 대한 대규모 RL 적용이 제한적이었고, V3 대비 큰 성능 향상은 없었음.
- Future Work: SE 데이터에 대한 rejection sampling 또는 asynchronous evaluation을 RL 과정에 도입하여 효율성 및 성능 개선.

Overall Summary:

이 논문은 SFT 없이 순수 RL만으로도 LLM의 reasoning 능력을 크게 향상시킬 수 있음(DeepSeek-R1-Zero)을 보여주고, 소량의 cold-start 데이터와 multi-stage RL/SFT pipeline (DeepSeek-R1)을 통해 SOTA 수준의 reasoning 성능을 달성했습니다. 더 나아가, RL로 학습된 reasoning 능력을 작은 모델로 효과적으로 distillation할 수 있음을 입증하며 강력한 open-source 모델들을 공개했습니다. 이 연구는 open-source 모델의 reasoning 능력 한계를 극복하고, RL과 distillation의 새로운 가능성을 제시했다는 점에서 중요한 의미를 갖습니다.

Abstract

우리는 우리의 1세대 reasoning models인 DeepSeek-R1-Zero와 DeepSeek-R1을 소개합니다.

DeepSeek-R1-Zero는 사전 단계로서 supervised fine-tuning (SFT) 없이 large-scale reinforcement learning (RL)을 통해 trained된 model로, 뛰어난 reasoning capabilities를 보여줍니다. RL을 통해 DeepSeek-R1-Zero는 자연스럽게 수많은 강력하고 흥미로운 reasoning behaviors를 나타냅니다. 그러나 이는 좋지 않은 readability 및 language mixing과 같은 어려움에 직면합니다.

이러한 문제들을 해결하고 reasoning performance를 더욱 향상시키기 위해, 우리는 RL 이전에 multi-stage training과 cold-start data를 통합한 DeepSeek-R1을 소개합니다. DeepSeek-R1은 reasoning tasks에서 OpenAI-o1-1217과 비슷한 performance를 달성합니다.

research community를 지원하기 위해, 우리는 DeepSeek-R1-Zero, DeepSeek-R1, 그리고 Qwen과 Llama를 기반으로 DeepSeek-R1에서 distilled된 6개의 dense models (1.5B, 7B, 8B, 14B, 32B, 70B)를 open-source합니다.

Introduction

최근 몇 년간, Large Language Models (LLMs)는 빠른 iteration과 evolution을 겪으며 Artificial General Intelligence (AGI)와의 격차를 점진적으로 줄여나가고 있습니다.

최근에는 post-training이 전체 training pipeline의 중요한 구성 요소로 부상했습니다. 이는 pre-training에 비해 상대적으로 최소한의 computational resources를 요구하면서도 reasoning tasks에서의 정확도를 향상시키고, 사회적 가치와 align하며, 사용자 선호도에 적응하는 것으로 나타났습니다. reasoning capabilities 맥락에서 OpenAI의 o1 시리즈 models는 Chain-of-Thought reasoning process의 길이를 늘려 inference-time scaling을 도입한 최초의 사례였습니다. 이 접근 방식은 mathematics, coding, scientific reasoning과 같은 다양한 reasoning tasks에서 상당한 개선을 이루었습니다. 그러나 효과적인 test-time scaling의 과제는 research community에게 여전히 열려 있는 문제입니다. 여러 이전 연구들에서는 process-based reward models, reinforcement learning, 그리고 Monte Carlo Tree Search 및 Beam Search와 같은 search algorithms 등 다양한 접근 방식을 탐색했습니다. 그러나 이러한 방법 중 어느 것도 OpenAI의 o1 시리즈 models와 비교할 만한 일반적인 reasoning performance를 달성하지 못했습니다.

본 논문에서 우리는 pure reinforcement learning (RL)을 사용하여 language model의 reasoning capabilities를 개선하는 첫걸음을 내딛습니다. 우리의 목표는 supervised data에 의존하지 않고 LLMs가 reasoning capabilities를 개발할 잠재력을 탐색하며, pure RL process를 통한 그들의 self-evolution에 초점을 맞추는 것입니다. 구체적으로, 우리는 DeepSeek-V3-Base를 base model로 사용하고 GRPO를 RL framework로 채택하여 reasoning에서 model performance를 향상시킵니다. training 동안 DeepSeek-R1-Zero는 자연스럽게 수많은 강력하고 흥미로운 reasoning behaviors를 나타냈습니다. 수천 번의 RL steps 이후, DeepSeek-R1-Zero는 reasoning benchmarks에서 뛰어난 performance를 보여줍니다. 예를 들어, AIME 2024에서의 pass@1 score는 15.6%에서 71.0%로 증가했으며, majority voting을 통해 점수는 86.7%로 더욱 향상되어 OpenAI-o1-0912의 performance와 일치합니다.

그러나 DeepSeek-R1-Zero는 좋지 않은 readability 및 language mixing과 같은 어려움에 직면합니다. 이러한 문제들을 해결하고 reasoning performance를 더욱 향상시키기 위해, 우리는 소량의 cold-start data와 multi-stage training pipeline을 통합한 DeepSeek-R1을 소개합니다. 구체적으로, 우리는 수천 개의 cold-start data를 수집하여 DeepSeek-V3-Base model을 fine-tune하는 것으로 시작합니다. 이어서 DeepSeek-R1-Zero와 같이 reasoning-oriented RL을 수행합니다. RL process에서 convergence에 가까워지면, 우리는 RL checkpoint에 rejection sampling을 통해 새로운 SFT data를 생성하고, 이를 writing, factual QA, self-cognition과 같은 domains의 DeepSeek-V3 supervised data와 결합한 다음, DeepSeek-V3-Base model을 retrain합니다. 새로운 data로 fine-tuning한 후, checkpoint는 모든 scenarios의 prompts를 고려하여 추가적인 RL process를 거칩니다. 이러한 단계를 거친 후, 우리는 DeepSeek-R1이라고 하는 checkpoint를 얻었으며, 이는 OpenAI-o1-1217과 동등한 performance를 달성합니다.

우리는 더 나아가 DeepSeek-R1에서 더 작은 dense models로의 distillation을 탐색합니다. Qwen2.5-32B를 base model로 사용할 때, DeepSeek-R1에서의 직접적인 distillation이 해당 model에 RL을 적용하는 것보다 우수한 성능을 보입니다. 이는 더 큰 base models에 의해 발견된 reasoning patterns이 reasoning capabilities 향상에 중요하다는 것을 보여줍니다. 우리는 distilled된 Qwen 및 Llama 시리즈를 open-source합니다. 특히, 우리의 distilled 14B model은 state-of-the-art open-source QwQ-32B-Preview를 큰 margin으로 능가하며, distilled 32B 및 70B models는 dense models 중 reasoning benchmarks에서 새로운 record를 세웠습니다.

1.1. Contributions

Post-Training: Large-Scale Reinforcement Learning on the Base Model

우리는 사전 단계로서 supervised fine-tuning (SFT)에 의존하지 않고 base model에 직접 RL을 적용합니다. 이 접근 방식은 model이 복잡한 문제를 해결하기 위해 chain-of-thought (CoT)를 탐색하도록 하여 DeepSeek-R1-Zero의 개발로 이어집니다. DeepSeek-R1-Zero는 self-verification, reflection, long CoTs 생성과 같은 capabilities를 보여주며 research community에게 중요한 milestone을 제시합니다. 특히, LLMs의 reasoning capabilities가 SFT 없이 순전히 RL을 통해 incentivized될 수 있음을 검증한 최초의 open research입니다. 이 breakthrough는 이 분야의 미래 발전을 위한 길을 열어줍니다.
우리는 DeepSeek-R1을 개발하기 위한 우리의 pipeline을 소개합니다. 이 pipeline은 개선된 reasoning patterns를 발견하고 human preferences와 align하는 것을 목표로 하는 두 개의 RL stages와, model의 reasoning 및 non-reasoning capabilities의 seed 역할을 하는 두 개의 SFT stages를 통합합니다. 우리는 이 pipeline이 더 나은 models를 만드는 데 industry에 도움이 될 것이라고 믿습니다.

Distillation: Smaller Models Can Be Powerful Too

우리는 더 큰 models의 reasoning patterns이 더 작은 models로 distilled될 수 있으며, 작은 models에 RL을 통해 발견된 reasoning patterns보다 더 나은 performance를 가져온다는 것을 보여줍니다. open source DeepSeek-R1과 그 API는 research community가 미래에 더 나은 작은 models를 distill하는 데 도움이 될 것입니다.
DeepSeek-R1에 의해 generated된 reasoning data를 사용하여, 우리는 research community에서 널리 사용되는 여러 dense models를 fine-tuned했습니다. evaluation results는 distilled된 더 작은 dense models가 benchmarks에서 매우 뛰어난 성능을 보인다는 것을 보여줍니다. DeepSeek-R1-Distill-Qwen-7B는 AIME 2024에서 55.5%를 달성하여 QwQ-32B-Preview를 능가합니다. 또한, DeepSeek-R1-Distill-Qwen-32B는 AIME 2024에서 72.6%, MATH-500에서 94.3%, LiveCodeBench에서 57.2%의 score를 기록합니다. 이러한 results는 이전 open-source models를 크게 능가하며 o1-mini와 비교할 만합니다. 우리는 Qwen2.5 및 Llama3 시리즈를 기반으로 distilled된 1.5B, 7B, 8B, 14B, 32B, 70B checkpoints를 community에 open-source합니다.

1.2. Summary of Evaluation Results

Reasoning tasks: (1) DeepSeek-R1은 AIME 2024에서 79.8% Pass@1 score를 달성하여 OpenAI-o1-1217을 약간 능가합니다. MATH-500에서는 97.3%라는 인상적인 score를 달성하여 OpenAI-o1-1217과 동등한 performance를 보이고 다른 models를 크게 능가합니다. (2) coding-related tasks에서 DeepSeek-R1은 code competition tasks에서 expert level을 보여주며, Codeforces에서 2,029 Elo rating을 달성하여 대회 참가자의 96.3%를 능가합니다. engineering-related tasks의 경우, DeepSeek-R1은 DeepSeek-V3보다 약간 더 나은 performance를 보여 real world tasks에서 developers에게 도움이 될 수 있습니다.
Knowledge: MMLU, MMLU-Pro, GPQA Diamond와 같은 benchmarks에서 DeepSeek-R1은 뛰어난 results를 달성하며, MMLU에서 90.8%, MMLU-Pro에서 84.0%, GPQA Diamond에서 71.5%의 scores로 DeepSeek-V3를 크게 능가합니다. 이러한 benchmarks에서 OpenAI-o1-1217보다는 performance가 약간 낮지만, DeepSeek-R1은 다른 closed-source models를 능가하여 educational tasks에서의 competitive edge를 보여줍니다. factual benchmark인 SimpleQA에서 DeepSeek-R1은 DeepSeek-V3를 능가하여 fact-based queries 처리 능력을 보여줍니다. OpenAI-o1이 이 benchmark에서 4o를 능가하는 유사한 trend가 관찰됩니다.
Others: DeepSeek-R1은 creative writing, general question answering, editing, summarization 등 광범위한 tasks에서도 뛰어난 성능을 보입니다. AlpacaEval 2.0에서 인상적인 87.6%의 length-controlled win-rate를 달성하고 ArenaHard에서 92.3%의 win-rate를 기록하여, non-exam-oriented queries를 지능적으로 처리하는 강력한 능력을 보여줍니다. 또한 DeepSeek-R1은 long-context understanding을 요구하는 tasks에서 뛰어난 performance를 보여주며, long-context benchmarks에서 DeepSeek-V3를 상당히 능가합니다.

Introduction 섹션 정리 노트 (AI 연구자용)

핵심 목표: Supervised data 없이 순수 RL(Reinforcement Learning)만으로 LLM의 reasoning capabilities를 향상시키는 가능성 탐색 및 OpenAI o1급 성능 달성.

주요 모델 및 접근법:

DeepSeek-R1-Zero:
- SFT(Supervised Fine-Tuning) 사전 단계 없이 base model (DeepSeek-V3-Base)에 직접 large-scale RL (GRPO framework 사용) 적용.
- 결과: 순수 RL만으로 강력한 reasoning behaviors (self-verification, reflection, long CoT generation 등) 발현 및 AIME 2024 pass@1 71.0% (majority voting 86.7%) 달성 (OpenAI-o1-0912 수준).
- 의의: LLM reasoning 능력 향상이 SFT 없이 순수 RL만으로 가능함을 입증한 최초의 open research.
- 한계: Readability 저하, language mixing 문제 발생.
DeepSeek-R1:
- R1-Zero의 한계 극복 및 성능 추가 향상을 위해 multi-stage training pipeline 도입.
- Pipeline:
  1. 소량의 cold-start data로 DeepSeek-V3-Base SFT.
  2. Reasoning-oriented RL 수행 (R1-Zero 방식).
  3. RL checkpoint에서 rejection sampling으로 새 SFT data 생성 + 기존 supervised data (writing, QA 등) 결합하여 retrain.
  4. 모든 scenario prompt를 고려한 추가 RL 수행.
- 결과: Reasoning tasks에서 OpenAI-o1-1217과 동등(on par)한 performance 달성 (AIME 79.8% Pass@1, MATH-500 97.3%). coding (Codeforces Elo 2,029), knowledge (MMLU 90.8%), 기타 (AlpacaEval 87.6% win-rate) 등 전반적 성능 우수.
Distillation:
- DeepSeek-R1의 reasoning patterns을 더 작은 dense models (Qwen, Llama 기반)로 distillation.
- 결과: 작은 model에 직접 RL을 적용하는 것보다 distillation이 더 효과적임을 입증.
- Open-source: DeepSeek-R1-Zero, DeepSeek-R1 및 distilled models (1.5B, 7B, 8B, 14B, 32B, 70B) 공개. Distilled 14B는 QwQ-32B-Preview 능가, 32B/70B는 dense model reasoning benchmark 신기록 수립. Distilled 7B는 AIME 55.5%, 32B는 AIME 72.6%, MATH-500 94.3%, LiveCodeBench 57.2% 달성 (o1-mini 수준).

핵심 Contribution:

Pure RL만으로 LLM의 reasoning 능력 향상 가능성 입증 (DeepSeek-R1-Zero).
SFT와 RL을 결합한 multi-stage pipeline (DeepSeek-R1)으로 SOTA급 reasoning performance 달성 및 파이프라인 공개.
대형 모델의 reasoning 능력을 소형 모델로 효과적으로 이전(distillation) 가능함을 보이고 SOTA급 open-source distilled models 제공.

쉬운 설명 :

최근 AI 언어 모델(LLM)들이 점점 똑똑해지고 있는데, 특히 '추론(reasoning)' 능력, 즉 복잡한 문제를 논리적으로 풀어가는 능력이 중요해지고 있습니다.

이 논문의 Introduction 섹션에서는 이 추론 능력을 어떻게 더 끌어올릴 수 있을지에 대한 새로운 시도를 소개합니다.

스스로 배우는 AI (DeepSeek-R1-Zero): 보통 AI를 똑똑하게 만들려면 사람이 만든 정답 데이터(supervised data)로 먼저 학습(SFT)시키는데, 여기서는 그걸 아예 안 하고 AI가 스스로 문제 풀이를 시도하고 보상을 받으며 배우는 방식(RL)만 사용해 봤습니다. 놀랍게도 이렇게만 해도 AI가 스스로 검증하고, 깊게 생각하는(long CoT) 등 뛰어난 추론 능력을 보여줬습니다. (AI 연구 분야에서는 이게 가능하다는 걸 처음으로 보여준 셈입니다!) 다만, 이 AI가 내놓는 답변이 사람이 읽기 좀 어렵다는 단점은 있었습니다.
더 똑똑하고 쓰기 편한 AI (DeepSeek-R1): 그래서 첫 번째 AI의 단점을 보완하고 성능을 더 높이기 위해, 약간의 초기 데이터 학습(cold-start SFT)과 스스로 배우는 과정(RL)을 여러 단계로 조합하는 방법(multi-stage training pipeline)을 사용했습니다. ①초기 학습 → ②스스로 학습(RL) → ③학습 결과 바탕으로 데이터 정제해서 다시 학습(SFT) → ④다시 스스로 학습(RL). 이렇게 했더니 세계 최고 수준인 OpenAI의 o1 모델과 맞먹는 추론 성능을 달성했습니다. 수학, 코딩 문제뿐만 아니라 지식 질문 답변 능력도 크게 향상되었습니다.
작지만 강한 AI 만들기 (Distillation): 이렇게 똑똑해진 큰 AI(DeepSeek-R1)가 배운 추론 방식을, 더 작고 가벼운 AI 모델(Qwen, Llama 기반)에게 전수(distillation)해 주었습니다. 그랬더니 작은 AI들도 직접 RL로 배우는 것보다 훨씬 더 추론을 잘하게 되었습니다. 이 작지만 강력한 AI 모델들도 여러 크기(1.5B ~ 70B)로 만들어서 공개(open-source)했습니다.

결론적으로 이 연구는 AI가 스스로 추론 능력을 키울 수 있다는 점을 보여주었고, 체계적인 학습 파이프라인을 통해 최고 수준의 성능을 달성했으며, 큰 AI의 지식을 작은 AI에게 효과적으로 옮기는 방법도 성공적으로 보여주었습니다. 연구자들이 활용할 수 있도록 여러 모델들도 공개했습니다.

2. Approach

2.1. Overview

이전 연구들은 model performance를 향상시키기 위해 대량의 supervised data에 크게 의존해 왔습니다. 본 연구에서는 cold start로서 supervised fine-tuning (SFT)을 사용하지 않더라도 large-scale reinforcement learning (RL)을 통해 reasoning capabilities가 크게 향상될 수 있음을 보여줍니다. 더욱이, 소량의 cold-start data를 포함하면 performance를 더욱 향상시킬 수 있습니다. 다음 섹션에서는 다음을 제시합니다: (1) SFT data 없이 base model에 직접 RL을 적용하는 DeepSeek-R1-Zero, (2) 수천 개의 긴 Chain-of-Thought (CoT) 예제로 fine-tuned된 checkpoint에서 시작하여 RL을 적용하는 DeepSeek-R1, (3) DeepSeek-R1의 reasoning capability를 작은 dense models로 Distill하는 방법.

2.2. DeepSeek-R1-Zero: Reinforcement Learning on the Base Model

Reinforcement learning은 우리의 이전 연구들에서 입증된 바와 같이 reasoning tasks에서 상당한 효과를 보여주었습니다. 그러나 이러한 연구들은 수집하는 데 시간이 많이 소요되는 supervised data에 크게 의존했습니다. 이 섹션에서는 supervised data 없이 LLMs가 reasoning capabilities를 개발할 잠재력을 탐색하며, pure reinforcement learning process를 통한 그들의 self-evolution에 초점을 맞춥니다. 우리는 RL algorithm에 대한 간략한 개요로 시작하여 몇 가지 흥미로운 결과를 제시하며, 이것이 community에 귀중한 통찰력을 제공하기를 바랍니다.

2.2.1. Reinforcement Learning Algorithm

Group Relative Policy Optimization RL의 training costs를 절약하기 위해, 우리는 일반적으로 policy model과 동일한 크기인 critic model을 사용하지 않고 대신 group scores로부터 baseline을 추정하는 Group Relative Policy Optimization (GRPO)을 채택합니다. 구체적으로, 각 question 에 대해 GRPO는 old policy 로부터 outputs 그룹 를 samples하고, 다음 objective를 maximizing하여 policy model 를 optimizes합니다: [ \begin{equation*} \begin{split} \mathcal{J}{G R P O}(\theta) & = \mathbb{E}\left[q \sim P(Q),\left{o{i}\right}{i=1}^{G} \sim \pi{\theta_{o l d}}(O \mid q)\right] \ & \frac{1}{G} \sum_{i=1}^{G}\left(\min \left(\frac{\pi_{\theta}\left(o_{i} \mid q\right)}{\pi_{\theta_{o l d}}\left(o_{i} \mid q\right)} A_{i}, clip \left(\frac{\pi_{\theta}\left(o_{i} \mid q\right)}{\pi_{\theta_{o l d}}\left(o_{i} \mid q\right)}, 1-\varepsilon, 1+\varepsilon\right) A_{i}\right)-\beta \mathbb{D}{K L}\left(\pi{\theta} | \pi_{r e f}\right)\right), \ \mathbb{D}{K L}\left(\pi{\theta} | \pi_{r e f}\right) & =\frac{\pi_{r e f}\left(o_{i} \mid q\right)}{\pi_{\theta}\left(o_{i} \mid q\right)}-\log \frac{\pi_{r e f}\left(o_{i} \mid q\right)}{\pi_{\theta}\left(o_{i} \mid q\right)}-1, \end{split} \end{equation*} ] 여기서 와 는 hyper-parameters이고, 는 각 group 내의 outputs에 해당하는 rewards 그룹 를 사용하여 계산된 advantage입니다: [ A_{i} = \frac{r_{i} - \mathrm{mean}({ r_{1}, r_{2}, \cdots , r_{G} })}{\mathrm{std}({ r_{1}, r_{2}, \cdots , r_{G} })}. ]

Table 1 | DeepSeek-R1-Zero용 Template. prompt는 training 중 특정 reasoning question으로 대체됩니다.

2.2.2. Reward Modeling

reward는 RL의 optimization direction을 결정하는 training signal의 원천입니다. DeepSeek-R1-Zero를 train하기 위해, 우리는 주로 두 가지 유형의 rewards로 구성된 rule-based reward system을 채택합니다:

Accuracy rewards: accuracy reward model은 response가 올바른지 평가합니다. 예를 들어, deterministic results를 가진 math problems의 경우, model은 지정된 format(예: 상자 안)으로 final answer를 제공해야 하며, 이를 통해 정확성에 대한 신뢰할 수 있는 rule-based verification이 가능합니다. 유사하게, LeetCode problems의 경우, compiler를 사용하여 predefined test cases를 기반으로 feedback을 generate할 수 있습니다.
Format rewards: accuracy reward model 외에도, 우리는 model이 ‘<think>’와 ‘</think>’ tags 사이에 thinking process를 넣도록 강제하는 format reward model을 사용합니다.

우리는 DeepSeek-R1-Zero 개발에 outcome 또는 process neural reward model을 적용하지 않았습니다. 왜냐하면 neural reward model이 large-scale reinforcement learning process에서 reward hacking으로 어려움을 겪을 수 있고, reward model을 retraining하는 데 추가적인 training resources가 필요하며 전체 training pipeline을 복잡하게 만들기 때문입니다.

2.2.3. Training Template

DeepSeek-R1-Zero를 train하기 위해, 우리는 base model이 지정된 instructions를 따르도록 안내하는 간단한 template을 설계하는 것으로 시작합니다. Table 1에서 묘사된 바와 같이, 이 template은 DeepSeek-R1-Zero가 먼저 reasoning process를 생성한 다음 final answer를 생성하도록 요구합니다. 우리는 RL process 동안 model의 자연스러운 progression을 정확하게 관찰할 수 있도록, reflective reasoning을 의무화하거나 특정 problem-solving strategies를 장려하는 것과 같은 content-specific biases를 피하면서 우리의 constraints를 이 structural format으로 의도적으로 제한합니다.

2.2.4. Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero

Performance of DeepSeek-R1-Zero Figure 2는 RL training process 전반에 걸친 AIME 2024 benchmark에서의 DeepSeek-R1-Zero의 performance trajectory를 보여줍니다. 그림에서 볼 수 있듯이, DeepSeek-R1-Zero는 RL training이 진행됨에 따라 꾸준하고 일관된 performance 향상을 보여줍니다. 특히, AIME 2024에서의 평균 pass@1 score는 초기 15.6%에서 인상적인 71.0%로 크게 증가하여 OpenAI-o1-0912와 비교할 만한 performance levels에 도달합니다. 이러한 상당한 improvement는 시간이 지남에 따라 model의 performance를 optimizing하는 데 있어 우리 RL algorithm의 efficacy를 강조합니다.

Table 2는 다양한 reasoning-related benchmarks에서 DeepSeek-R1-Zero와 OpenAI의 o1-0912 models 간의 comparative analysis를 제공합니다. findings는 RL이 supervised fine-tuning data 없이도 DeepSeek-R1-Zero가 강력한 reasoning capabilities를 갖도록 empower한다는 것을 보여줍니다. 이는 RL만으로 효과적으로 learn하고 generalize하는 model의 ability를 강조하므로 주목할 만한 achievement입니다. 또한, DeepSeek-R1-Zero의 performance는 majority voting의 application을 통해 더욱 augmented될 수 있습니다. 예를 들어, AIME benchmark에서 majority voting이 사용될 때, DeepSeek-R1-Zero의 performance는 71.0%에서 86.7%로 상승하여 OpenAI-o1-0912의 performance를 초과합니다. majority voting 유무에 관계없이 이러한 경쟁력 있는 performance를 달성하는 DeepSeek-R1-Zero의 능력은 강력한 foundational capabilities와 reasoning tasks에서의 추가적인 advancements를 위한 potential을 강조합니다.

Self-evolution Process of DeepSeek-R1-Zero DeepSeek-R1-Zero의 self-evolution process는 RL이 model이 autonomously reasoning capabilities를 improve하도록 어떻게 drive할 수 있는지에 대한 매혹적인 demonstration입니다. base model에서 직접 RL을 initiating함으로써, 우리는 supervised fine-tuning stage의 influence 없이 model의 progression을 면밀히 monitor할 수 있습니다. 이 approach는 model이 시간이 지남에 따라 어떻게 evolves하는지, 특히 complex reasoning tasks를 handle하는 ability 측면에서 명확한 view를 제공합니다.

Figure 3에서 묘사된 바와 같이, DeepSeek-R1-Zero의 thinking time은 training process 전반에 걸쳐 consistent improvement를 보여줍니다. 이 improvement는 external adjustments의 결과가 아니라 model 내의 intrinsic development입니다. DeepSeek-R1-Zero는 extended test-time computation을 leveraging하여 점진적으로 complex reasoning tasks를 solve하는 ability를 naturally acquires합니다. 이 computation은 수백에서 수천 개의 reasoning tokens를 generating하는 범위에 이르며, model이 thought processes를 더 깊이 explore하고 refine할 수 있도록 합니다.

이 self-evolution의 가장 주목할 만한 측면 중 하나는 test-time computation이 increases함에 따라 sophisticated behaviors가 emergence하는 것입니다. model이 previous steps를 revisiting하고 reevaluates하는 reflection과 같은 Behaviors 및 problem-solving에 대한 alternative approaches의 exploration은 spontaneously arise합니다. 이러한 behaviors는 explicitly programmed된 것이 아니라 reinforcement learning environment와의 model’s interaction의 result로 emerge합니다. 이 spontaneous development는 DeepSeek-R1-Zero의 reasoning capabilities를 significantly enhances하여 더 challenging tasks를 greater efficiency와 accuracy로 tackle할 수 있도록 합니다.

Aha Moment of DeepSeek-R1-Zero DeepSeek-R1-Zero의 training 중에 observed된 특히 intriguing phenomenon은 "aha moment"의 occurrence입니다. 이 moment는 Table 3에 illustrated된 바와 같이 model의 intermediate version에서 발생합니다. 이 phase 동안 DeepSeek-R1-Zero는 initial approach를 reevaluating하여 problem에 더 많은 thinking time을 allocate하는 법을 learns합니다. 이 behavior는 model의 growing reasoning abilities의 testament일 뿐만 아니라 reinforcement learning이 어떻게 unexpected하고 sophisticated outcomes로 이어질 수 있는지 보여주는 captivating example입니다.

이 moment는 model에게 "aha moment"일 뿐만 아니라 그 behavior를 observing하는 researchers에게도 그러합니다. 이는 reinforcement learning의 power와 beauty를 underscores합니다: model에게 problem을 solve하는 방법을 explicitly teaching하는 대신, 우리는 단순히 right incentives를 provide하고, model은 autonomously advanced problem-solving strategies를 develops합니다. "aha moment"는 RL이 artificial systems에서 new levels of intelligence를 unlock할 potential을 강력하게 상기시켜 주며, future에 more autonomous하고 adaptive models를 위한 길을 paving합니다.

Drawback of DeepSeek-R1-Zero DeepSeek-R1-Zero가 strong reasoning capabilities를 exhibits하고 autonomously unexpected하고 powerful reasoning behaviors를 develops하지만, 여러 issues에 직면합니다. 예를 들어, DeepSeek-R1-Zero는 poor readability 및 language mixing과 같은 challenges로 어려움을 겪습니다. reasoning processes를 more readable하게 만들고 open community와 share하기 위해, 우리는 human-friendly cold-start data와 함께 RL을 utilizes하는 method인 DeepSeek-R1을 explore합니다.

2.3. DeepSeek-R1: Reinforcement Learning with Cold Start

DeepSeek-R1-Zero의 promising results에 영감을 받아 두 가지 자연스러운 questions가 발생합니다: 1) 소량의 high-quality data를 cold start로 incorporating하여 reasoning performance를 더욱 improved하거나 convergence를 accelerated할 수 있는가? 2) 명확하고 coherent한 Chains of Thought (CoT)를 produces할 뿐만 아니라 strong general capabilities를 demonstrates하는 user-friendly model을 어떻게 train할 수 있는가? 이러한 questions에 address하기 위해, 우리는 DeepSeek-R1을 train하기 위한 pipeline을 design합니다. pipeline은 다음과 같이 outlined된 네 가지 stages로 consists합니다.

2.3.1. Cold Start

DeepSeek-R1-Zero와 달리, base model로부터의 RL training의 초기 불안정한 cold start phase를 prevent하기 위해, DeepSeek-R1에서는 초기 RL actor로서 model을 fine-tune하기 위해 소량의 long CoT data를 construct하고 collect합니다. 이러한 data를 collect하기 위해, 우리는 여러 approaches를 explored했습니다: long CoT를 example로 사용하는 few-shot prompting, reflection과 verification을 포함한 detailed answers를 generate하도록 models를 directly prompting하는 것, readable format으로 DeepSeek-R1-Zero outputs를 gathering하는 것, 그리고 human annotators에 의한 post-processing을 통해 results를 refining하는 것.

이 연구에서 우리는 RL의 starting point로서 DeepSeek-V3-Base를 fine-tune하기 위해 수천 개의 cold-start data를 collect합니다. DeepSeek-R1-Zero와 비교했을 때, cold start data의 advantages는 다음과 같습니다:

Readability: DeepSeek-R1-Zero의 key limitation은 그 content가 종종 reading에 적합하지 않다는 것입니다. Responses는 multiple languages를 mix하거나 users를 위해 answers를 highlight하는 markdown formatting이 lack할 수 있습니다. 반면, DeepSeek-R1용 cold-start data를 creating할 때, 우리는 각 response 끝에 summary를 includes하고 reader-friendly하지 않은 responses를 filters out하는 readable pattern을 design합니다. 여기서 우리는 output format을 |special_token|<reasoning_process>|special_token|<summary>로 define합니다. 여기서 reasoning_process는 query에 대한 CoT이고, summary는 reasoning results를 summarize하는 데 used됩니다.
Potential: human priors를 사용하여 cold-start data의 pattern을 carefully designing함으로써, 우리는 DeepSeek-R1-Zero 대비 better performance를 observe합니다. 우리는 iterative training이 reasoning models를 위한 better way라고 believe합니다.

2.3.2. Reasoning-oriented Reinforcement Learning

cold start data에 대해 DeepSeek-V3-Base를 fine-tuning한 후, 우리는 DeepSeek-R1-Zero에서 employed된 것과 same large-scale reinforcement learning training process를 apply합니다. 이 phase는 model의 reasoning capabilities를 enhancing하는 데 focuses하며, 특히 coding, mathematics, science, logic reasoning과 같이 well-defined problems와 clear solutions를 involve하는 reasoning-intensive tasks에서 그러합니다. training process 동안, 우리는 CoT가 종종 language mixing을 exhibits하는 것을 observe하며, 특히 RL prompts가 multiple languages를 involve할 때 그렇습니다. language mixing issue를 mitigate하기 위해, 우리는 RL training 동안 language consistency reward를 introduce하며, 이는 CoT에서 target language words의 proportion으로 calculated됩니다. ablation experiments는 이러한 alignment가 model’s performance에서 slight degradation을 results함을 show하지만, 이 reward는 human preferences와 aligns하여 more readable하게 만듭니다. 마지막으로, 우리는 reasoning tasks의 accuracy와 language consistency에 대한 reward를 directly summing하여 final reward를 form합니다. 그런 다음 우리는 fine-tuned model에 대해 reasoning tasks에서 convergence를 achieves할 때까지 RL training을 apply합니다.

2.3.3. Rejection Sampling and Supervised Fine-Tuning

reasoning-oriented RL이 converges하면, 우리는 resulting checkpoint를 utilize하여 subsequent round를 위한 SFT (Supervised Fine-Tuning) data를 collect합니다. 주로 reasoning에 focuses하는 initial cold-start data와 달리, 이 stage는 writing, role-playing 및 기타 general-purpose tasks에서 model’s capabilities를 enhance하기 위해 다른 domains의 data를 incorporates합니다. Specifically, 우리는 아래에 described된 대로 data를 generate하고 model을 fine-tune합니다.

Reasoning data 우리는 reasoning prompts를 curate하고 위의 RL training에서 얻은 checkpoint로부터 rejection sampling을 performing하여 reasoning trajectories를 generate합니다. previous stage에서는 rule-based rewards를 사용하여 evaluated될 수 있는 data만 included했지만, 이 stage에서는 additional data를 incorporating하여 dataset을 expand합니다. 이 중 일부는 ground-truth와 model predictions를 DeepSeek-V3에 feeding하여 judgment를 위한 generative reward model을 use합니다. Additionally, model output이 때때로 chaotic하고 read하기 difficult하기 때문에, 우리는 mixed languages, long parapraphs, code blocks가 있는 chain-of-thought를 filtered out했습니다. 각 prompt에 대해 multiple responses를 sample하고 correct ones만 retain합니다. total로, 우리는 약 600k의 reasoning related training samples를 collect합니다.

Non-Reasoning data writing, factual QA, self-cognition, translation과 같은 non-reasoning data의 경우, 우리는 DeepSeek-V3 pipeline을 adopt하고 DeepSeek-V3의 SFT dataset의 portions를 reuse합니다. 특정 non-reasoning tasks의 경우, prompting을 통해 question에 answering하기 전에 potential chain-of-thought를 generate하기 위해 DeepSeek-V3를 call합니다. However, "hello"와 같은 simpler queries의 경우, response에 CoT를 provide하지 않습니다. end에는 reasoning과 unrelated한 approximately 200k training samples를 total로 collected했습니다.

우리는 위의 curated dataset 약 800k samples를 사용하여 DeepSeek-V3-Base를 two epochs 동안 fine-tune합니다.

2.3.4. Reinforcement Learning for all Scenarios

model을 human preferences와 further align하기 위해, 우리는 model의 helpfulness와 harmlessness를 improving하는 동시에 reasoning capabilities를 simultaneously refining하는 것을 aimed하는 secondary reinforcement learning stage를 implement합니다. Specifically, 우리는 reward signals와 diverse prompt distributions의 combination을 using하여 model을 train합니다. reasoning data의 경우, 우리는 math, code, logical reasoning domains에서 learning process를 guide하기 위해 rule-based rewards를 utilizes하는 DeepSeek-R1-Zero에서 outlined된 methodology를 adhere합니다. general data의 경우, complex하고 nuanced scenarios에서 human preferences를 capture하기 위해 reward models를 resort합니다. 우리는 DeepSeek-V3 pipeline을 build upon하고 preference pairs와 training prompts의 similar distribution을 adopt합니다. helpfulness의 경우, 우리는 final summary에 exclusively focus하여 assessment가 user에 대한 response의 utility와 relevance를 emphasizes하는 동시에 underlying reasoning process와의 interference를 minimizing하도록 ensure합니다. harmlessness의 경우, generation process 동안 arise할 수 있는 any potential risks, biases, or harmful content를 identify하고 mitigate하기 위해 reasoning process와 summary를 including하여 model의 entire response를 evaluate합니다. Ultimately, reward signals와 diverse data distributions의 integration은 helpfulness와 harmlessness를 prioritizing하면서 reasoning에서 excels하는 model을 train할 수 있도록 enables합니다.

2.4. Distillation: Empower Small Models with Reasoning Capability

DeepSeek-R1과 같은 reasoning capabilities를 갖춘 more efficient smaller models를 equip하기 위해, §2.3.3에서 detailed된 바와 같이 DeepSeek-R1으로 curated된 800k samples를 using하여 Qwen과 Llama와 같은 open-source models를 directly fine-tuned했습니다. 우리의 findings는 이 straightforward distillation method가 smaller models의 reasoning abilities를 significantly enhances한다는 것을 indicate합니다. 여기서 use하는 base models는 Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B, Llama-3.1-8B, Llama-3.3-70B-Instruct입니다. 우리는 Llama-3.3의 reasoning capability가 Llama-3.1보다 slightly better하기 때문에 Llama-3.3을 select합니다.

distilled models의 경우, RL을 incorporating하면 model performance를 substantially boost할 수 있음에도 불구하고 SFT만 apply하고 RL stage는 include하지 않습니다. 여기서 우리의 primary goal은 distillation technique의 effectiveness를 demonstrate하는 것이며, RL stage의 exploration은 broader research community에 남겨둡니다.

Approach 섹션 정리 노트 (AI 연구자용)

핵심 방법론: Large-scale RL을 중심으로, SFT cold start 유무에 따른 reasoning 능력 향상 전략 및 소형 모델로의 능력 이전(distillation) 방법 제시.

1. DeepSeek-R1-Zero (Base Model + Pure RL):

목표: SFT 없이 순수 RL만으로 LLM (DeepSeek-V3-Base)의 reasoning 잠재력 탐색 및 self-evolution 관찰.
RL Algorithm: Training cost 절감을 위해 critic model 없는 GRPO (Group Relative Policy Optimization) 사용. 그룹 내 샘플들의 상대적 점수로 baseline 추정. 수식: [ A_{i} = \frac{r_{i} - \mathrm{mean}({ r_{1}, \cdots , r_{G} })}{\mathrm{std}({ r_{1}, \cdots , r_{G} })} ]
Reward Modeling:
- Rule-based Reward System만 사용: Neural reward model의 reward hacking 및 training 복잡성 문제 회피.
- Accuracy Rewards: Math (결과 포맷 검증), Code (컴파일러/test case) 등 deterministic 결과에 대한 자동 평가.
- Format Rewards: <think>...</think> 태그 사용 강제.
Training Template: <think> 과정 + 최종 답변 구조만 강제하는 최소한의 template 사용 (content bias 배제).
주요 결과: RL만으로 long CoT 생성, reflection, self-verification 등 복잡한 reasoning behavior 및 "aha moment" 자발적 출현 확인. AIME pass@1 15.6% → 71.0% 달성.
한계점: Readability 저하, language mixing 발생.

2. DeepSeek-R1 (RL with Cold Start):

목표: R1-Zero 성능 개선, 수렴 가속화, user-friendly CoT 생성 및 general capabilities 확보.
4-Stage Pipeline:
1. Cold Start SFT: RL 초기 불안정성 회피 위해 수천 개의 long CoT data (few-shot, prompting, R1-Zero 결과 정제 등 활용)로 DeepSeek-V3-Base 사전 fine-tuning. Readability 강조 (|special_token|<reasoning_process>|special_token|<summary> 포맷 사용).
2. Reasoning-oriented RL: Cold-start된 모델에 GRPO 적용 (R1-Zero 방식). Reasoning-intensive tasks (math, code, science, logic) 중심. Language consistency reward 추가하여 language mixing 완화 (성능 약간 저하 감수). 최종 reward = accuracy + language consistency.
3. Rejection Sampling & SFT: RL 수렴 후 checkpoint에서 rejection sampling으로 SFT data 생성.
  - Reasoning data (~600k): Rule-based reward + generative reward model (DeepSeek-V3 사용) 평가 데이터 포함. 가독성 낮은 샘플 필터링.
  - Non-Reasoning data (~200k): Writing, QA 등 DeepSeek-V3 SFT 데이터셋 일부 재사용 및 생성.
  - 총 ~800k 데이터로 DeepSeek-V3-Base 2 epochs fine-tuning.
4. RL for all Scenarios: 2차 RL 단계. Human preference (helpfulness, harmlessness) 정렬 및 reasoning 능력 동시 개선 목표.
  - Reasoning data: Rule-based rewards (R1-Zero 방식).
  - General data: Neural reward models (DeepSeek-V3 pipeline 활용). Helpfulness는 final summary만, harmlessness는 전체 response 평가.

3. Distillation:

목표: DeepSeek-R1의 reasoning 능력을 더 작고 효율적인 모델로 이전.
방법: R1의 Stage 3에서 생성된 ~800k SFT 데이터를 사용하여 open-source 모델들 (Qwen2.5-Math-1.5B/7B, Qwen2.5-14B/32B, Llama-3.1-8B, Llama-3.3-70B-Instruct)을 직접 fine-tuning (SFT만 수행).
주요 결과: 추가적인 RL 단계 없이 SFT distillation만으로도 소형 모델의 reasoning 능력이 크게 향상됨을 입증. (RL 단계 추가 탐색은 community 과제로 남김).

쉬운 설명 :

이 섹션에서는 논문에서 제안하는 똑똑한 추론 AI 모델들을 어떻게 만들었는지 구체적인 방법을 설명합니다.

1. 스스로 배우게 한 AI (DeepSeek-R1-Zero):

어떻게? 똑똑한 AI를 만들 때 보통은 정답이 있는 문제집(supervised data)으로 먼저 공부시키는데(SFT), 여기서는 그런 거 없이 바로 AI(DeepSeek-V3-Base)한테 문제 풀이를 시키고 잘하면 상(reward)을 주는 방식(RL)만 사용했어요. 훈련 비용을 아끼려고 GRPO라는 특별한 RL 기술을 썼고요.
상(Reward)은 어떻게? "정답 맞췄니?" (수학/코딩은 자동으로 채점) 그리고 "생각 과정을 <think> 태그 안에 잘 썼니?" 같은 **간단한 규칙(rule-based)**으로만 상을 줬어요. AI 심판(neural reward model)을 쓰면 AI가 상만 받으려고 꼼수를 부릴 수 있어서 일부러 안 썼대요.
결과: 이렇게만 해도 AI가 스스로 복잡하게 생각하고(long CoT), 자기 생각을 검토하는(reflection) 등 놀라운 추론 능력을 보여줬어요. 하지만 가끔 여러 언어를 섞어 쓰거나 읽기 어려운 답변을 내놓는 문제가 있었죠.

2. 체계적으로 가르친 AI (DeepSeek-R1):

어떻게? 첫 번째 AI의 문제점을 해결하고 더 똑똑하게 만들려고, 4단계 교육 과정을 설계했어요.
1. 준비 학습 (Cold Start SFT): 일단 AI에게 읽기 쉽게 잘 쓰인 추론 예시 문제들(long CoT data)을 조금 풀게 해서 기본기를 다졌어요. (<요약>도 쓰는 등 읽기 좋은 형식 강조)
2. 추론 집중 훈련 (Reasoning-oriented RL): 그 다음엔 첫 번째 AI처럼 RL 훈련을 시키는데, 주로 수학/코딩 같은 추론 문제에 집중했어요. 중간에 다른 언어를 섞어 쓰면 약간 감점해서(language consistency reward) 답변을 더 읽기 좋게 만들려고 노력했죠.
3. 지식 확장 및 복습 (Rejection Sampling & SFT): 두 번째 훈련을 마친 AI한테 문제 풀이를 더 시켜서 좋은 답변 데이터(~600k개 추론 + ~200k개 일반 상식)를 많이 만들었어요. 이 데이터로 다시 처음의 기본 AI를 공부시켰죠(SFT).
4. 최종 다듬기 (RL for all Scenarios): 마지막으로 한 번 더 RL 훈련을 시켰어요. 추론 문제는 규칙 기반으로 상을 주고, 일반적인 대화는 AI 심판(reward model)을 써서 답변이 유용하고 안전한지(helpfulness, harmlessness) 확인하며 다듬었어요.

3. 작은 AI에게 지식 전수 (Distillation):

어떻게? DeepSeek-R1을 만들 때 썼던 좋은 데이터(~800k개)를 가지고, 이미 공개된 더 작은 AI 모델들(Qwen, Llama)을 학습(SFT)시켰어요.
결과: 이렇게 좋은 예시 문제들만 보여주는 것(SFT)만으로도 작은 AI들의 추론 능력이 크게 향상되는 걸 확인했어요. (여기서는 추가적인 RL 훈련은 안 시키고, 다른 연구자들이 해보도록 남겨뒀어요.)

요약하면, RL만으로도 AI 추론 능력을 키울 수 있고 (R1-Zero), 체계적인 SFT+RL 파이프라인으로 더 성능 좋고 쓰기 편한 AI를 만들 수 있으며 (R1), **큰 AI의 지식을 작은 AI에게 효과적으로 가르칠 수 있다 (Distillation)**는 것을 구체적인 방법으로 보여준 섹션입니다.

r이 보상인데 보상은 어떻게 주냐하면

정확성 보상 (Accuracy rewards): 예를 들어, 수학 문제의 경우 최종 답이 미리 정해진 형식(예: 박스 안)으로 나오고, 그 답이 실제 정답과 일치하는지를 규칙으로 확인하여 보상을 줍니다 (맞으면 높은 보상, 틀리면 낮은 보상). 코딩 문제의 경우 컴파일러나 미리 정의된 테스트 케이스를 통과하는지 여부로 보상을 줍니다.
형식 보상 (Format rewards): 모델이 생각하는 과정을 <think> 태그 안에, 최종 답변을 <answer> 태그 안에 넣는 등 정해진 형식을 잘 따랐는지 확인하여 보상을 줍니다.

"베이스라인의 능력(기존 지식/능력)은 지키면서도, (보상을 통해) 답을 잘 맞추고 형식도 잘 따르도록 꾸준히 학습하여, 결국 말을 잘 듣고 문제도 잘 푸는 똑똑한 모델이 되어가는 과정"