LLM: Paper Review: Direct Preference Optimization: Your Language Model is Secretly a Reward Model


AI바라기 2025. 1. 27. 16:50

Direct Preference Optimization (DPO): Summary Notes

Purpose of the Paper:

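Reinforcement Learning from Human Feedback (RLHF), the established method for fine-tuning large language models (LMs) on human preferences, is complex, often unstable, and computationally expensive. RLHF is a two-stage process: it first trains a reward model and then fine-tunes the LM against that reward model with reinforcement learning. This brings unstable reward model training, costly sampling from the LM, and difficult hyperparameter tuning.

This paper proposes Direct Preference Optimization (DPO), a new method that removes the complexity of RLHF while improving stability and efficiency. Instead of explicitly training a reward model, DPO optimizes directly on human preference data and reaches the same objective as RLHF. In particular, DPO can be trained with a simple classification loss and needs neither LM sampling inside an RL loop nor extensive hyperparameter tuning, which addresses the main drawbacks of RLHF.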

Key Contributions:

DPO algorithm: By reparameterizing the RLHF reward model, the authors extract the optimal policy in closed form and propose DPO, a direct preference optimization algorithm based on a classification loss.
RL-free Preference Learning: DPO merges reward model training and RL fine-tuning into a single step that minimizes one classification loss, which greatly simplifies the RLHF pipeline.
Stable & Performant: DPO trains stably, matches or exceeds PPO-based RLHF, and costs far less compute.
Theoretical Justification: The paper shows theoretically that DPO implicitly optimizes the RLHF objective under the Bradley-Terry preference model and the Plackett-Luce model.
Empirical Validation: Experiments on sentiment modulation, summarization, and dialogue tasks support DPO. It controls sentiment better than PPO-based RLHF and matches or improves response quality on summarization and dialogue, while being much simpler to implement and train.

Novelty:

Implicit Reward Modeling: Unlike existing RLHF methods, which train an explicit reward model, DPO reparameterizes the problem so that the policy network itself acts as an implicit reward model. This sidesteps the instability of separate reward model training and enables end-to-end preference learning.
Classification Loss for RLHF Objective: DPO replaces the complex RL objective of RLHF with a simple binary cross-entropy objective. This eases implementation and hyperparameter tuning and makes preference learning accessible to more researchers and practitioners.
Direct Policy Optimization: DPO optimizes the policy directly and avoids the actor-critic architecture and policy gradient estimation used in existing RLHF. This contributes to training stability and computational efficiency.

Experimental Highlights:

IMDb Sentiment Generation: In a controlled setting, DPO achieved a better reward-KL frontier than baselines such as PPO and Preferred-FT. It even achieved a better frontier than PPO-GT, which has access to the ground-truth reward (the sentiment classifier), which indicates the quality of DPO's optimization.
TL;DR Summarization & Anthropic-HH Dialogue: On real-world preference datasets, DPO matched or outperformed PPO-based RLHF, Preferred-FT, and Best-of-N baselines. On summarization, DPO exceeded PPO's best-case performance and was more robust to changes in sampling temperature. On dialogue, DPO was the only computationally efficient method that improved over the preferred completions in the Anthropic-HH dataset.
Out-of-distribution Generalization: When policies trained on Reddit TL;DR summarization were transferred zero-shot to CNN/DailyMail news summarization, DPO generalized better than PPO.

Limitations and Future Work:

Generalization to new input distribution: The out-of-distribution generalization of DPO policies needs further study. In particular, it should be compared with methods that learn an explicit reward function, and the use of unlabeled prompts through self-labeling should be explored.
Reward over-optimization: It remains to be investigated how reward over-optimization manifests in the direct preference optimization setting, and whether the slight performance decrease observed in Figure 3-right is an instance of it.
Scaling to larger models: The paper evaluates models up to 6B parameters. Scaling DPO to much larger state-of-the-art models, verifying its performance and efficiency there, and resolving the challenges that arise during scaling are left for future work.
Evaluation metric refinement: GPT-4 win-rate judgments depend on the prompt. Future work should reduce this dependence and study the best way to elicit high-quality judgments from automated systems.
Broader applications: Beyond fine-tuning language models, applications of DPO to training generative models in other modalities should be explored.

Overall Assessment:

DPO greatly reduces the complexity of RLHF and makes preference learning more stable and efficient. Simple yet powerful, the algorithm is expected to have a large influence on future language model alignment research and applications. Its main significance is that it makes preference learning far more accessible: LMs can be fine-tuned on human preferences without RL expertise.


Abstract

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult because their training is completely unsupervised. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure: it first fits a reward model that reflects the human preferences, and then fine-tunes the large unsupervised LM with reinforcement learning to maximize this estimated reward without drifting too far from the original model.

This paper introduces a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, so that the standard RLHF problem can be solved with only a simple classification loss. The resulting algorithm, which the authors call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, and it eliminates the need to sample from the LM during fine-tuning or to perform significant hyperparameter tuning. The experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.

Notably, fine-tuning with DPO exceeds PPO-based RLHF in the ability to control the sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.

1 Introduction

Large unsupervised language models (LMs) trained on very large datasets acquire surprising capabilities. However, these models are trained on data generated by humans with a wide variety of goals, priorities, and skillsets. Some of these goals and skillsets may not be desirable to imitate. For example, while we may want our AI coding assistant to understand common programming mistakes in order to correct them, when generating code we would nevertheless like to bias the model toward the (potentially rare) high-quality coding ability present in its training data. Similarly, we might want our language model to be aware of a common misconception believed by 50% of people, but we certainly do not want the model to claim this misconception to be true in 50% of queries about it!
In other words, selecting the model's desired responses and behavior from its very wide knowledge and abilities is crucial to building AI systems that are safe, performant, and controllable. While existing methods typically steer LMs to match human preferences using reinforcement learning (RL), the paper shows that the RL-based objective used by existing methods can be optimized exactly with a simple binary cross-entropy objective, which greatly simplifies the preference learning pipeline.

At a high level, existing methods instill the desired behaviors into a language model using curated sets of human preferences that represent the types of behaviors humans find safe and helpful. This preference learning stage occurs after an initial stage of large-scale unsupervised pre-training on a large text dataset. While the most straightforward approach to preference learning is supervised fine-tuning on human demonstrations of high-quality responses, the most successful class of methods is reinforcement learning from human (or AI) feedback (RLHF/RLAIF). RLHF methods fit a reward model to a dataset of human preferences and then use RL to optimize a language model policy to produce responses assigned high reward without drifting excessively far from the original model. While RLHF produces models with impressive conversational and coding abilities, the RLHF pipeline is considerably more complex than supervised learning: it involves training multiple LMs and sampling from the LM policy in the training loop, which incurs significant computational costs.

This paper shows how to directly optimize a language model to adhere to human preferences, without explicit reward modeling or reinforcement learning. The authors propose Direct Preference Optimization (DPO), an algorithm that implicitly optimizes the same objective as existing RLHF algorithms (reward maximization with a KL-divergence constraint) but is simple to implement and straightforward to train. Intuitively, the DPO update increases the relative log probability of preferred to dispreferred responses, but it incorporates a dynamic, per-example importance weight that prevents the model degeneration that occurs with a naive probability ratio objective. Like existing algorithms, DPO relies on a theoretical preference model (such as the Bradley-Terry model) that measures how well a given reward function aligns with empirical preference data. However, while existing methods use the preference model to define a preference loss to train a reward model and then train a policy that optimizes the learned reward model, DPO uses a change of variables to define the preference loss as a function of the policy directly. Given a dataset of human preferences over model responses, DPO can therefore optimize a policy using a simple binary cross-entropy objective, producing the optimal policy for an implicit reward function fit to the preference data.

The main contribution is Direct Preference Optimization (DPO), a simple RL-free algorithm for training language models from preferences. The experiments show that DPO is at least as effective as existing methods, including PPO-based RLHF, for learning from preferences in tasks such as sentiment modulation, summarization, and dialogue, using language models with up to 6B parameters.

Figure 1: DPO optimizes for human preferences while avoiding reinforcement learning. Existing methods for fine-tuning language models with human feedback first fit a reward model to a dataset of prompts and human preferences over pairs of responses, and then use RL to find a policy that maximizes the learned reward.
In contrast, DPO directly optimizes for the policy that best satisfies the preferences with a simple classification objective, fitting an implicit reward model whose corresponding optimal policy can be extracted in closed form.

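DPO Paper, Introduction Section: Summary Notes (Quick Summary for AI Researchers)

Core problem: Large language models (LMs) learn vast knowledge, but because their training is unsupervised they are hard to steer. Trained on data from many different humans, they can imitate undesirable behavior (poor coding style, common misconceptions, and so on). Controlling LM behavior is essential for building AI systems that are safe, performant, and controllable.

Problems with the existing approach (RLHF):

Reinforcement Learning from Human Feedback (RLHF) is currently the main method for incorporating human preferences.
RLHF:
1. Train a reward model on a human preference dataset.
2. Fine-tune the LM with RL so that it generates responses the reward model scores highly (constrained so that it does not drift too far from the original model).
Complexity and drawbacks of RLHF:
- Complex pipeline: reward model training + RL fine-tuning (multiple LMs must be trained)
- Instability and difficult hyperparameter tuning
- High computational cost from sampling the LM inside the training loop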

Core idea of the paper (Direct Preference Optimization - DPO):

Proposes DPO, a new algorithm that removes the complexity of RLHF and optimizes the LM directly on human preferences.
Key point: without reward model training or an RL loop, a simple classification loss alone achieves the same objective as RLHF (reward maximization under a KL constraint).
How DPO works (intuition):
- Raise the log probability of preferred responses relative to dispreferred responses.
- Apply a dynamic per-example importance weight to avoid the model degeneration caused by a naive probability ratio objective.
Theoretical basis: a preference model such as the Bradley-Terry model (same as existing RLHF).
Difference: RLHF uses the preference model to train a reward model, whereas DPO uses a change of variables to define the preference loss directly as a function of the policy and optimizes that.

Advantages of DPO:

Simple: easy to implement and train (simple binary cross-entropy objective).
Stable: high training stability.
Computationally Lightweight: no LM sampling during fine-tuning, less hyperparameter tuning.
Performant: as effective as or better than PPO-based RLHF (sentiment control, summarization, dialogue tasks).

Conclusion: DPO is a novel algorithm that overcomes the complexity of RLHF and incorporates human preferences into an LM more simply and efficiently. It greatly simplifies the preference learning pipeline and is competitive with existing methods across several tasks, which shows that RL-free preference learning is feasible.

2 Related Work

Self-supervised language models of increasing scale learn to complete some tasks zero-shot or with few-shot prompts. However, their performance on downstream tasks and their alignment with user intent can be significantly improved by fine-tuning on datasets of instructions and human-written completions. This 'instruction-tuning' procedure enables LLMs to generalize to instructions outside of the instruction-tuning set and generally increases their usability. Despite the success of instruction tuning, relative human judgments of response quality are often easier to collect than expert demonstrations, and thus subsequent works have fine-tuned LLMs with datasets of human preferences, improving proficiency in translation, summarization, story-telling, and instruction-following. These methods first optimize a neural network reward function for compatibility with the dataset of preferences under a preference model such as the Bradley-Terry model, then fine-tune a language model to maximize the given reward using reinforcement learning algorithms, commonly REINFORCE, proximal policy optimization (PPO), or variants. A closely related line of work leverages LLMs fine-tuned for instruction following with human feedback to generate additional synthetic preference data for targeted attributes such as safety or harmlessness, using only weak supervision from humans in the form of a text rubric for the LLM's annotations. These methods represent a convergence of two bodies of work: one on training language models with reinforcement learning for a variety of objectives, and another on general methods for learning from human preferences. Despite the appeal of using relative human preferences, fine-tuning large language models with reinforcement learning remains a major practical challenge; this work provides a theoretically justified approach to optimizing relative preferences without RL.

Outside of the context of language, learning policies from preferences has been studied in both bandit and reinforcement learning settings, and several approaches have been proposed. Contextual bandit learning that uses preferences or rankings of actions, rather than rewards, is known as a contextual dueling bandit (CDB). In the absence of absolute rewards, theoretical analysis of CDBs substitutes the notion of an optimal policy with a von Neumann winner, a policy whose expected win rate against any other policy is at least 50%. However, in the CDB setting preference labels are given online, while in learning from human preferences we typically learn from a fixed batch of offline preference-annotated action pairs. Similarly, preference-based RL (PbRL) learns from binary preferences generated by an unknown 'scoring' function rather than from rewards. Various algorithms for PbRL exist, including methods that can reuse off-policy preference data, but they generally involve first explicitly estimating the latent scoring function (i.e. the reward model) and subsequently optimizing it. The paper instead presents a single-stage policy learning approach that directly optimizes a policy to satisfy preferences.

DPO Paper, Related Work Section: Summary Notes (Quick Summary for AI Researchers)

Context: Prior work on fine-tuning large language models (LLMs) follows two lines, instruction tuning and human preference learning. This paper focuses on human preference learning and aims to overcome the limitations of the existing method, RLHF (Reinforcement Learning from Human Feedback).

Instruction Tuning (brief mention, contrast emphasized):

Instruction tuning effectively improves LLM performance but requires expert demonstration data.
Human preference data (relative judgments of response quality) is easier to collect than expert demonstrations.
This paper focuses on preference learning, not instruction tuning.

RLHF: the mainstream approach to human preference learning and its problems:

RLHF process:
1. Train a reward model on a human preference dataset (based on a preference model such as the Bradley-Terry model).
2. Fine-tune the LM with a reinforcement learning (RL) algorithm (REINFORCE, PPO, etc.) so that it generates responses the reward model scores highly.
Main limitations of RLHF (what this paper aims to overcome):
- Complexity: a multi-stage pipeline of reward model training + RL fine-tuning.
- Hard to implement and train: the inherent instability of RL algorithms and difficult hyperparameter tuning.
- Computational cost: sampling from the LM during RL training is expensive.
- Practical challenge: applying RLHF to large language models remains a major practical challenge, which is why the paper presents a preference optimization approach without RL.

Preference learning in a broader context (Bandit/RL):

Preference learning is also actively studied outside language models (bandits, RL).
Contextual Dueling Bandit (CDB): uses action preferences or rankings instead of rewards. The notion of an optimal policy is replaced by the von Neumann winner. CDB assumes online preference labels, whereas this paper uses offline preference data.
Preference-based RL (PbRL): uses binary preferences generated by an unknown scoring function instead of rewards. The usual approach explicitly estimates the latent scoring function (reward model) and then optimizes it. This paper takes a single-stage policy learning approach with no explicit reward model estimation.

Distinction and contribution of this paper (from the DPO perspective):

Overcomes the complexity and drawbacks of RLHF: proposes Direct Preference Optimization (DPO), which optimizes for human preferences without RL.
Single-stage policy learning: optimizes the policy directly, with no explicit reward model estimation and no RL loop.
Theoretical Justification: provides the theoretical grounding that makes preference optimization without RL possible.
Practical Benefit: addresses the practical challenge of RLHF and shows that simpler and more efficient preference learning is feasible.

Key conclusion: The paper points out the complexity and difficulty of the existing RLHF approach and presents DPO, an algorithm that learns human preferences effectively without RL, opening a new direction for preference learning.

3 Preliminaries

We review the RLHF pipeline in Ziegler et al. (and later work). It usually includes three phases: 1) supervised fine-tuning (SFT); 2) preference sampling and reward learning; and 3) RL optimization.

SFT: RLHF typically begins by fine-tuning a pre-trained LM with supervised learning on high-quality data for the downstream task(s) of interest (dialogue, summarization, etc.), to obtain a model $\pi^{\mathrm{SFT}}$.

Reward Modelling Phase: In the second phase the SFT model is prompted with prompts $x$ to produce pairs of answers $(y_1, y_2) \sim \pi^{\mathrm{SFT}}(y \mid x)$. These are then presented to human labelers who express a preference for one answer, denoted $y_w \succ y_l \mid x$, where $y_w$ and $y_l$ denote the preferred and dispreferred completion among $(y_1, y_2)$ respectively. The preferences are assumed to be generated by some latent reward model $r^*(x, y)$, which we do not have access to. There are a number of approaches used to model preferences, the Bradley-Terry (BT) model being a popular choice (although more general Plackett-Luce ranking models are also compatible with the framework if we have access to several ranked answers). The BT model stipulates that the human preference distribution $p^*$ can be written as:

$$p^*(y_1 \succ y_2 \mid x) = \frac{\exp\left(r^*(x, y_1)\right)}{\exp\left(r^*(x, y_1)\right) + \exp\left(r^*(x, y_2)\right)}. \tag{1}$$

Assuming access to a static dataset of comparisons $\mathcal{D} = \{(x^{(i)}, y_w^{(i)}, y_l^{(i)})\}_{i=1}^{N}$ sampled from $p^*$, we can parametrize a reward model $r_\phi(x, y)$ and estimate the parameters via maximum likelihood. Framing the problem as binary classification, we have the negative log-likelihood loss:

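$$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right] \tag{2}$$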

where $\sigma$ is the logistic function. In the context of LMs, the network $r_\phi(x, y)$ is often initialized from the SFT model $\pi^{\mathrm{SFT}}(y \mid x)$ with the addition of a linear layer on top of the final transformer layer that produces a single scalar prediction for the reward value. To ensure a reward function with lower variance, prior works normalize the rewards such that $\mathbb{E}_{x, y \sim \mathcal{D}}\left[r_\phi(x, y)\right] = 0$ for all $x$.
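As a rough illustration of Eq. 1 and Eq. 2, the sketch below computes the Bradley-Terry preference probability and the reward-model negative log-likelihood from scalar scores. The function names and the score values are hypothetical and only stand in for the outputs of a reward model $r_\phi$; nothing here is taken from the paper's code.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log of the logistic function."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def bt_preference_prob(r1: float, r2: float) -> float:
    """Eq. 1: probability that completion 1 is preferred over completion 2."""
    return math.exp(r1) / (math.exp(r1) + math.exp(r2))

def reward_nll(score_pairs) -> float:
    """Eq. 2: mean of -log sigma(r_w - r_l) over (preferred, dispreferred) scores."""
    return -sum(log_sigmoid(rw - rl) for rw, rl in score_pairs) / len(score_pairs)

# Hypothetical reward-model scores (r_phi(x, y_w), r_phi(x, y_l)) for three comparisons.
scores = [(2.0, 0.5), (0.1, 0.3), (1.2, 1.2)]
prob = bt_preference_prob(2.0, 0.5)  # same value as sigma(2.0 - 0.5)
loss = reward_nll(scores)
print(f"P(y1 > y2) = {prob:.4f}, reward NLL = {loss:.4f}")
```

Because Eq. 1 depends only on the difference of the two rewards, it can be written as $\sigma(r_1 - r_2)$, which is why Eq. 2 looks like a binary classification loss.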

RL Fine-Tuning Phase: During the RL phase, the learned reward function is used to provide feedback to the language model. Following prior works, the optimization is formulated as:

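$$\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\left[r_\phi(x, y) - \beta\, \mathbb{D}_{\mathrm{KL}}\left(\pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x)\right)\right] \tag{3}$$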

where $\beta$ is a parameter controlling the deviation from the base reference policy $\pi_{\mathrm{ref}}$, namely the initial SFT model $\pi^{\mathrm{SFT}}$. In practice, the language model policy $\pi_\theta$ is also initialized to $\pi^{\mathrm{SFT}}$. The added constraint is important: it prevents the model from deviating too far from the distribution on which the reward model is accurate, and it also maintains generation diversity and prevents mode collapse to single high-reward answers. Due to the discrete nature of language generation, this objective is not differentiable and is typically optimized with reinforcement learning. The standard approach has been to construct the reward function $r(x, y) = r_\phi(x, y) - \beta\left(\log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x)\right)$ and maximize it using PPO.
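To make the link between Eq. 3 and the KL-shaped reward concrete, here is a minimal sketch on a toy problem with three possible responses to a single prompt. The policies, rewards, and function names are made up for illustration. It checks that the expectation of the shaped reward under the policy equals the expected reward minus $\beta$ times the KL divergence.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions over the same responses."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def shaped_reward(r, logp_policy, logp_ref, beta):
    """Per-sample reward r(x, y) - beta * (log pi_theta - log pi_ref) given to PPO."""
    return r - beta * (logp_policy - logp_ref)

def objective(policy, ref, rewards, beta):
    """Eq. 3 for one prompt: E_{y ~ policy}[r(y)] - beta * KL(policy || ref)."""
    expected_r = sum(p * r for p, r in zip(policy, rewards))
    return expected_r - beta * kl_divergence(policy, ref)

# Hypothetical toy setup: three candidate responses to one prompt.
ref = [0.5, 0.3, 0.2]      # pi_ref(y | x)
policy = [0.2, 0.3, 0.5]   # pi_theta(y | x)
rewards = [0.0, 1.0, 2.0]  # r_phi(x, y)
beta = 0.5

expected_shaped = sum(
    p * shaped_reward(r, math.log(p), math.log(q), beta)
    for p, q, r in zip(policy, ref, rewards)
)
print(expected_shaped, objective(policy, ref, rewards, beta))
```

In a real RLHF run the expectation is not available in closed form and is estimated from sampled completions, which is where the sampling cost comes from.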

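DPO Paper, Section 3 Preliminaries: Summary Notes (Quick Summary for AI Researchers)

Key point: This section describes the existing RLHF (Reinforcement Learning from Human Feedback) pipeline that DPO aims to improve. It provides the background needed to understand the motivation for DPO.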

RLHF pipeline (3 phases):

Supervised Fine-Tuning (SFT):
- Goal: initial fine-tuning of the pre-trained LM for the downstream task (dialogue, summarization, etc.), yielding the model π_SFT.
- Key point: uses high-quality data and supervised learning.
Reward Modeling Phase:
- Goal: train a reward model r_φ(x, y) that reflects human preferences.
- Procedure:
  - Generate answer pairs (y1, y2) for prompt x with the π_SFT model.
  - Human labelers mark their preference within each pair (yw ≻ yl | x).
  - Train the reward model r_φ(x, y) under a preference model such as Bradley-Terry (BT), via maximum likelihood estimation.
  - Loss function: a negative log-likelihood loss in binary classification form (using the logistic function σ).
  - Reward model architecture: based on the SFT model (a linear layer added on top of the final transformer layer). Rewards are normalized to reduce variance.
- Key point: approximates the reward function from human preference data (pairwise comparisons). The Bradley-Terry assumption is central.
RL Fine-Tuning Phase:
- Goal: fine-tune the LM policy π_θ with the learned reward function r_φ(x, y) so that it generates answers aligned with human preferences.
- Objective function: reward maximization combined with a KL divergence penalty.
  - max_(π_θ) E_(x~D, y~π_θ(y|x)) [r_φ(x, y) - β * D_KL(π_θ(y|x) || π_ref(y|x))]
  - β: parameter controlling the deviation from the reference policy (π_ref = π_SFT).
  - D_KL: KL divergence penalty (prevents drift from the reference policy, maintains generation diversity, prevents mode collapse).
- Optimization method: reinforcement learning (RL), because the objective is not differentiable. PPO (Proximal Policy Optimization) is the standard approach.
- Reward function used with PPO: r(x, y) = r_φ(x, y) - β(log π_θ(y|x) - log π_ref(y|x))
- Key point: the policy is updated toward human preferences using the reward function and RL. The KL penalty provides stability and diversity.

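Limitations of RLHF (DPO motivation):

Complex multi-stage pipeline: three phases of SFT, reward modeling, and RL fine-tuning.
Difficulty of RL: instability, hyperparameter tuning, high computational cost.
Indirect proxy reward: the separately trained reward model only approximates human preferences indirectly.

Direction for DPO (implicitly hinted):

Suggests the need for a simpler algorithm that removes the complexity of RLHF and incorporates human preferences into the policy directly, without RL.
Points to the possibility of optimizing the policy directly from preference data instead of training an explicit reward model.

Conclusion: Section 3 Preliminaries lays out the RLHF pipeline whose complexity and limitations motivate the paper, and it provides the background for the DPO algorithm presented next.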

4 Direct Preference Optimization

Motivated by the challenges of applying reinforcement learning algorithms to large-scale problems such as fine-tuning language models, the goal is to derive a simple approach for policy optimization using preferences directly. Unlike prior RLHF methods, which learn a reward and then optimize it via RL, this approach leverages a particular choice of reward model parameterization that enables extraction of its optimal policy in closed form, without an RL training loop. As described next in detail, the key insight is to leverage an analytical mapping from reward functions to optimal policies, which makes it possible to transform a loss function over reward functions into a loss function over policies. This change-of-variables approach avoids fitting an explicit, standalone reward model, while still optimizing under existing models of human preferences such as the Bradley-Terry model. In essence, the policy network represents both the language model and the (implicit) reward.

Deriving the DPO objective. We start with the same RL objective as prior work, Eq. 3, under a general reward function $r$. Following prior work, it is straightforward to show that the optimal solution to the KL-constrained reward maximization objective in Eq. 3 takes the form:

$$\pi_r(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right), \tag{4}$$

where $Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right)$ is the partition function. See Appendix A.1 for a complete derivation. Even if we use the MLE estimate $r_\phi$ of the ground-truth reward function $r^*$, it is still expensive to estimate the partition function $Z(x)$, because the sum runs over all possible completions $y$; this makes the representation hard to use in practice. However, we can rearrange Eq. 4 to express the reward function in terms of its corresponding optimal policy $\pi_r$, the reference policy $\pi_{\mathrm{ref}}$, and the unknown partition function $Z(\cdot)$. Specifically, we first take the logarithm of both sides of Eq. 4 and then, with some algebra, obtain:

$$r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x). \tag{5}$$
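The closed form in Eq. 4 and its inversion in Eq. 5 can be checked numerically on a small discrete example. The sketch below is only an illustration under made-up assumptions: three candidate responses for a single prompt, a hypothetical reference policy, and hypothetical rewards. It builds $\pi_r$, recovers the rewards through Eq. 5, and compares the KL-constrained objective at $\pi_r$ against randomly drawn policies.

```python
import math
import random

def optimal_policy(ref, rewards, beta):
    """Eq. 4: pi_r(y) = pi_ref(y) * exp(r(y) / beta) / Z. Returns (pi_r, Z)."""
    weights = [q * math.exp(r / beta) for q, r in zip(ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z

def objective(policy, ref, rewards, beta):
    """KL-constrained objective for one prompt: E[r] - beta * KL(policy || ref)."""
    return sum(
        p * (r - beta * math.log(p / q))
        for p, q, r in zip(policy, ref, rewards)
        if p > 0
    )

# Hypothetical toy setup.
ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 0.5

pi_r, z = optimal_policy(ref, rewards, beta)

# Eq. 5: the reward is recovered from the optimal policy, the reference, and Z.
recovered = [beta * math.log(p / q) + beta * math.log(z) for p, q in zip(pi_r, ref)]

# No randomly drawn policy should beat the closed-form solution.
random.seed(0)
best_random = -math.inf
for _ in range(1000):
    raw = [random.random() + 1e-9 for _ in ref]
    total = sum(raw)
    candidate = [v / total for v in raw]
    best_random = max(best_random, objective(candidate, ref, rewards, beta))

print("pi_r:", pi_r)
print("objective at pi_r:", objective(pi_r, ref, rewards, beta), "best random:", best_random)
```

With three responses, $Z$ is a trivial sum. For a language model the sum ranges over every possible completion, which is the cost that the change of variables in DPO avoids.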

We can apply this reparameterization to the ground-truth reward $r^*$ and the corresponding optimal model $\pi^*$. Fortunately, the Bradley-Terry model depends only on the difference of rewards between two completions, i.e., $p^*(y_1 \succ y_2 \mid x) = \sigma\left(r^*(x, y_1) - r^*(x, y_2)\right)$. Substituting the reparameterization in Eq. 5 for $r^*(x, y)$ into the preference model Eq. 1, the partition function cancels, and we can express the human preference probability in terms of only the optimal policy $\pi^*$ and the reference policy $\pi_{\mathrm{ref}}$. Thus, the optimal RLHF policy $\pi^*$ under the Bradley-Terry model satisfies the preference model:

$$p^*(y_1 \succ y_2 \mid x) = \frac{1}{1 + \exp\left(\beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\mathrm{ref}}(y_2 \mid x)} - \beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_1 \mid x)}\right)} \tag{6}$$

The derivation is in Appendix A.2. While Eq. 6 uses the Bradley-Terry model, we can similarly derive expressions under the more general Plackett-Luce models, as shown in Appendix A.3. Now that we have the probability of human preference data in terms of the optimal policy rather than the reward model, we can formulate a maximum likelihood objective for a parametrized policy $\pi_\theta$. Analogous to the reward modeling approach (i.e. Eq. 2), our policy objective becomes:

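$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]. \tag{7}$$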

This way, we fit an implicit reward using an alternative parameterization, whose optimal policy is simply $\pi_\theta$. Moreover, since the procedure is equivalent to fitting a reparametrized Bradley-Terry model, it enjoys certain theoretical properties, such as consistency under suitable assumptions on the preference data distribution. Section 5 further discusses the theoretical properties of DPO in relation to other works.
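A minimal sketch of the loss in Eq. 7 follows, assuming the sequence log-probabilities $\log \pi_\theta(y \mid x)$ and $\log \pi_{\mathrm{ref}}(y \mid x)$ have already been computed (for an autoregressive LM, by summing token log-probabilities over the completion). The function names, the tuple layout, and all numbers are hypothetical; this is plain Python for readability, not the authors' implementation.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log of the logistic function."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(batch, beta: float) -> float:
    """Eq. 7 averaged over a batch.

    Each item is (logp_w, logp_l, ref_logp_w, ref_logp_l): sequence
    log-probabilities of the preferred (w) and dispreferred (l) completions
    under the policy and under the frozen reference model.
    """
    total = 0.0
    for logp_w, logp_l, ref_logp_w, ref_logp_l in batch:
        reward_w = beta * (logp_w - ref_logp_w)  # implicit reward of y_w
        reward_l = beta * (logp_l - ref_logp_l)  # implicit reward of y_l
        total -= log_sigmoid(reward_w - reward_l)
    return total / len(batch)

# Hypothetical log-probabilities for one preference pair.
at_init = [(-12.0, -15.0, -12.0, -15.0)]   # policy identical to the reference
improved = [(-11.0, -16.0, -12.0, -15.0)]  # policy shifted toward y_w

print(dpo_loss(at_init, beta=0.1))   # log(2): both implicit rewards are zero
print(dpo_loss(improved, beta=0.1))  # lower than log(2)
```

Only forward passes of the policy and the reference model on a fixed dataset are needed, so no completions are sampled during training.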

What does the DPO update do? For a mechanistic understanding of DPO, it is useful to analyze the gradient of the loss function $\mathcal{L}_{\mathrm{DPO}}$. The gradient with respect to the parameters $\theta$ can be written as:

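$$\nabla_\theta \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\sigma\left(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\right)\left[\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\right]\right],$$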

where $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ is the reward implicitly defined by the language model $\pi_\theta$ and the reference model $\pi_{\mathrm{ref}}$ (more in Section 5). Intuitively, the gradient of the loss function $\mathcal{L}_{\mathrm{DPO}}$ increases the likelihood of the preferred completions $y_w$ and decreases the likelihood of the dispreferred completions $y_l$. Importantly, the examples are weighted by how much higher the implicit reward model $\hat{r}_\theta$ rates the dispreferred completions, scaled by $\beta$, i.e., by how incorrectly the implicit reward model orders the completions, accounting for the strength of the KL constraint. The experiments suggest that this weighting matters, as a naive version of the method without the weighting coefficient can cause the language model to degenerate (Appendix Table 3).
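The per-example weight $\beta\,\sigma(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w))$ can be inspected directly. The sketch below uses hypothetical sequence log-probabilities and function names. It compares the weight for a correctly ordered pair with the weight for a mis-ordered pair, and checks the analytic derivative of the per-example loss with respect to $\log \pi_\theta(y_w \mid x)$ against a finite difference.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def example_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Per-example DPO loss: -log sigma(r_hat(y_w) - r_hat(y_l))."""
    u = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(u))

def gradient_weight(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """beta * sigma(r_hat(y_l) - r_hat(y_w)): the scale of the DPO update."""
    r_w = beta * (logp_w - ref_logp_w)
    r_l = beta * (logp_l - ref_logp_l)
    return beta * sigmoid(r_l - r_w)

beta = 0.5
ref = (-10.0, -10.0)  # hypothetical reference log-probs for y_w and y_l

well_ordered = gradient_weight(-6.0, -14.0, *ref, beta)  # policy already prefers y_w
mis_ordered = gradient_weight(-14.0, -6.0, *ref, beta)   # policy prefers y_l
print(f"well ordered: {well_ordered:.4f}, mis-ordered: {mis_ordered:.4f}")

# Finite-difference check of d loss / d log pi_theta(y_w | x) = -weight.
eps = 1e-6
args = (-9.0, -11.0, -10.0, -10.0)
numeric = (
    example_loss(args[0] + eps, *args[1:], beta)
    - example_loss(args[0] - eps, *args[1:], beta)
) / (2 * eps)
analytic = -gradient_weight(*args, beta)
print(numeric, analytic)
```

Pairs the implicit reward already orders correctly contribute almost nothing, while mis-ordered pairs get a weight close to $\beta$. A version without this coefficient would push on every pair equally, which is the naive variant the paper reports can cause degeneration.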

DPO outline. The general DPO pipeline is as follows: 1) sample completions $y_1, y_2 \sim \pi_{\mathrm{ref}}(\cdot \mid x)$ for every prompt $x$, and label them with human preferences to construct the offline dataset of preferences $\mathcal{D} = \{(x^{(i)}, y_w^{(i)}, y_l^{(i)})\}_{i=1}^{N}$; and 2) optimize the language model $\pi_\theta$ to minimize $\mathcal{L}_{\mathrm{DPO}}$ for the given $\pi_{\mathrm{ref}}$, $\mathcal{D}$, and desired $\beta$. In practice, one would like to reuse publicly available preference datasets rather than generating samples and gathering human preferences. Since the preference datasets are sampled using $\pi^{\mathrm{SFT}}$, we initialize $\pi_{\mathrm{ref}} = \pi^{\mathrm{SFT}}$ whenever it is available. However, when $\pi^{\mathrm{SFT}}$ is not available, we initialize $\pi_{\mathrm{ref}}$ by maximizing the likelihood of preferred completions $(x, y_w)$, that is, $\pi_{\mathrm{ref}} = \arg\max_{\pi} \mathbb{E}_{x, y_w \sim \mathcal{D}}\left[\log \pi(y_w \mid x)\right]$. This procedure helps mitigate the distribution shift between the true reference distribution, which is unavailable, and the $\pi_{\mathrm{ref}}$ used by DPO. Further details on the implementation and hyperparameters can be found in Appendix B.
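The two-step outline can be sketched end to end on a toy problem. Everything below is an illustrative assumption rather than the paper's setup: a single prompt with three candidate responses, a uniform reference policy, a softmax policy parameterized by three logits, a hand-built preference dataset whose win counts roughly follow a Bradley-Terry model with rewards 0, 1, 2, and plain full-batch gradient descent.

```python
import math

def log_softmax(logits):
    """Log-probabilities of a categorical policy defined by logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(t - m) for t in logits))
    return [t - lse for t in logits]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss_and_grad(logits, ref_logp, pairs, beta):
    """Mean DPO loss over (winner, loser) index pairs and its gradient w.r.t. logits."""
    logp = log_softmax(logits)
    loss = 0.0
    grad = [0.0] * len(logits)
    for w, l in pairs:
        u = beta * ((logp[w] - ref_logp[w]) - (logp[l] - ref_logp[l]))
        loss += -math.log(sigmoid(u))
        weight = beta * sigmoid(-u)  # beta * sigma(r_hat(y_l) - r_hat(y_w))
        # For a softmax policy, grad of (log pi[w] - log pi[l]) is onehot(w) - onehot(l).
        grad[w] -= weight
        grad[l] += weight
    n = len(pairs)
    return loss / n, [g / n for g in grad]

# Step 1 (hypothetical data): offline preference pairs as (winner, loser) indices.
pairs = (
    [(2, 0)] * 88 + [(0, 2)] * 12
    + [(1, 0)] * 73 + [(0, 1)] * 27
    + [(2, 1)] * 73 + [(1, 2)] * 27
)

# Step 2: minimize the DPO loss, starting from the reference policy.
ref_logp = [math.log(1.0 / 3.0)] * 3
logits = list(ref_logp)
beta, lr = 1.0, 1.0

initial_loss, _ = dpo_loss_and_grad(logits, ref_logp, pairs, beta)
for _ in range(500):
    _, grad = dpo_loss_and_grad(logits, ref_logp, pairs, beta)
    logits = [t - lr * g for t, g in zip(logits, grad)]
final_loss, _ = dpo_loss_and_grad(logits, ref_logp, pairs, beta)

final_logp = log_softmax(logits)
policy = [math.exp(lp) for lp in final_logp]
implicit_reward = [beta * (lp - rp) for lp, rp in zip(final_logp, ref_logp)]
print("loss:", initial_loss, "->", final_loss)
print("policy:", policy)
print("implicit reward gaps:", implicit_reward[1] - implicit_reward[0],
      implicit_reward[2] - implicit_reward[0])
```

No response is ever sampled from the policy during training; the loop only re-scores the fixed pairs, which is the main practical difference from the PPO-based pipeline. A real implementation would replace the three logits with a language model and the index pairs with tokenized prompts and completions.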

DPO Paper, Section 4 Direct Preference Optimization: Summary Notes (Quick Summary for AI Researchers)

Key point: Direct Preference Optimization (DPO) is a new algorithm that removes the complexity of RLHF and optimizes the policy directly on human preferences, without reinforcement learning. The core idea is to use the analytical relationship between a reward function and its optimal policy to turn a loss over reward functions into a loss over policies.

Motivation for DPO:

Problems with existing RLHF: complex RL pipeline, instability, high computational cost.
Goal of DPO: achieve the same preference optimization as RLHF in a simpler and more efficient way.

Core idea of DPO: Reward Function Reparameterization & Change of Variables

Starting point: the same KL-constrained reward maximization objective as RLHF (Eq. 3).
Key insight 1: Closed-form optimal policy: the optimal policy of the KL-constrained reward maximization problem has a closed-form solution expressed through the reference policy and the reward function (Eq. 4).
- π_r(y|x) = (1/Z(x)) * π_ref(y|x) * exp((1/β) * r(x, y))
Key insight 2: Reward function reparameterization: inverting the optimal policy expression above expresses the reward function through the optimal policy, the reference policy, and the partition function (Eq. 5).
- r(x, y) = β * log(π_r(y|x) / π_ref(y|x)) + β * log(Z(x))
Key insight 3: Change of variables & partition function cancellation: substituting this reparameterization into the Bradley-Terry preference model (Eq. 1) cancels the partition function Z(x). The human preference probability can then be written with policies only, instead of the reward function (Eq. 6).
- p*(y1 ≻ y2 | x) = 1 / (1 + exp(β * log(π*(y2|x) / π_ref(y2|x)) - β * log(π*(y1|x) / π_ref(y1|x))))
Conclusion: the optimization problem over reward functions is converted into an optimization problem over policies. An explicit reward model no longer needs to be trained.
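The cancellation behind Eq. 6 takes only a few lines. The steps below are a worked version written for these notes (the paper's own derivation is in its Appendix A.2); they use only Eq. 1 in its sigmoid form and Eq. 5 applied to $r^*$ and $\pi^*$.

```latex
\begin{align*}
p^*(y_1 \succ y_2 \mid x)
  &= \sigma\big(r^*(x, y_1) - r^*(x, y_2)\big) \\
  &= \sigma\Big(\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_1 \mid x)} + \beta \log Z(x)
      - \beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\mathrm{ref}}(y_2 \mid x)} - \beta \log Z(x)\Big) \\
  &= \sigma\Big(\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_1 \mid x)}
      - \beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\mathrm{ref}}(y_2 \mid x)}\Big) \\
  &= \frac{1}{1 + \exp\Big(\beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\mathrm{ref}}(y_2 \mid x)}
      - \beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_1 \mid x)}\Big)}
\end{align*}
```

The term $\beta \log Z(x)$ depends only on the prompt $x$, so it appears with opposite signs for $y_1$ and $y_2$ and drops out. The last line uses $\sigma(z) = 1 / (1 + e^{-z})$.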

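DPO Loss Function (Eq. 7):

Defines a maximum likelihood objective (the DPO loss) for the policy π_θ, based on the Bradley-Terry preference model.
Optimizes the policy π_θ directly on the pairwise preference dataset D, with no reward model training.
Form of the loss: a binary cross-entropy loss whose logit is β times the difference between the log policy ratios (policy over reference) of the preferred and dispreferred completions.

DPO Update Mechanism (gradient analysis):

Analyzing the gradient of the DPO loss gives an intuitive picture of the update direction.
Role of the gradient: increases the probability of the preferred completion yw and decreases the probability of the dispreferred completion yl.
Importance of weighting: each example is weighted by how much the implicit reward model wrongly overrates the dispreferred completion. The naive version (without weighting) can cause model degeneration.

DPO Pipeline:

Build the preference dataset D (from samples of the reference policy π_ref).
Optimize the policy π_θ by minimizing the DPO loss (Eq. 7).

Key advantages of DPO:

Simplicity: a much simpler pipeline than RLHF (no RL and no reward model training).
Efficiency: lower computational cost and an expected gain in training stability.
Theoretical grounding: built on existing preference models such as the Bradley-Terry model, which gives it theoretical justification.

Conclusion: DPO greatly reduces the complexity of RLHF and is a promising algorithm for making human preference learning easier and more efficient. The core idea is the change-of-variables trick that reparameterizes the reward function in terms of the policy.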