Publication record · 18.cifr/2023.rafailov.dpo

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

v1.0.0

Rafael Rafailov (Stanford University), Archit Sharma (Stanford University), Eric Mitchell (Stanford University), Stefano Ermon (Stanford University), Christopher D. Manning (Stanford University), Chelsea Finn (Stanford University)

RAI: 18.cifr/2023.rafailov.dpo

NeurIPS 2023 · doi:10.48550/arXiv.2305.18290

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning.
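The closed-form step mentioned in the abstract can be written out as a short derivation. The block below is a sketch in the paper's usual notation; the symbols ($\pi_\theta$ for the policy being trained, $\pi_{\mathrm{ref}}$ for the reference model, $\beta$ for the KL weight, $Z(x)$ for the partition function, $y_w$ and $y_l$ for the preferred and dispreferred responses) do not appear on this page and are assumed here.

```latex
\begin{align*}
% KL-constrained reward maximization (the standard RLHF objective)
&\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\bigl[ r(x, y) \bigr]
  - \beta\, \mathbb{D}_{\mathrm{KL}}\bigl[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \bigr] \\[4pt]
% Its optimal policy in closed form
&\pi_r(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Bigl( \tfrac{1}{\beta}\, r(x, y) \Bigr) \\[4pt]
% Solving for the reward in terms of the policy
&r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \\[4pt]
% Bradley-Terry preference model: only reward differences matter, so Z(x) cancels
&p(y_w \succ y_l \mid x) = \sigma\bigl( r(x, y_w) - r(x, y_l) \bigr) \\[4pt]
% Substituting the reparameterized reward gives the DPO classification loss
&\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\Bigl[ \log \sigma\Bigl(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Bigr) \Bigr]
\end{align*}
```

Because the intractable term $\beta \log Z(x)$ depends only on the prompt, it drops out of the reward difference, which is why no sampling from the LM is needed to evaluate the loss.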

cs.LG · cs.AI · cs.CL · RLHF · preference learning · language model alignment · direct preference optimization

✦ Research context

What this agent contributes to the literature.

Problem solved

RLHF requires a costly multi-stage pipeline (reward modeling followed by PPO-based fine-tuning) that is often unstable and sensitive to hyperparameters. DPO collapses this into a single supervised fine-tuning step, so aligning a model to human preferences no longer requires RL infrastructure.

Novelty

DPO reparameterizes the RLHF reward model so the optimal policy can be extracted in closed form, reducing alignment fine-tuning to a single binary cross-entropy loss over preference pairs. This eliminates the need for an explicit reward model, PPO-based RL, and LM sampling during training.
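To make the "single binary cross-entropy loss over preference pairs" concrete, here is a minimal sketch in plain Python. It assumes the summed sequence log-probabilities of each chosen and rejected response have already been computed under both the trained policy and the frozen reference model. The function names, argument names, and the default `beta=0.1` are illustrative choices, not taken from this agent's `main.py`.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Mean DPO loss over a batch of preference pairs.

    Each argument is a list of sequence log-probabilities, one entry per pair.
    `beta` is the KL weight (hypothetical default).
    """
    losses = []
    for pc, pr, rc, rr in zip(policy_chosen_logp, policy_rejected_logp,
                              ref_chosen_logp, ref_rejected_logp):
        chosen_logratio = pc - rc      # log pi_theta(y_w|x) - log pi_ref(y_w|x)
        rejected_logratio = pr - rr    # log pi_theta(y_l|x) - log pi_ref(y_l|x)
        margin = beta * (chosen_logratio - rejected_logratio)
        # Binary cross-entropy with the label "chosen is preferred"
        losses.append(-log_sigmoid(margin))
    return sum(losses) / len(losses)
```

When the policy equals the reference model, the margin is zero and the loss is log 2; raising the policy's log-probability of the chosen response relative to the rejected one lowers it. In practice the same computation is usually done on batched tensors with automatic differentiation, and only the policy's log-probabilities receive gradients.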


Canvas contract: 1-in / 1-out · unpacked into preference_data, training_params legacy ports


Image digest

sha256:20b976b6d00d915e7e6dc0d57a03d5894a9895c734433cdf80cd920def0dcd3c

Invoke command

python main.py

Inputs

input:application/json

Outputs

output:application/json


Invoke

CPU compute only


input:application/json (optional)

Unified canvas input containing preference pairs and training parameters

Leave empty to run the paper's canonical scenario.
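This page does not show the input schema, so the following is only a guess at how a counterfactual payload might be assembled. The top-level keys `preference_data` and `training_params` come from the port names in the canvas contract. Every inner field (`prompt`, `chosen`, `rejected`, `beta`) and the helper `build_payload` are hypothetical and should be checked against the agent's actual sample data before use.

```python
import json


def build_payload(pairs, beta=0.1):
    """Serialize preference pairs and training parameters as one JSON input.

    `pairs` is a list of (prompt, chosen, rejected) string triples.
    All inner field names are hypothetical; only the two top-level keys
    are named on the record page.
    """
    payload = {
        "preference_data": [
            {"prompt": prompt, "chosen": chosen, "rejected": rejected}
            for prompt, chosen, rejected in pairs
        ],
        "training_params": {"beta": beta},
    }
    return json.dumps(payload, indent=2)


example = build_payload(
    [("Summarize: the cat sat on the mat.", "A cat sat on a mat.", "Dogs are great.")]
)
```

The resulting string in `example` could be pasted into the input field in place of the default scenario, assuming the agent accepts this shape.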
