Publication record · 18.cifr/2022.ouyang.instructgpt-rlhf

Training language models to follow instructions with human feedback

v1.0.0

Long Ouyang (OpenAI), Jeff Wu (OpenAI), Xu Jiang (OpenAI), Diogo Almeida (OpenAI), Carroll L. Wainwright (OpenAI), Pamela Mishkin (OpenAI), Ryan Lowe (OpenAI)

RAI · 18.cifr/2022.ouyang.instructgpt-rlhf

NeurIPS 2022 · 2022 · doi:10.48550/arXiv.2203.02155

Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters.

RLHF · instruction following · language model alignment · reinforcement learning from human feedback · InstructGPT

✦ Research context

What this agent contributes to the literature.

Problem solved

Large language models trained on next-token prediction generate outputs that are misaligned with user intent, because predicting the next token is a different objective from following a user's instructions helpfully and safely. Before this paper, RLHF had not been demonstrated at GPT-3 scale, nor shown to make a small model preferable to one 100x larger. This paper provides a scalable recipe for human-preference-based alignment.

Novelty

InstructGPT demonstrates that, in human evaluations, outputs from a 1.3B RLHF-tuned model are preferred over outputs from the 175B base GPT-3, showing that alignment gains do not come from scale alone. Applying the three-stage pipeline (supervised fine-tuning, reward model training, PPO) at GPT-3 scale was novel, as was the empirical finding that the 'alignment tax', meaning regression on public NLP benchmarks, can be kept minimal.
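As a rough illustration of the reward-model stage in that pipeline: the paper trains the reward model on labeler rankings with a pairwise log-sigmoid loss, averaged over every pair of outputs ranked for the same prompt. The sketch below computes that loss for one prompt from plain Python floats. The function names and the use of scalar scores in place of a neural network are assumptions made for illustration, not the authors' code.

```python
import math
from itertools import combinations


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_ranking_loss(rewards_best_first):
    """Mean of -log(sigmoid(r_w - r_l)) over all pairs from one ranking.

    rewards_best_first: scalar scores for the K outputs of a single prompt,
    ordered from most preferred to least preferred by the labeler.
    (Hypothetical helper; a real reward model would produce these scores.)
    """
    # combinations() keeps input order, so the first item of each pair
    # is always the output the labeler preferred.
    pairs = list(combinations(rewards_best_first, 2))
    if not pairs:
        raise ValueError("need at least two ranked outputs")
    total = 0.0
    for r_w, r_l in pairs:
        # -log(sigmoid(d)) equals softplus(-d)
        total += softplus(-(r_w - r_l))
    return total / len(pairs)


if __name__ == "__main__":
    agree = pairwise_ranking_loss([2.0, 1.0, 0.0])     # scores match the ranking
    disagree = pairwise_ranking_loss([0.0, 1.0, 2.0])  # scores invert the ranking
    print(f"agree={agree:.4f} disagree={disagree:.4f}")
```

The loss is low when the scores order the outputs the same way the labeler did and high when they invert that order, which is the signal the PPO stage later optimizes against.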


Canvas contract · 1-in / 1-out · unpacked into demonstrations, comparisons, eval_prompts legacy ports


Image digest

sha256:dfca07c8cdac58d5ce1e96f92e57b35e9ec81209c527d18d234869fa33ee6ceb

Invoke command

python main.py

Inputs

input:application/json

Outputs

output:application/json


Invoke

CPU compute only
