Publication record · 18.cifr/2022.hoffmann.chinchilla-scaling
We investigate the optimal model size and number of training tokens for a transformer language model under a given compute budget. We find that current large language models are significantly undertrained. For compute-optimal training, model size and the number of training tokens should be scaled equally. Chinchilla (70B parameters, 1.4T tokens) uniformly outperforms Gopher (280B), GPT-3 (175B), and others on downstream tasks.
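The "scaled equally" rule can be made concrete with a back-of-the-envelope sketch. The snippet below is illustrative only, not from the record: it assumes the common C ≈ 6ND training-FLOPs approximation, takes the equal-scaling rule (both N and D growing as C^0.5), and infers the roughly 20-tokens-per-parameter ratio from the Chinchilla point quoted above (70B, 1.4T).

```python
# Minimal sketch (not part of the record): split a FLOPs budget into a
# model size N and token count D, assuming C ~ 6*N*D and D ~ 20*N
# (the ~20 tokens/parameter ratio is inferred from 70B params, 1.4T tokens).

def compute_optimal_split(flops_budget: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) spending flops_budget with C = 6*N*D and D = r*N."""
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly the Chinchilla training budget: 6 * 70e9 * 1.4e12 ~ 5.9e23 FLOPs.
n, d = compute_optimal_split(5.9e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")  # ~7.0e10 params, ~1.4e12 tokens
```

Because both quantities scale as the square root of compute, a 100x larger budget implies a model and a dataset each roughly 10x larger, rather than putting almost all of the extra compute into parameters.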
The inference-cost-aware trade-off (training slightly suboptimally to obtain a smaller, cheaper-to-serve model) was not fully analyzed. The fitted coefficients may not extrapolate beyond the ~16B-parameter scale of the models used to fit them. Future work should validate the equal-scaling principle for multimodal and RLHF-trained models.
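To make that unanalyzed trade-off concrete, the sketch below compares lifetime FLOPs for the compute-optimal model against a hypothetical half-size model trained on the same budget, using the standard ~6ND training and ~2N-per-token inference approximations. The 10T served-tokens figure is an illustrative assumption, and the quality gap between the two models is precisely the part the record says was not analyzed.

```python
# Illustrative sketch (not from the record): training cost ~ 6*N*D FLOPs and
# per-token inference cost ~ 2*N FLOPs are standard approximations; the
# lifetime serving volume below is an assumed, made-up number.

def lifetime_flops(n_params: float, n_train_tokens: float, n_served_tokens: float):
    train = 6 * n_params * n_train_tokens
    serve = 2 * n_params * n_served_tokens
    return train, serve

served = 10e12  # assume 10T tokens served over the model's lifetime

# Compute-optimal point vs a hypothetical half-size model on the same budget.
for label, n, d in [("70B / 1.4T", 70e9, 1.4e12),
                    ("35B / 2.8T", 35e9, 2.8e12)]:
    train, serve = lifetime_flops(n, d, served)
    print(f"{label}: train {train:.2e} + serve {serve:.2e} = {train + serve:.2e} FLOPs")
```

Both configurations cost the same to train, but the half-size model halves the serving bill; whether the resulting quality drop is acceptable is the open question.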