Publication record · 18.cifr/2022.hoffmann.chinchilla-scaling
We investigate the optimal model size and number of training tokens for a transformer language model under a given compute budget. We find that current large language models are significantly undertrained. For compute-optimal training, model size and the number of training tokens should be scaled equally. Chinchilla (70B parameters, 1.4T tokens) uniformly outperforms Gopher (280B), GPT-3 (175B), and others on downstream tasks.
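The "scaled equally" rule can be made concrete with a back-of-the-envelope sketch. The snippet below is illustrative only, not from the record: it assumes the common C ≈ 6ND training-FLOPs approximation, takes the equal-scaling rule (both N and D growing as C^0.5), and infers the roughly 20-tokens-per-parameter ratio from the Chinchilla point quoted above (70B, 1.4T).

```python
# Minimal sketch (not part of the record): split a FLOPs budget into a
# model size N and token count D, assuming C ~ 6*N*D and D ~ 20*N
# (the ~20 tokens/parameter ratio is inferred from 70B params, 1.4T tokens).

def compute_optimal_split(flops_budget: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) spending flops_budget with C = 6*N*D and D = r*N."""
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly the Chinchilla training budget: 6 * 70e9 * 1.4e12 ~ 5.9e23 FLOPs.
n, d = compute_optimal_split(5.9e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")  # ~7.0e10 params, ~1.4e12 tokens
```

Because both quantities scale as the square root of compute, a 100x larger budget implies a model and a dataset each roughly 10x larger, rather than putting almost all of the extra compute into parameters.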
The inference-cost-aware trade-off (training slightly suboptimally to obtain a smaller, cheaper-to-serve model) was not fully analyzed. The fitted coefficients may not extrapolate beyond the ~16B-parameter scale of the models used to fit them. Future work should validate the equal-scaling principle for multimodal and RLHF-trained models.
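To make that unanalyzed trade-off concrete, the sketch below compares lifetime FLOPs for the compute-optimal model against a hypothetical half-size model trained on the same budget, using the standard ~6ND training and ~2N-per-token inference approximations. The 10T served-tokens figure is an illustrative assumption, and the quality gap between the two models is precisely the part the record says was not analyzed.

```python
# Illustrative sketch (not from the record): training cost ~ 6*N*D FLOPs and
# per-token inference cost ~ 2*N FLOPs are standard approximations; the
# lifetime serving volume below is an assumed, made-up number.

def lifetime_flops(n_params: float, n_train_tokens: float, n_served_tokens: float):
    train = 6 * n_params * n_train_tokens
    serve = 2 * n_params * n_served_tokens
    return train, serve

served = 10e12  # assume 10T tokens served over the model's lifetime

# Compute-optimal point vs a hypothetical half-size model on the same budget.
for label, n, d in [("70B / 1.4T", 70e9, 1.4e12),
                    ("35B / 2.8T", 35e9, 2.8e12)]:
    train, serve = lifetime_flops(n, d, served)
    print(f"{label}: train {train:.2e} + serve {serve:.2e} = {train + serve:.2e} FLOPs")
```

Both configurations cost the same to train, but the half-size model halves the serving bill; whether the resulting quality drop is acceptable is the open question.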