Chinchilla Scaling Laws

2026-05-30 · Source
We find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled.
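The equal-scaling rule can be illustrated with a short sketch. It assumes training compute is proportional to the product of parameters and tokens, a common rule of thumb that is not stated in the excerpt above. Under that assumption, multiplying the compute budget by k means growing both model size and token count by the square root of k. The function name is hypothetical.

```python
import math


def scale_equally(n_params: float, n_tokens: float, compute_multiplier: float):
    """Grow parameters and tokens by the same factor for a larger budget.

    Assumes compute is proportional to n_params * n_tokens (a rule of
    thumb, not taken from the quoted text), so each quantity grows by
    sqrt(compute_multiplier).
    """
    growth = math.sqrt(compute_multiplier)
    return n_params * growth, n_tokens * growth


# Quadrupling compute doubles both model size and training tokens.
new_params, new_tokens = scale_equally(70e9, 1.4e12, compute_multiplier=4.0)
print(f"{new_params:.3g} parameters, {new_tokens:.3g} tokens")
```

Doubling model size alone would only double compute under this assumption. Matching it with a doubling of tokens is what takes the budget to four times its starting value.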
Not only does Chinchilla outperform its much larger counterpart, Gopher, but its reduced model size reduces inference cost considerably and greatly facilitates downstream uses on smaller hardware.
Model | Size (# Parameters) | Training Tokens
LaMDA (Thoppilan et al., 2022) | 137 Billion | 168 Billion
GPT-3 (Brown et al., 2020) | 175 Billion | 300 Billion
Jurassic (Lieber et al., 2021) | 178 Billion | 300 Billion
Gopher (Rae et al., 2021) | 280 Billion | 300 Billion
MT-NLG 530B (Smith et al., 2022) | 530 Billion | 270 Billion
Chinchilla | 70 Billion | 1.4 Trillion
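One way to read the table is as a tokens-per-parameter ratio. The sketch below computes that ratio from the figures listed above. The ratio is a derived quantity for illustration, not a column of the original table, and the dictionary layout is an assumption.

```python
# (parameters, training tokens) copied from the table above.
MODELS = {
    "LaMDA": (137e9, 168e9),
    "GPT-3": (175e9, 300e9),
    "Jurassic": (178e9, 300e9),
    "Gopher": (280e9, 300e9),
    "MT-NLG 530B": (530e9, 270e9),
    "Chinchilla": (70e9, 1.4e12),
}


def tokens_per_parameter(n_params: float, n_tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


ratios = {name: tokens_per_parameter(*sizes) for name, sizes in MODELS.items()}

# Print from the most data-heavy model to the least.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:12s} {ratio:6.2f} tokens per parameter")
```

Chinchilla comes out at about 20 tokens per parameter, while every other model in the table stays below 2.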

MassiveText data makeup

Source | Disk Size | Documents | Sampling proportion | Epochs in 1.4T tokens
MassiveWeb | 1.9 TB | 604M | 45% (48%) | 1.24
Books | 2.1 TB | 4M | 30% (27%) | 0.75
C4 | 0.75 TB | 361M | 10% (10%) | 0.77
News | 2.7 TB | 1.1B | 10% (10%) | 0.21
GitHub | 3.1 TB | 142M | 4% (3%) | 0.13
Wikipedia | 0.001 TB | 6M | 1% (2%) | 3.40
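The sampling proportion and epoch columns together imply a rough token count for each subset: the tokens drawn from a subset are its proportion of the 1.4T budget, and dividing by the epoch count gives an estimate of its unique tokens. The sketch below does this arithmetic. The resulting sizes are back-of-the-envelope estimates derived from the table, not figures reported in the source, and they use the first (non-parenthesized) proportion.

```python
TOTAL_TOKENS = 1.4e12  # training budget named in the table header

# (sampling proportion, epochs in 1.4T tokens) copied from the table above.
SUBSETS = {
    "MassiveWeb": (0.45, 1.24),
    "Books": (0.30, 0.75),
    "C4": (0.10, 0.77),
    "News": (0.10, 0.21),
    "GitHub": (0.04, 0.13),
    "Wikipedia": (0.01, 3.40),
}


def tokens_seen(proportion: float, total: float = TOTAL_TOKENS) -> float:
    """Tokens drawn from a subset over the whole training run."""
    return proportion * total


def implied_unique_tokens(proportion: float, epochs: float,
                          total: float = TOTAL_TOKENS) -> float:
    """Estimated subset size in tokens: tokens seen divided by epochs."""
    return tokens_seen(proportion, total) / epochs


# Subsets with more than one epoch are repeated during training.
repeated = [name for name, (_, epochs) in SUBSETS.items() if epochs > 1.0]

for name, (proportion, epochs) in SUBSETS.items():
    size = implied_unique_tokens(proportion, epochs)
    print(f"{name:10s} seen {tokens_seen(proportion):.3g}, implied size {size:.3g}")
```

Only MassiveWeb and Wikipedia exceed one epoch, so they are the only subsets repeated within the 1.4T-token run.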
Both Chinchilla and Gopher have been trained for the same number of FLOPs but differ in the size of the model and the number of training tokens.
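The matched-compute claim can be sanity-checked with the widely used approximation that training costs about 6 FLOPs per parameter per token. That approximation is an assumption brought in for illustration. It is not stated in the excerpt and it ignores architecture details, so the two budgets should only be expected to agree roughly.

```python
def approx_train_flops(n_params: float, n_tokens: float) -> float:
    """Rough training cost via C ~ 6 * N * D (rule of thumb, an assumption)."""
    return 6.0 * n_params * n_tokens


# Parameter and token counts as listed for the two models.
gopher_flops = approx_train_flops(280e9, 300e9)
chinchilla_flops = approx_train_flops(70e9, 1.4e12)

print(f"Gopher     ~ {gopher_flops:.2e} FLOPs")
print(f"Chinchilla ~ {chinchilla_flops:.2e} FLOPs")
print(f"ratio      ~ {chinchilla_flops / gopher_flops:.2f}")
```

Under this estimate the two budgets land within about 20% of each other, even though the models differ by 4× in size and by more than 4× in training tokens.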
While pre-training a large language model has a considerable compute cost, downstream finetuning and inference also make up substantial compute usage. Due to being 4× smaller than Gopher, both the memory footprint and inference cost of Chinchilla are also smaller.
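A minimal sketch of what the 4× size difference means in practice. It assumes weights are stored at 2 bytes per parameter (for example bfloat16) and that a forward pass costs roughly 2 FLOPs per parameter per generated token. Both are common rules of thumb and neither is taken from the source. Activations, the KV cache, and other overheads are ignored.

```python
BYTES_PER_PARAM = 2  # assumption: 16-bit weights


def weight_memory_gb(n_params: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Memory needed to hold the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def inference_flops_per_token(n_params: float) -> float:
    """Rough forward-pass cost per token: ~2 FLOPs per parameter (assumption)."""
    return 2.0 * n_params


gopher_params, chinchilla_params = 280e9, 70e9

print(f"Gopher weights:     {weight_memory_gb(gopher_params):.0f} GB")
print(f"Chinchilla weights: {weight_memory_gb(chinchilla_params):.0f} GB")
print(
    "Per-token inference cost ratio: "
    f"{inference_flops_per_token(gopher_params) / inference_flops_per_token(chinchilla_params):.0f}x"
)
```

Under these assumptions both quantities scale linearly with parameter count, so the 4× reduction in size carries over directly to weight memory and per-token inference compute.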
Speculatively, we expect that scaling to larger and larger datasets is only beneficial when the data is high-quality. This calls for responsibly collecting larger datasets with a high focus on dataset quality.
While we have applied our methodology towards the training of auto-regressive language models, we expect that there is a similar trade-off between model size and the amount of data in other modalities.

โ† All readings