Scaling Laws (2020)

The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude.
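
A power law appears as a straight line on log-log axes, so its exponent can be recovered with an ordinary least-squares fit on the logarithms. The following is a minimal sketch on synthetic data. The function name `fit_power_law` and the constants `A_TRUE` and `ALPHA_TRUE` are illustrative placeholders, not values taken from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-alpha) in log-log space.

    Returns the pair (a, alpha).
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # On log-log axes the slope is -alpha and the intercept is log(a).
    return math.exp(intercept), -slope


# Synthetic, noise-free data (assumed values, for illustration only).
A_TRUE = 10.0
ALPHA_TRUE = 0.08
sizes = [10.0 ** k for k in range(3, 11)]  # spans seven orders of magnitude
losses = [A_TRUE * n ** (-ALPHA_TRUE) for n in sizes]

a_fit, alpha_fit = fit_power_law(sizes, losses)
print(f"fitted a = {a_fit:.4f}, fitted alpha = {alpha_fit:.4f}")
```

With real measurements the points would scatter around the line, and the quality of the fit over the full range is what supports the power-law claim.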

Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data.

Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute used for training.

We observe no signs of deviation from these trends on the upper end, though performance must flatten out eventually before reaching zero loss.

Performance improves predictably as long as we scale up N (number of parameters) and D (dataset size) in tandem, but enters a regime of diminishing returns if either N or D is held fixed while the other increases.
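
The diminishing-returns behavior can be illustrated with a combined loss of the form L(N, D) = [(N_c / N)^(α_N / α_D) + D_c / D]^(α_D). This sketch assumes that functional form, and every constant below (`n_c`, `d_c`, `alpha_n`, `alpha_d`) is a placeholder chosen for illustration, not a fitted value.

```python
def loss_nd(n_params, n_tokens, n_c=1e14, d_c=1e13, alpha_n=0.08, alpha_d=0.1):
    """Assumed joint form: L(N, D) = [(N_c/N)**(alpha_N/alpha_D) + D_c/D]**alpha_D.

    All default constants are illustrative placeholders.
    """
    return ((n_c / n_params) ** (alpha_n / alpha_d) + d_c / n_tokens) ** alpha_d


model_sizes = [10.0 ** k for k in range(6, 13)]

# Case 1: D held fixed while N grows. The loss approaches a floor set by D.
FIXED_TOKENS = 1e9
losses_fixed = [loss_nd(n, FIXED_TOKENS) for n in model_sizes]
floor = (1e13 / FIXED_TOKENS) ** 0.1  # limit of loss_nd as N grows without bound

# Case 2: D grows in proportion to N (the factor 1e3 is an arbitrary choice).
losses_tandem = [loss_nd(n, 1e3 * n) for n in model_sizes]

for n, lf, lt in zip(model_sizes, losses_fixed, losses_tandem):
    print(f"N={n:.0e}  fixed-D loss={lf:.3f}  tandem loss={lt:.3f}")
```

In the fixed-D column each tenfold increase in N buys a smaller improvement than the last, while the tandem column keeps falling.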

The original WebText dataset was a web scrape of outbound links from Reddit through December 2017 which received at least 3 karma. The karma threshold served as a heuristic for whether people found the link interesting or useful. The text of the new links was extracted with the Newspaper3k Python library. In total, the dataset consists of 20.3M documents containing 96 GB of text and 1.62 × 10¹⁰ words (as defined by wc). We then apply the reversible tokenizer ... which yields 2.29 × 10¹⁰ tokens. We reserve 6.6 × 10⁸ of these tokens for use as a test set, and we also test on similarly prepared samples of Books Corpus [ZKZ+15], Common Crawl [Fou], English Wikipedia, and a collection of publicly-available Internet Books.

We found that results at convergence were largely independent of learning rate schedule.

The LSTMs perform as well as Transformers for tokens appearing early in the context, but cannot match the Transformer performance for later tokens.

Transformer performance depends very weakly on the shape parameters n_layer, n_heads, and d_ff when we hold the total non-embedding parameter count N fixed.
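
To make "holding N fixed while varying shape" concrete, the sketch below counts non-embedding parameters using the approximation N ≈ 2 · d_model · n_layer · (2 · d_attn + d_ff). That formula is an assumption of this example: it ignores biases and layer-norm parameters. The helper name `non_embedding_params` and the two shapes are hypothetical.

```python
def non_embedding_params(n_layer, d_model, d_attn=None, d_ff=None):
    """Approximate non-embedding parameter count of a Transformer.

    Assumed approximation (biases and layer norms ignored):
        N = 2 * d_model * n_layer * (2 * d_attn + d_ff)
    Defaults follow the common convention d_attn = d_model, d_ff = 4 * d_model,
    in which case N = 12 * n_layer * d_model**2.
    """
    if d_attn is None:
        d_attn = d_model
    if d_ff is None:
        d_ff = 4 * d_model
    return 2 * d_model * n_layer * (2 * d_attn + d_ff)


# Two hypothetical shapes with the same N: deep and narrow vs. shallow and wide.
deep_narrow = non_embedding_params(n_layer=24, d_model=512)
shallow_wide = non_embedding_params(n_layer=6, d_model=1024)

print(deep_narrow, shallow_wide, deep_narrow == shallow_wide)
```

Under this approximation n_heads does not enter the count at all, since splitting d_attn across more heads leaves the projection matrices the same size.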

Smaller models require more steps to train, while larger models require fewer.

Each value of the compute budget C_min has an associated optimal model size N. Optimal model size grows very rapidly with C_min, increasing by 5x for each 10x increase in compute. The number of data examples processed makes up the remainder of the increase, growing relatively modestly by only 2x.
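
The 5x and 2x figures can be turned into power-law exponents with a line of arithmetic. This sketch assumes that compute is proportional to model size times the amount of data processed, which is why the two factors multiply to 10. The function name `scale_allocation` is hypothetical.

```python
import math

# Growth factors per 10x increase in compute, as quoted above.
MODEL_GROWTH_PER_10X = 5.0
DATA_GROWTH_PER_10X = 2.0

# If N grows as C**p and data grows as C**q, a 10x step gives 10**p = 5 and 10**q = 2.
model_exponent = math.log10(MODEL_GROWTH_PER_10X)  # about 0.70
data_exponent = math.log10(DATA_GROWTH_PER_10X)    # about 0.30


def scale_allocation(compute_multiplier):
    """Return (model_multiplier, data_multiplier) for a given compute multiplier."""
    return (
        compute_multiplier ** model_exponent,
        compute_multiplier ** data_exponent,
    )


model_mult, data_mult = scale_allocation(100.0)
print(f"exponents: model {model_exponent:.3f}, data {data_exponent:.3f}")
print(f"100x compute: model x{model_mult:.1f}, data x{data_mult:.1f}")
```

The exponents sum to 1 because 5 × 2 = 10, so under the stated assumption the whole compute increase is accounted for by the growth in model size and data processed.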

Thus we conclude that as we scale up language modeling with an optimal allocation of computation, we should predominantly increase the model size N.

We used these relations to derive the compute scaling, magnitude of overfitting, early stopping step, and data requirements when training large language models. So our scaling relations go beyond mere observation to provide a predictive framework.

It will be interesting to test these relations on other domains, such as images, audio, and video models... At this point we do not know which of our results depend on the structure of natural language data, and which are universal.

The smooth improvements in language model loss may hide seemingly qualitative changes in capability.

Our results strongly suggest that larger models will continue to perform better, and will also be much more sample efficient than has been previously appreciated.