
Date: <2024-10-01 Tue>

Chinchilla Scaling Law

The Chinchilla model (70B parameters) was trained on 1.4T tokens, i.e. a 1:20 parameters:tokens ratio, and it outperformed other models trained under the same compute budget.

Compute-optimal training (i.e. the best accuracy under a fixed FLOPs budget) is achieved by increasing the number of training tokens proportionally with the model size.
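A minimal sketch of that rule of thumb, assuming the commonly cited C ≈ 6·N·D estimate of training FLOPs and the roughly 20-tokens-per-parameter ratio implied by the Chinchilla result (the function names here are mine, not from the paper):

```python
# Rough Chinchilla-style sizing helpers. The 6*N*D FLOPs estimate and the
# ~20 tokens-per-parameter ratio are approximations, not exact paper values.

TOKENS_PER_PARAM = 20  # ~1:20 parameters:tokens ratio

def compute_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal number of training tokens for a given model size."""
    return n_params * TOKENS_PER_PARAM

def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard C ~= 6 * N * D estimate of training compute."""
    return 6 * n_params * n_tokens

if __name__ == "__main__":
    n = 70e9                          # Chinchilla: 70B parameters
    d = compute_optimal_tokens(n)     # ~1.4T tokens
    print(f"optimal tokens: {d:.2e}")                      # 1.40e+12
    print(f"training FLOPs: {training_flops(n, d):.2e}")   # ~5.9e+23
```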

From the paper: Training Compute-Optimal Large Language Models

We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally.

Compute-optimal does not imply convergence, nor optimality in terms of model size.

https://x.com/karpathy/status/1781033433336262691

Regarding LLaMA 3 (8B parameters, 15T tokens), Andrej Karpathy says:

Scaling laws. Very notably, 15T is a very very large dataset to train with for a model as "small" as 8B parameters, and this is not normally done and is new and very welcome. The Chinchilla "compute optimal" point for an 8B model would be train it for ~200B tokens. (if you were only interested to get the most "bang-for-the-buck" w.r.t. model performance at that size). So this is training ~75X beyond that point, which is unusual but personally, I think extremely welcome. Because we all get a very capable model that is very small, easy to work with and inference. Meta mentions that even at this point, the model doesn't seem to be "converging" in a standard sense. In other words, the LLMs we work with all the time are significantly undertrained by a factor of maybe 100-1000X or more, nowhere near their point of convergence. Actually, I really hope people carry forward the trend and start training and releasing even more long-trained, even smaller models.
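A quick back-of-the-envelope check of the numbers in the tweet, using the ~200B-token figure Karpathy cites for an 8B model:

```python
# Back-of-the-envelope check of the numbers in the tweet above.
llama3_params = 8e9          # 8B parameters
llama3_tokens = 15e12        # 15T training tokens

chinchilla_optimal = 200e9   # ~200B tokens, the figure Karpathy cites for an 8B model
# (the raw 20-tokens-per-parameter heuristic gives 8e9 * 20 = 160B, same ballpark)

print(llama3_tokens / chinchilla_optimal)  # ~75x beyond the compute-optimal point
```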

