In 2020, OpenAI published a study titled ‘Scaling Laws for Neural Language Models’, which demonstrated that increasing model size improves performance. It found that larger models are far more sample-efficient, so compute-efficient training meant training very large models on a comparatively modest amount of data and stopping well before convergence. In the years that followed, the major tech companies raced to build ever-bigger large language models. The trend culminated in dense models like GPT-3, which has 175 billion parameters, LaMDA, which has 137 billion parameters, and Megatron-Turing NLG, which has 530 billion parameters.
Smaller models, more training tokens
To counter this viewpoint, DeepMind submitted a paper called ‘Training Compute-Optimal Large Language Models’ towards the end of March, which demonstrated that, rather than scaling up model size alone, the number of training tokens should increase as well. The paper notes that, under the earlier scaling laws, a tenfold increase in the computational budget called for increasing the size of the model by 5.5 times while scaling the number of training tokens by only 1.8 times. The study instead suggests that the size of the model and the number of training tokens should be scaled in equal proportions.
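The difference between the two scaling rules can be sketched with a few lines of arithmetic. The Python snippet below is an illustration, not code from either paper: it assumes training compute is roughly proportional to model size multiplied by training tokens, and the function name `split_compute_increase` and its `params_exponent` argument are made up for this example.

```python
import math


def split_compute_increase(compute_factor, params_exponent):
    """Split a compute increase between model size and training tokens.

    Assumes compute scales with (parameters x tokens), so the two
    multipliers must multiply back to compute_factor.
    """
    params_factor = compute_factor ** params_exponent
    tokens_factor = compute_factor / params_factor
    return params_factor, tokens_factor


# Earlier rule as quoted above: 10x compute -> 5.5x parameters.
# The exponent is back-derived from that figure (an assumption of this sketch).
earlier_exponent = math.log10(5.5)
earlier = split_compute_increase(10, earlier_exponent)

# DeepMind's suggestion: scale both in equal proportion (exponent 0.5).
proportional = split_compute_increase(10, 0.5)

print(f"earlier rule: params x{earlier[0]:.2f}, tokens x{earlier[1]:.2f}")
print(f"proportional: params x{proportional[0]:.2f}, tokens x{proportional[1]:.2f}")
```

Under the proportional rule, a tenfold budget is spent by growing both the model and the data by about 3.2 times, rather than putting most of the increase into parameters.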
This theory was tested on a predicted compute-optimal model, Chinchilla. The study compared Chinchilla, a 70-billion parameter model, with Gopher, a 280-billion parameter model. Despite its smaller size, Chinchilla was trained on four times more data and outperformed Gopher, reaching a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, about 7 per cent higher than Gopher’s score.
As a norm, large language models have kept the number of training tokens fixed at around 300 billion. Interestingly, while the compute cost of training Gopher and Chinchilla was the same, Chinchilla was trained on 1.4 trillion tokens.
Higher budget, different approach
DeepMind’s claim that large language models were being trained with a suboptimal use of compute was also verified independently by Google AI’s research. At the beginning of the month, Google AI’s research team announced a new model called PaLM, or the Pathways Language Model, a 540-billion parameter, decoder-only transformer. Google stated in its findings that PaLM performed very well at English NLP tasks like sentence completion, comprehension and natural language inference, as well as multilingual NLP tasks like translation. The blog stated that the vision for Pathways was a single AI system that could generalise efficiently across thousands of tasks.
Incidentally, PaLM was trained on 780 billion tokens, far fewer than Chinchilla, yet used around five times the compute budget that Chinchilla demanded. PaLM was trained using a combination of data and model parallelism. At the Pod level, the model was trained across two Cloud TPU v4 Pods. This setup achieved a hardware FLOPs utilisation of 57.8 per cent, which Google reported as the highest efficiency yet achieved for LLMs at this scale.
PaLM was fed English and multilingual datasets, including books, web documents, Wikipedia, casual conversations and GitHub code.
PaLM was tested on a set of NLP tasks alongside other large models like Chinchilla, GLaM, GPT-3, Megatron-Turing NLG and Gopher. Of the 29 tasks, which included sentence completion, question answering, reading comprehension and common-sense reasoning, PaLM outperformed all other models in 28. PaLM was also compared to other LLMs on the Beyond the Imitation Game Benchmark (BIG-bench), a suite of 150 new language modelling tasks.
While Chinchilla and PaLM were trained on different corpora, PaLM’s 540-billion parameter model performed well at a range of tasks, including coding, where it was on par with OpenAI’s fine-tuned Codex 12B despite being trained on 50 times less Python code. At reasoning, PaLM was able to solve 58 per cent of the problems in GSM8K, a benchmark dataset of tough school-level maths questions. This beat the previous best score of 55 per cent, set by GPT-3.
PaLM was also set against Chinchilla and Gopher across a subset of 58 BIG-bench tasks. Again, PaLM emerged on top. The study also found that PaLM’s performance as a “function of scale” follows log-linear behaviour similar to that of older models. This signalled that the gains in performance from scale had not yet reached a plateau.
DeepMind later acknowledged that although PaLM is not compute-optimal, it would beat Chinchilla if trained on DeepMind’s data. It also predicted that, given PaLM’s bigger compute budget, a 140-billion parameter model trained on 3 trillion tokens would give optimal performance and be more efficient for inference.
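These compute comparisons can be checked roughly with the common rule of thumb that training a dense transformer costs about 6 floating-point operations per parameter per token. The sketch below is a back-of-the-envelope illustration, not a calculation from either lab: the 6 x parameters x tokens approximation and the token counts used (300 billion for Gopher, 1.4 trillion for Chinchilla and 780 billion for PaLM, as reported in the respective papers) are assumptions of this example, and `training_flops` is a hypothetical helper.

```python
def training_flops(params, tokens):
    """Approximate training compute as 6 * parameters * tokens (rule of thumb)."""
    return 6 * params * tokens


# (parameters, training tokens); approximate figures, assumptions of this sketch.
models = {
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "PaLM": (540e9, 780e9),
    "140B / 3T (predicted optimum)": (140e9, 3e12),
}

flops = {name: training_flops(n, d) for name, (n, d) in models.items()}

for name, (n, d) in models.items():
    print(f"{name:30s} {flops[name]:.2e} FLOPs, {d / n:5.1f} tokens per parameter")

print(f"PaLM vs Chinchilla compute: {flops['PaLM'] / flops['Chinchilla']:.1f}x")
```

Under this approximation, Gopher and Chinchilla land in the same compute range, PaLM comes out at roughly four to five times Chinchilla’s budget, and the 140-billion parameter, 3-trillion token configuration costs almost exactly what PaLM did while using far more tokens per parameter.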