David vs. Goliath: Does Chinchilla fare well against Google AI’s PaLM?

DeepMind’s claim that large language models were being trained with a suboptimal use of compute was later verified independently by Google AI’s research.

In 2020, OpenAI published a study titled ‘Scaling Laws for Neural Language Models’, which demonstrated that increasing model size improved performance. It found that larger models were far more sample-efficient, so compute-efficient training meant training very large models on comparatively little data and stopping well before convergence. In the years that followed, the major tech companies raced to build ever-bigger language models. The trend culminated in dense models such as GPT-3, with 175 billion parameters, LaMDA, with 137 billion parameters, and Megatron-Turing NLG, with 530 billion parameters.

Smaller models, more training tokens

To counter this viewpoint, DeepMind published a paper titled ‘Training Compute-Optimal Large Language Models’ towards the end of March, which argued that instead of growing model size alone, the number of training tokens should increase as well. The paper notes that, in current practice, when the compute budget increases tenfold, model size is typically increased by 5.5 times while the number of training tokens is scaled by only 1.8 times. The study instead suggests that model size and the number of training tokens should be scaled in equal proportion.
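The contrast between the two allocation rules can be sketched numerically. Using the standard rough estimate that training compute scales as C ≈ 6·N·D for N parameters and D tokens (an approximation assumed here, not a figure from this article), the older practice spends a 10× budget mostly on parameters, while the Chinchilla rule splits it evenly, scaling both by √10. A minimal sketch with illustrative baseline numbers:

```python
import math

def train_flops(params, tokens):
    # Standard rough estimate of training compute: C ~= 6 * N * D
    return 6 * params * tokens

# Illustrative baseline: a 10B-parameter model trained on 200B tokens
n0, d0 = 10e9, 200e9

# Older practice per the article: 10x compute -> 5.5x params, 1.8x tokens
n_old, d_old = n0 * 5.5, d0 * 1.8

# Chinchilla-style: scale params and tokens equally, sqrt(10) each
s = math.sqrt(10)
n_chin, d_chin = n0 * s, d0 * s

# Both allocations consume (almost) exactly the 10x budget...
print(train_flops(n_old, d_old) / train_flops(n0, d0))    # ~9.9
print(train_flops(n_chin, d_chin) / train_flops(n0, d0))  # ~10.0

# ...but the Chinchilla allocation trains a smaller model on far more data
print(n_chin / n_old)  # ~0.57: model roughly 43% smaller
print(d_chin / d_old)  # ~1.76: roughly 76% more training tokens
```

Note that 5.5 × 1.8 ≈ 9.9, which is why the older rule also roughly respects the budget; the disagreement is purely about how the budget is split.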

This theory was tested with Chinchilla, a model predicted to be compute-optimal. The study compared the 70-billion parameter Chinchilla against the 280-billion parameter Gopher. Despite being a quarter of the size, Chinchilla was trained on four times more data and outperformed Gopher, reaching a state-of-the-art average accuracy of 67.5 per cent on the MMLU benchmark, about 7 percentage points higher than Gopher.


Source: DeepMind blog

Large language models have, as a norm, kept the number of training tokens fixed at around 300 billion. Interestingly, while Gopher and Chinchilla cost the same to train, Chinchilla was trained on 1.3 trillion tokens.

Source: DeepMind blog
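The claim that the two training runs cost about the same can be sanity-checked with the same C ≈ 6·N·D approximation (again an assumption of this sketch, not a number from the blog), plugging in the parameter and token counts quoted above:

```python
def train_flops(params, tokens):
    # Rough training-compute estimate: C ~= 6 * N * D
    return 6 * params * tokens

gopher = train_flops(280e9, 300e9)      # 280B params, ~300B tokens
chinchilla = train_flops(70e9, 1.3e12)  # 70B params, 1.3T tokens (as quoted)

print(f"Gopher:     {gopher:.2e} FLOPs")
print(f"Chinchilla: {chinchilla:.2e} FLOPs")
print(f"ratio: {chinchilla / gopher:.2f}")  # ~1.08: roughly the same budget
```

A 4× cut in parameters paid for a roughly 4× increase in tokens, leaving the total budget essentially unchanged.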

Higher budget, different approach

Google AI’s research independently corroborated DeepMind’s claim that large language models were being trained with a suboptimal use of compute. In early April, Google AI’s research team announced a new architecture called PaLM, or the Pathways Language Model, a 540-billion parameter, decoder-only transformer model. Google reported that PaLM performed strongly on English NLP tasks such as sentence completion, comprehension and natural language inference, as well as on multilingual tasks such as translation. The blog stated that the vision for Pathways was a single AI system able to generalise across thousands of tasks efficiently.

Incidentally, PaLM was trained on 768 billion tokens, far fewer than Chinchilla, but with roughly five times Chinchilla’s compute budget. PaLM was trained using a combination of data and model parallelism, spread across two Cloud TPU v4 Pods. The run achieved a hardware FLOPs utilisation of 57.8 per cent, the highest yet reported for LLMs at this scale.

Source: Google AI blog
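To put the 57.8 per cent figure in absolute terms: hardware FLOPs utilisation is the fraction of the cluster’s theoretical peak throughput that training actually sustains. A sketch of the arithmetic, assuming published Google hardware figures not stated in this article (6,144 TPU v4 chips across the two pods, each with a peak of roughly 275 TFLOP/s in bfloat16):

```python
# Assumed hardware figures (not from this article): 6,144 TPU v4 chips,
# each with a ~275 TFLOP/s bf16 peak.
num_chips = 6144
peak_per_chip = 275e12   # FLOP/s per chip
hfu = 0.578              # quoted hardware FLOPs utilisation

cluster_peak = num_chips * peak_per_chip
sustained = hfu * cluster_peak

print(f"cluster peak: {cluster_peak / 1e18:.2f} EFLOP/s")  # ~1.69 EFLOP/s
print(f"sustained:    {sustained / 1e18:.2f} EFLOP/s")     # ~0.98 EFLOP/s
```

In other words, under these assumptions the run sustained close to an exaFLOP per second of useful work.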

PaLM was fed English and multilingual datasets, including books, web documents, Wikipedia, casual conversations and GitHub code. 

Conclusion

PaLM was tested on a set of NLP tasks alongside older large models such as Chinchilla, GLaM, GPT-3, Megatron-Turing NLG and Gopher. Across the 29 tasks, which included sentence completion, question answering, reading comprehension and common-sense reasoning, PaLM outperformed all the other models on 28. PaLM was also compared to other LLMs on a range of 150 new language modelling tasks known as the Beyond the Imitation Game Benchmark (BIG-bench).

Although Chinchilla and PaLM were trained on different corpora, PaLM’s 540-billion parameter model performed well across a range of tasks, including coding, where it was on par with OpenAI’s fine-tuned 12-billion parameter Codex despite being trained on 50 times less Python code. On reasoning, PaLM solved 58 per cent of the problems in GSM8K, a benchmark of challenging grade-school maths questions, beating the previous best score of 55 per cent set by GPT-3.

PaLM was set against Chinchilla and Gopher on a subset of 58 of these BIG-bench tasks. Again, PaLM emerged on top. The study also found that PaLM’s performance as a “function of scale” follows a log-linear behaviour similar to older models, signalling that the performance gains from scale have not yet plateaued.

Source: Google AI blog
DeepMind later acknowledged that although PaLM is not compute-optimal, its far larger compute budget would let it outperform Chinchilla. DeepMind also predicted that, for PaLM’s budget, a 140-billion parameter model trained on 3 trillion tokens would deliver optimal performance while being more efficient at inference.
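That prediction can be checked against PaLM’s budget with the same C ≈ 6·N·D approximation (the approximation is an assumption of this sketch; the parameter and token counts are the ones quoted above):

```python
def train_flops(params, tokens):
    # Rough training-compute estimate: C ~= 6 * N * D
    return 6 * params * tokens

palm = train_flops(540e9, 768e9)    # PaLM as actually trained
optimal = train_flops(140e9, 3e12)  # DeepMind's compute-optimal alternative

print(f"PaLM:    {palm:.2e} FLOPs")
print(f"optimal: {optimal:.2e} FLOPs")
print(f"ratio: {optimal / palm:.2f}")  # ~1.01: same budget, spent differently
```

The two configurations consume nearly identical compute; the compute-optimal one simply trades almost 4× of PaLM’s parameters for roughly 4× as many training tokens, mirroring the Chinchilla-versus-Gopher result.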

Poulomi Chatterjee
Poulomi is a Technology Journalist with Analytics India Magazine. Her fascination with tech and eagerness to dive into new areas led her to the dynamic world of AI and data analytics.
