Researchers at DeepMind have proposed a new predicted compute-optimal model called Chinchilla that uses the same compute budget as Gopher but with 70 billion parameters and 4 times more data.
Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. It uses substantially less computing for fine-tuning and inference, greatly facilitating downstream usage.
Chinchilla showed a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, a 7% improvement over Gopher.
Subscribe to our Newsletter
Join our editors every weekday evening as they steer you through the most significant news of the day, introduce you to fresh perspectives, and provide unexpected moments of joy
The dominant trend in large language model training has been to increase the model size, without increasing the number of training tokens. The largest dense transformer, MT-NLG 530B, is now over 3× larger than GPT-3’s 170 billion parameters.
DeepMind’s Chinchilla, as well as the majority of existing large models, have all been trained for a comparable number of tokens—around 300 billion. While the desire to train these mega-models has led to substantial engineering innovation, the researchers said the race to train larger and larger models is resulting in models that are substantially underperforming compared to what could be achieved with the same compute budget.