Language modelling involves the use of statistical and probabilistic techniques to determine the probability of a given sequence of words in a sentence. To make word predictions, language models analyse preceding text data. Language modelling is usually used in applications such as machine translations and question-answer tasks. Many researchers and developers working on building robust and efficient language models posit that larger models, trained on a higher number of parameters, produce better outcomes. In this article, we compare three massive language models to find out if the theory holds.
Microsoft introduced Turing NLG in early 2020. At that time, it held the distinction of being the largest model ever published, with 17 billion parameters. A Transformer-based generative language model, Turing NLG or T-NLG is part of the Turing project of Microsoft, announced in 2020.
T-NLG can generate words to complete open-ended textual tasks and unfinished sentences. Microsoft has claimed the model can generate direct answers to questions and summarise documents. The team behind T-NLG believes that the bigger the model, the better it performs with fewer training examples. It is also more efficient to train a large centralised multi-task model rather than a new model for every task individually.
T-NLG is trained on the same type of data as NVIDIA’s Megatron-LM and has a maximum learning rate of 1.5×10^-4. Microsoft has used DeepSpeed, trained on 256 NVIDIA GPUs for more efficient training of large models with fewer GPUs.
In July last year, OpenAI released GPT-3–an autoregressive language model trained on public datasets with 500 billion tokens and 175 billion parameters– at least ten times bigger than previous non-sparse language models.To put things into perspective, its predecessor GPT-2 was trained on just 1.5 billion parameters.
GPT-3 is applied without any gradient updates or fine-tuning. It achieves strong performance on many NLP datasets and can perform tasks such as translation, question-answer, reasoning, and 3-digit arithmetic operations.
OpenAI’s language model achieved promising results in the zero-shot and one-shot settings, and occasionally surpassed state-of-the-art models in the few-shot setting.
GPT-3 has a lot of diverse applications, including:
- The Guardian published an entire article written using GPT-3 titled “A robot wrote this entire article. Are you scared yet, human?” The footnote said the model was given specific instructions on word count, language choice, and a short prompt.
- A short film of approximately 4 minutes–Solicitors was written by GPT-3.
- A bot powered by GPT-3 was found to be interacting with people in a Reddit thread.
The industry’s reaction towards GPT-3 has been mixed. The language model has courted controversy over inherent biases, tendency to go rogue when left to its own devices, and its overhyped capabilities.
Wu Dao 2.0
Wu Dao 2.0 is the latest offering from the China government-backed Beijing Academy of Artificial Intelligence (BAAI). It is the latest and the largest language model till date with 1.75 trillion parameters. It has surpassed previous models such as GPT-3, Google’s Switch Transformer in size. Unlike GPT-3 , Wu Dao 2.0 covers both Chinese and English with skills acquired by studying 4.9 terabytes of texts and images, including 1.2 terabytes of Chinese and English texts.
It can perform tasks such as simulating conversational speech, writing poetry, understanding pictures, and even generating recipes. It can also predict the 3D structures of proteins like DeepMind’s AlphaFold. China’s first virtual student Hua Zhibing was built on Wu Dao 2.0.
Wu Dao 2.0 was trained with FastMoE, a Fast Mixture-of-Expert (training system). FastMoE is a PyTorch-based open source system akin to Google’s Mixture of Experts. It offers a hierarchical interface for flexible model design and easy adoption to applications such as Transformer-XL and Megatron-LM.
Are bigger models better?
The size of the language models are increasing. Bigger models are assumed to be better at generalising and taking us a step closer towards artificial general intelligence.
Former Google AI researcher Timnit Gebru detailed the associated risks of large language models in her controversial paper “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?”. The paper argued although these models were extraordinarily good and could produce meaningful results, they carry risks such as huge carbon footprints.
Echoing similar sentiments, Facebook’s Yann LeCun said, “It’s entertaining, and perhaps mildly useful as a creative help. But trying to build intelligent machines by scaling up language models is like building high-altitude airplanes to go to the moon. You might beat altitude records, but going to the moon will require a completely different approach.”
All the three discussed language models have been introduced within a span of just one and a half years. The researcher communities around the world are gearing up to develop the next ‘biggest’ language model to achieve unparalleled efficiency at task execution and getting close to the AGI holy grail. However, the lingering question here is whether this is the right way to achieve AGI, especially when in the face of risks including bias, discrimination, and environmental costs.