Language models have been the talk of the AI town for the past couple of years. In 2003, Bengio et al. proposed the first feed-forward neural network language model; Google’s introduction of the Transformer in 2017 then changed the playing field completely. While Google’s BERT, one of the first large language models, had 110 million parameters in its base version, today’s large language models from big tech companies run to more than a trillion parameters. Analytics India Magazine has listed the big tech companies and their biggest language models.
OpenAI: GPT-3
Released in May 2020 by OpenAI, GPT-3 remains among the most significant AI language models ever created. The Generative Pre-trained Transformer can generate human-like text on demand. The third version, GPT-3, was trained on 570 GB of text crawled from the internet, including Wikipedia. GPT-3 is popularly known for its ability to generate text from limited context, in the form of essays, tweets, memos, translations, and even computer code. It has 175 billion parameters, making it one of the largest language models to date.
OpenAI: DALL·E
In 2021, OpenAI released DALL·E, a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text-image pairs. OpenAI describes DALL·E as a transformer language model that receives both the text and the image as a single stream of data containing up to 1280 tokens. It added that DALL·E could render an image from scratch and alter aspects of an existing image using text prompts.
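To make the "single stream" idea concrete, here is a minimal sketch. It assumes the split OpenAI has described publicly for DALL·E (256 text tokens followed by a 32×32 grid of 1024 image tokens); the function name, the token ids and the single padding id are hypothetical simplifications, not OpenAI's code.

```python
# Sketch only: token ids and pad_id are made up for illustration.
TEXT_LEN = 256           # text positions in the stream (assumed split)
IMAGE_LEN = 32 * 32      # image positions: a 32x32 grid of image tokens
MAX_STREAM = TEXT_LEN + IMAGE_LEN  # 1280, the figure quoted in the article


def build_stream(text_tokens, image_tokens, pad_id=0):
    """Concatenate text and image tokens into one sequence.

    Text is truncated or padded to TEXT_LEN so that image tokens
    always start at the same position in the stream.
    """
    if len(image_tokens) > IMAGE_LEN:
        raise ValueError("too many image tokens")
    text = list(text_tokens[:TEXT_LEN])
    text += [pad_id] * (TEXT_LEN - len(text))
    return text + list(image_tokens)


stream = build_stream([5, 6, 7], list(range(100, 100 + IMAGE_LEN)))
print(len(stream))
```

Because the model sees one sequence, generating an image amounts to continuing the stream after the text positions, one image token at a time.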
Google: Switch Transformer
In 2021, Google researchers introduced Switch Transformer, a language model whose designs are based on the T5-Base and T5-Large models. The researchers found that the version with 1.6 trillion parameters performed better than the much smaller T5-XXL model, which has 11 billion parameters. It was also claimed to be the largest of its kind at the time. Switch Transformer simplifies the mixture-of-experts (MoE) routing algorithm, sending each token to a single expert, which yields models with reduced communication and computational costs.
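The routing idea can be sketched in a few lines. In the Switch Transformer design, a small router scores every expert for each token and the token is sent to the single highest-scoring expert. The code below is a toy illustration in plain Python under that assumption; the names `route_top1` and `switch_layer`, the random weights and the toy experts are all hypothetical, and real implementations add details omitted here, such as expert capacity limits and a load-balancing loss.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(token, router_weights):
    """Return (expert_index, gate_probability) for one token vector."""
    # One logit per expert: dot product of the expert's router row and the token.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


def switch_layer(tokens, router_weights, experts):
    """Send each token to exactly one expert and scale the output by its gate."""
    outputs = []
    for token in tokens:
        idx, gate = route_top1(token, router_weights)
        expert_out = experts[idx](token)  # only this expert runs for this token
        outputs.append([gate * v for v in expert_out])
    return outputs


# Toy setup: 4 experts, 3-dimensional tokens (all values are made up).
random.seed(0)
NUM_EXPERTS, DIM = 4, 3
router_weights = [[random.uniform(-1, 1) for _ in range(DIM)]
                  for _ in range(NUM_EXPERTS)]
# Each toy "expert" just scales its input by a different constant.
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]
tokens = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(5)]
outputs = switch_layer(tokens, router_weights, experts)
```

Since only one expert runs per token, the compute per token stays roughly constant as more experts (and therefore more parameters) are added, which is how such a model can reach a very large parameter count.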
Google: GLaM
Google’s Generalist Language Model (GLaM) is a trillion-weight model that uses sparsity. Its full version has 1.2T total parameters, with 64 experts per mixture-of-experts (MoE) layer and 32 MoE layers in total, yet it activates only a subnetwork of 97B parameters (8% of 1.2T) per token prediction during inference. As a result, GLaM has improved learning efficiency across 29 public NLP benchmarks in seven categories, including language completion, open-domain question answering, and inference tasks.
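The sparsity figures quoted for GLaM can be checked with simple arithmetic. The snippet below uses only the numbers given in the text; the variable names are illustrative.

```python
# Figures as quoted in the article.
TOTAL_PARAMS = 1.2e12      # 1.2T total parameters
ACTIVE_PARAMS = 97e9       # 97B parameters activated per token
EXPERTS_PER_LAYER = 64
MOE_LAYERS = 32

# Share of the model that is actually used for one token prediction.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Total number of expert networks across all MoE layers.
total_experts = EXPERTS_PER_LAYER * MOE_LAYERS

print(f"active share per token: {active_fraction:.1%}")  # 8.1%
print(f"experts in the model:   {total_experts}")        # 2048
```

The activated 97B also covers the parts of the network shared by all tokens, so these numbers alone do not give the size of an individual expert.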
Microsoft: Turing NLG
Microsoft’s Turing NLG, with 17 billion parameters, was one of the largest models of 2020. The transformer-based model can complete open-ended textual tasks and unfinished sentences by generating words. It can also answer questions directly and summarise documents.
Beijing Academy of Artificial Intelligence (BAAI): Wu Dao 2.0
Wu Dao 2.0, built by the Chinese government-backed Beijing Academy of Artificial Intelligence (BAAI), was billed at launch as the largest language model to date. It has 1.75 trillion parameters, surpassing both GPT-3 (175 billion) and Google’s Switch Transformer (1.6 trillion). Wu Dao 2.0 covers English and Chinese and was trained on 4.9 terabytes of text and images in both languages. The model’s abilities include simulating conversational speech, writing poetry, understanding pictures, and generating recipes.
AI2: Macaw
AI2’s Macaw is a question-answering (QA) model based on a multi-angle approach: it can take different combinations of inputs and produce different combinations of outputs. With 11 billion parameters, the model can tackle various question types, including general knowledge, meta reasoning, hypothetical, and story understanding. Despite having far fewer parameters, AI2 claims, Macaw outperformed GPT-3 by over 10% on a suite of 300 challenge questions.
DeepMind: Gopher
DeepMind introduced Gopher, its competitor to GPT-3, a 280-billion-parameter transformer language model. The team claims that Gopher almost halves the accuracy gap between GPT-3 and human expert performance and exceeds forecaster expectations. Furthermore, Gopher improves on the then state-of-the-art language models on roughly 81% of the tasks for which comparable results exist.
AI21: Jurassic-1
AI21’s Jurassic-1 is claimed to be ‘the largest and most sophisticated language model ever released for general use by developers.’ With 178 billion parameters, it is slightly bigger than GPT-3, and its vocabulary of 250,000 lexical items is claimed to be about 5x larger than that of other language models. The largest version, Jurassic-1 Jumbo, was trained on 300 billion tokens from English-language websites.
Huawei: PanGu Alpha
Designed by Chinese company Huawei, PanGu Alpha is a 750-gigabyte model containing 200 billion parameters. The company has touted it as China’s equivalent of GPT-3 since it can deal with tasks in English and Chinese. It was trained on 1.1 terabytes of Chinese-language ebooks, encyclopedias, news, social media posts, and websites and is claimed to achieve “superior” performance in Chinese-language tasks. Its capabilities include summarising text, answering questions and generating dialogue.
Microsoft + NVIDIA: Megatron-Turing NLG 530B
Microsoft and NVIDIA have collaborated to train one of the largest monolithic transformer-based language models, Megatron-Turing NLG (MT-NLG), with 530 billion parameters. The companies claim state-of-the-art accuracy on a range of natural language processing (NLP) tasks, with the model adapted to downstream tasks via zero-shot, few-shot, and fine-tuning techniques. In addition, it has 3x the number of parameters of the largest existing models of its type.
Baidu: ERNIE 3.0 Titan
Built by Baidu and Peng Cheng Laboratory, a Shenzhen-based scientific research institution, ERNIE 3.0 Titan is a pre-trained language model with 260 billion parameters. The model was trained on a large volume of unstructured data and a huge knowledge graph, allowing it to excel at natural language understanding and generation. Baidu claims it is the world’s first knowledge-enhanced multi-hundred-billion-parameter model and the largest Chinese singleton model. According to Baidu, the model obtained state-of-the-art results on more than 60 natural language processing tasks and generalised across various downstream tasks given only a limited amount of labelled data.
LG: Exaone
Introduced by LG, Exaone has 300 billion parameters. Exaone, short for “expert AI for everyone”, is designed to process data efficiently and with advanced language skills. LG AI Research has also trained the language model to curate, ingest and interpret massive datasets. In addition, it is said to offer more advanced natural language processing, aimed at “human-like” language performance. Unusually, Exaone has been trained to work in both Korean and English, which gives it the potential for wider adoption globally.