2021 has been a transformational year for large language models, and the pace is only picking up. A day after innovation leader DeepMind came out with Gopher, a 280-billion-parameter transformer language model, tech mammoth Google introduced the Generalist Language Model (GLaM) – a trillion-weight model that uses sparsity. The full version of GLaM has 1.2T total parameters across 64 experts per mixture-of-experts (MoE) layer, with 32 MoE layers in total, but activates only a subnetwork of 97B parameters (8% of 1.2T) per token prediction during inference.
After OpenAI released the path-breaking GPT-3 autoregressive language model with 175 billion machine learning parameters in 2020, tech giants have stepped up and released large language models of their own to keep up with the competition. AI21 Labs released Jurassic-1, which has 178 billion parameters. Microsoft and NVIDIA went a step further and introduced the Megatron-Turing Natural Language Generation (MT-NLG) model with an astounding 530 billion parameters. And earlier this year, Google released Switch Transformers, a technique for training language models with over a trillion parameters.
Now that Gopher and GLaM have been introduced back to back, comparisons between the two are bound to happen. Let us look at how they fare against each other.
GLaM serves efficiently in terms of computation and energy use
Training and serving large language models can be computationally intensive. Google says that, thanks to sparsity, GLaM can be trained and served efficiently in terms of computation and energy use, while achieving competitive performance on multiple few-shot learning tasks. GLaM is a mixture-of-experts (MoE) model, with different submodels (or experts) specialized for different inputs.
Trained on MassiveText
Gopher, on the other hand, is trained on MassiveText, a collection of large English-language text datasets drawn from web pages, books, news articles, and code. DeepMind says the pipeline includes text quality filtering, removal of repetitious text, deduplication of similar documents, and removal of documents with significant test-set overlap, and found that successive stages of this pipeline improve the language model's downstream performance. MassiveText contains 2.35 billion documents, or about 10.5 TB of text.
Difference in their dataset and architecture
Feedforward network replaced with MoE layer
As per Google, for GLaM they built a high-quality 1.6-trillion-token dataset whose language usage is representative of a diverse range of downstream use cases for the model. Web pages contain a large amount of unlabelled data, with quality ranging from low-quality comments to professional writing.
Google AI developed a text quality filter, trained on a collection of text from Wikipedia and books, that scores the quality of a webpage's content. They applied this filter to select the final subset of web pages, which was then combined with books and Wikipedia to create the final training dataset.
Google says that they replaced the single feedforward network with an MoE layer. Even though this MoE layer has many more parameters, the experts are sparsely activated. For a given input token, only two experts are used, giving the model more capacity while limiting computation.
The final learned representation of a token is the weighted combination of the outputs from the two experts. This allows different experts to activate on different types of inputs.
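The top-2 routing described above can be sketched as follows. This is an illustrative simplification rather than Google's implementation – the gating weights, expert shapes, and names (`gate_w`, `experts`) are all assumptions made for the example.

```python
import numpy as np

def top2_moe_layer(x, gate_w, experts):
    """Route a token representation through the top-2 of N experts.

    x:       (d,) token representation
    gate_w:  (d, n_experts) gating weights (hypothetical)
    experts: list of callables, each mapping (d,) -> (d,)
    """
    logits = x @ gate_w                       # one gating score per expert
    top2 = np.argsort(logits)[-2:]            # indices of the two best experts
    # softmax over just the two selected experts' scores
    w = np.exp(logits[top2] - logits[top2].max())
    w /= w.sum()
    # weighted combination of the two experts' outputs
    return w[0] * experts[top2[0]](x) + w[1] * experts[top2[1]](x)

# toy usage: 4 experts, 8-dimensional token representations
rng = np.random.default_rng(0)
d, n = 8, 4
gate_w = rng.normal(size=(d, n))
experts = [lambda x, W=rng.normal(size=(d, d)): W @ x for _ in range(n)]
y = top2_moe_layer(rng.normal(size=d), gate_w, experts)
```

Because only two of the `n` experts run per token, the cost per prediction stays roughly constant as more experts (and hence more parameters) are added – which is the capacity-versus-computation trade-off Google describes.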
|Image: Google (The architecture of GLaM where each input token is dynamically routed to two selected expert networks out of 64 for prediction)|
Content filtering, text extraction, quality filtering, and repetition removal
In the MassiveText subsets, non-English documents are filtered out and the data is processed into a homogeneous text-only format. The documents are deduplicated, and the research team filtered out documents too similar to those in their test sets.
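Deduplication of similar documents is commonly done by comparing sets of word n-grams. A minimal sketch of the idea – a naive pairwise Jaccard check, not DeepMind's actual (much more scalable) pipeline, with the threshold chosen arbitrarily – might look like:

```python
def ngrams(text, n=3):
    """Set of word n-grams for a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two n-gram sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def dedup(docs, threshold=0.8):
    """Keep each document only if it is not too similar to one already kept."""
    kept, kept_grams = [], []
    for doc in docs:
        g = ngrams(doc)
        if all(jaccard(g, k) < threshold for k in kept_grams):
            kept.append(doc)
            kept_grams.append(g)
    return kept

docs = [
    "the cat sat on the mat today",
    "the cat sat on the mat today again",   # near-duplicate of the first
    "completely different text about language models",
]
unique = dedup(docs)
```

At MassiveText scale, an exact pairwise comparison like this is infeasible; sketching techniques such as MinHash are typically used to approximate the same similarity test.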
For MassiveWeb, the web subset they curated, the team obtained data in a text-only format using a custom HTML scraper, applied an extra filter to remove explicit content at the initial stages, and then applied simple heuristics to filter out low-quality text.
|Image: DeepMind (Data processing stages, from Scaling Language Models: Methods, Analysis & Insights from Training Gopher)|
- Content Filtering – Non-English documents are filtered out. Pages from MassiveWeb that do not pass Google’s SafeSearch filter are also removed.
- Text Extraction (MassiveWeb) – Text is extracted from web pages using the tree structure of the HTML markup. DeepMind said, “We extract text from web pages using the tree structure of the HTML markup. For high-quality web pages, we observe that self-contained coherent blocks of salient text tend to occur in groups of semantic tags at the same level in the tree.” These tags are converted to plain text, yielding a huge volume of text documents.
- Quality Filtering (MassiveWeb) – A huge chunk of the web is social media content, which can lack context and be of low quality. Using simple heuristic filters, DeepMind removes any document that does not contain between 50 and 100,000 words, whose mean word length falls outside the range of 3 to 10 characters, or whose symbol-to-word ratio exceeds 0.1 for either the hash symbol or the ellipsis.
They also remove any document with more than 90% of lines starting with a bullet point or more than 30% ending with an ellipsis, and require that 80% of words in a document contain at least one alphabetic character.
- Repetition Removal (MassiveWeb) – They remove documents with a high proportion of repeated lines, paragraphs, or n-grams. They also remove documents containing many short duplicate passages and ones with fewer, larger sections of duplicate content.
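The quality-filtering heuristics listed above translate almost directly into code. Here is a rough sketch of a per-document filter – the thresholds come from the description above, but the function name, the choice of `-` as the bullet character, and the exact tokenization are assumptions for illustration:

```python
def passes_quality_filters(doc: str) -> bool:
    """Apply Gopher-style heuristic quality filters to one document (illustrative)."""
    lines = doc.splitlines()
    words = doc.split()
    if not words or not lines:
        return False
    # keep documents containing between 50 and 100,000 words
    if not 50 <= len(words) <= 100_000:
        return False
    # mean word length must fall within 3 to 10 characters
    mean_len = sum(len(w) for w in words) / len(words)
    if not 3 <= mean_len <= 10:
        return False
    # symbol-to-word ratio of at most 0.1 for '#' and '...'
    for symbol in ("#", "..."):
        if doc.count(symbol) / len(words) > 0.1:
            return False
    # at most 90% of lines may start with a bullet point
    if sum(l.lstrip().startswith("-") for l in lines) / len(lines) > 0.9:
        return False
    # at most 30% of lines may end with an ellipsis
    if sum(l.rstrip().endswith("...") for l in lines) / len(lines) > 0.3:
        return False
    # at least 80% of words must contain an alphabetic character
    alpha = sum(any(c.isalpha() for c in w) for w in words)
    return alpha / len(words) >= 0.8
```

Each rule is cheap to evaluate, which matters when the same checks must run over billions of documents.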
Gopher, GLaM vs Other Large Language Models
DeepMind’s research goes on to say that Gopher almost halves the accuracy gap from GPT-3 to human expert performance and exceeds forecaster expectations. Gopher lifts performance over current state-of-the-art language models across roughly 81% of tasks with comparable results, most notably in knowledge-intensive domains like fact-checking and general knowledge.
Gopher does not outperform the state of the art on 8 of 19 language-modelling tasks, under-performing on Ubuntu IRC and DM Mathematics in particular, possibly due to a poor tokenizer representation of numbers. It demonstrates improved modelling on 11 of 19 tasks, particularly books and articles, which may be due to the heavy use of book data in MassiveText (a sampling proportion of 27%, compared with 16% in GPT-3).
In the research paper, DeepMind draws a comparison between Gopher and existing models, saying that Gopher outperforms the current state of the art on 100 tasks (81% of all tasks). The baseline models include large language models such as GPT-3 (175 billion parameters), Jurassic-1 (178 billion parameters), and Megatron-Turing NLG (530 billion parameters). Gopher showed the most uniform improvement across the reading comprehension, humanities, ethics, STEM, and medicine categories, along with a general improvement in fact-checking.
GLaM’s performance compares favourably to GPT-3 (175B), with significantly improved learning efficiency across 29 public NLP benchmarks in seven categories. This covers language completion, open-domain question answering, and natural language inference tasks.
Compared with the Megatron-Turing model, GLaM is on par across the seven task categories within a 5% margin, while using 5x less computation during inference.
|Image: Google (Average score for GLaM and GPT-3 on NLG (left) and NLU (right) tasks; higher is better)|
With the two models from two big names coming out back to back, it will be interesting to see how they perform when deployed across sectors. Competitors are surely taking notice, and we can expect more such large language models in the near future.